Food computing: Domain adaptation and causal inference

Bibliographic Details
Main Author: WANG, Qing
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2024
Online Access:https://ink.library.smu.edu.sg/etd_coll/663
https://ink.library.smu.edu.sg/context/etd_coll/article/1661/viewcontent/GPIS_AY2020_PhD_WangQing.pdf
Institution: Singapore Management University
Description
Summary: This dissertation addresses two challenges in food computing: food recognition and food image-to-recipe retrieval. The main research ideas are: (1) leveraging Large Language Models (LLMs) to augment food image representations, mitigating the combined challenges of domain gaps and data imbalance in fine-grained food recognition; (2) proposing a causal-theory-inspired cross-modal representation learning formulation that reduces the bias caused by the emphasis on certain ingredients in cross-modal recipe retrieval; and (3) extending the framework to incorporate multiple confounding factors, particularly ingredients and cooking actions, allowing for more comprehensive modeling of the food image-to-recipe retrieval problem.

We first explore the challenges present in real-world food datasets, such as domain gaps, imbalanced data distributions, and subtle visual similarities between dishes. Specifically, food images, typically crawled from the Internet, differ visually from those captured by users in free-living environments. In addition to this domain-shift issue, real-world food datasets often exhibit long-tailed distributions, where certain food categories are under-represented. Moreover, dishes from different categories may have only subtle variations that are difficult to distinguish visually. To address these challenges in food recognition, a framework empowered by LLMs is proposed. First, an LLM is leveraged to parse food images, generating both food titles and ingredient lists. With the LLM-generated text as a “bridge” between domains, cross-modal alignment can effectively bring cross-domain visual features closer by aligning them with their textual counterparts, overcoming the domain gap. Additionally, the generated ingredient lists capture minor components, making it easier to distinguish fine-grained categories. Finally, the image features are augmented with textual embeddings for recognition. The textual modality alleviates the under-representation of embeddings caused by insufficient image samples, especially in the tail classes. Using this straightforward LLM-based augmentation, the dissertation demonstrates that the proposed approach outperforms existing methods designed for long-tailed data distributions, domain adaptation, and fine-grained classification on two food datasets.
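To make the recognition framework above concrete, the following is a minimal sketch, assuming a generic PyTorch setup. The placeholder encoders, feature dimensions, and fusion-by-concatenation step are illustrative assumptions rather than the dissertation's exact architecture, and the LLM call that produces the dish title and ingredient list is represented only by a pre-computed text embedding.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LLMAugmentedFoodClassifier(nn.Module):
        """Sketch: fuse image features with embeddings of LLM-generated text
        (dish title + ingredient list) before classification."""

        def __init__(self, img_dim=512, txt_dim=512, num_classes=101):
            super().__init__()
            # Placeholder encoders; in practice a pretrained vision backbone
            # and a pretrained text encoder would be used here.
            self.image_encoder = nn.Linear(3 * 224 * 224, img_dim)
            self.text_encoder = nn.Linear(768, txt_dim)
            self.classifier = nn.Linear(img_dim + txt_dim, num_classes)

        def forward(self, image, text_feat):
            v = F.normalize(self.image_encoder(image.flatten(1)), dim=-1)
            t = F.normalize(self.text_encoder(text_feat), dim=-1)
            # Cross-modal alignment: pull each image feature toward the embedding
            # of its LLM-generated description, so images from different domains
            # meet in a shared, text-anchored space.
            align_loss = (1.0 - (v * t).sum(dim=-1)).mean()
            # Augmentation: the classifier sees the visual and textual views jointly.
            logits = self.classifier(torch.cat([v, t], dim=-1))
            return logits, align_loss

    # Hypothetical usage, with random tensors standing in for an image batch and
    # for the embedded output of an LLM asked to describe each dish.
    model = LLMAugmentedFoodClassifier()
    images = torch.randn(4, 3, 224, 224)
    text_feats = torch.randn(4, 768)
    logits, align_loss = model(images, text_feats)
    loss = F.cross_entropy(logits, torch.randint(0, 101, (4,))) + align_loss

In this sketch the alignment term plays the role of the textual "bridge" between domains, while the concatenation supplies the textual augmentation that compensates for under-represented tail classes.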
Next, the challenge of biased representation learning in image-to-recipe retrieval is addressed. Based on the assumption that the visual appearance of the final dish captures all the details described in the corresponding recipe, existing approaches project food images and their associated recipes into a shared embedding space to maximize their pairwise similarity. However, a recipe narrates the entire cooking process, while a food image reflects only the final outcome. This disparity leads current methods to capture dominant visual-text alignments while overlooking subtle variations that are crucial for accurate retrieval. This bias in cross-modal representation learning is modeled from a causal perspective, identifying ingredients as one of the bias sources. Specifically, food images tend to visually represent the major ingredients, while minor ingredients such as seasonings and sauces may not be visible due to their size, occlusion, or image-capturing conditions. This inconsistency between a recipe and its paired image makes it challenging to learn image representations that account for all ingredient details. As a result, existing methods struggle to capture the subtle variations needed to differentiate between recipes of visually similar dishes.

To mitigate this bias, causal theory is applied to remove the spurious correlations introduced by ingredients. Using backdoor adjustment, a causal-informed equation is derived to address the potential bias from ingredients. A debiasing module is then implemented to approximate this equation; it is essentially a multi-label ingredient classifier that predicts the distribution of ingredients in an image to adjust the image representation during learning (see the sketch at the end of this summary).

In addition to ingredients, cooking actions also act as potential confounding factors in image-to-recipe retrieval. Recipes describe sequences of preparation steps (e.g., cutting and chopping) and cooking techniques (e.g., frying and grilling) that are not reflected in the final food image. This discrepancy, caused by the absence of the cooking process from the visual representation, is often overlooked by existing learning methods, making it difficult to distinguish between recipes for visually similar dishes prepared using different methods. The causal framework is extended to account for both ingredients and cooking actions as confounding factors, and backdoor adjustment is applied to mitigate the biases introduced by these elements. Whereas the ingredient debiasing module is implemented as a multi-label ingredient classifier, the causal-informed equation derived for actions indicates that ingredients and cooking actions are not independent; the action debiasing module must therefore model the conditional dependencies between actions and ingredients. Guided by this equation, a more comprehensive action debiasing module is proposed and implemented as a conditional text generator. This model takes food images and ingredients as input and generates a sequence of cooking actions to adjust the image representation so that it accurately reflects the cooking process.

The dissertation contributes to a deeper understanding of how food datasets impact recognition and retrieval performance. In particular, the causal perspective provides a theory-informed upper bound on image-to-recipe retrieval performance. Empirically, near-perfect retrieval results are achieved on the Recipe1M dataset, highlighting that causality-based representation learning is a promising approach for achieving high-recall retrieval. Moreover, debiasing the representation learning by considering multiple culinary elements together, such as ingredients and cooking actions, improves bias mitigation and retrieval performance.
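For orientation, backdoor adjustment in its general form computes P(r | do(v)) = sum over z of P(r | v, z) P(z), summing out a confounder z (here, ingredients or cooking actions) instead of conditioning only on whatever happens to co-occur with the image v; the dissertation's causal-informed equations instantiate this idea for retrieval. The sketch below is an assumed, simplified rendering of how an ingredient debiasing module of the kind described above could adjust an image representation; the module names, dimensions, and the additive adjustment are hypothetical.

    import torch
    import torch.nn as nn

    class IngredientDebias(nn.Module):
        """Sketch of a backdoor-style adjustment: a multi-label ingredient
        classifier predicts an ingredient distribution from the image feature,
        and that distribution re-weights a bank of ingredient embeddings used
        to adjust the image representation."""

        def __init__(self, feat_dim=512, num_ingredients=1000):
            super().__init__()
            self.ingredient_head = nn.Linear(feat_dim, num_ingredients)
            self.ingredient_bank = nn.Embedding(num_ingredients, feat_dim)

        def forward(self, img_feat):
            # Multi-label prediction of ingredient presence, normalised into a
            # distribution over the confounder.
            probs = torch.sigmoid(self.ingredient_head(img_feat))
            probs = probs / probs.sum(dim=-1, keepdim=True)
            # Expectation over the confounder: a weighted sum of ingredient
            # embeddings, loosely mirroring the sum over z in the backdoor formula.
            adjustment = probs @ self.ingredient_bank.weight
            return img_feat + adjustment, probs

    # Hypothetical usage: the adjusted feature would be matched against recipe
    # embeddings in the retrieval loss, while a binary cross-entropy loss on
    # `probs` against ground-truth ingredient labels supervises the classifier.
    debias = IngredientDebias()
    img_feat = torch.randn(8, 512)
    adjusted_feat, probs = debias(img_feat)

Extending this sketch to cooking actions would, per the description above, require modeling actions conditionally on ingredients (for example, with a conditional sequence decoder that takes the image feature and ingredients as input) rather than adding a second independent classifier.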