Related papers: Towards Unbiased Cross-Modal Representation Learning for Food Image-to-Recipe Retrieval

URL: http://arxiv.org/abs/2511.15201v1
Date: Wed, 19 Nov 2025 07:39:53 GMT
Title: Towards Unbiased Cross-Modal Representation Learning for Food Image-to-Recipe Retrieval
Authors: Qing Wang, Chong-Wah Ngo, Ee-Peng Lim
Abstract summary: This paper addresses the challenges of learning representations for recipes and food images in the cross-modal retrieval problem. As the relationship between a recipe and its cooked dish is cause-and-effect, treating a recipe as a text source will create bias misleading image-and-recipe similarity judgment. We propose a plug-and-play neural module, which is essentially a multi-label ingredient classifier for debiasing.
Score: 33.21317747745805
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper addresses the challenges of learning representations for recipes and food images in the cross-modal retrieval problem. As the relationship between a recipe and its cooked dish is cause-and-effect, treating a recipe as a text source describing the visual appearance of a dish for learning representation, as the existing approaches, will create bias misleading image-and-recipe similarity judgment. Specifically, a food image may not equally capture every detail in a recipe, due to factors such as the cooking process, dish presentation, and image-capturing conditions. The current representation learning tends to capture dominant visual-text alignment while overlooking subtle variations that determine retrieval relevance. In this paper, we model such bias in cross-modal representation learning using causal theory. The causal view of this problem suggests ingredients as one of the confounder sources and a simple backdoor adjustment can alleviate the bias. By causal intervention, we reformulate the conventional model for food-to-recipe retrieval with an additional term to remove the potential bias in similarity judgment. Based on this theory-informed formulation, we empirically prove the oracle performance of retrieval on the Recipe1M dataset to be MedR=1 across the testing data sizes of 1K, 10K, and even 50K. We also propose a plug-and-play neural module, which is essentially a multi-label ingredient classifier for debiasing. New state-of-the-art search performances are reported on the Recipe1M dataset.
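The abstract reports retrieval quality as MedR, the median rank of the true recipe among all candidates (lower is better, and 1 is ideal). As a minimal sketch, not the authors' evaluation code, MedR could be computed from paired embeddings as shown here. The function name `median_rank`, the choice of cosine similarity, and the convention that row i of each array is a matched image-recipe pair are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def median_rank(image_emb: np.ndarray, recipe_emb: np.ndarray) -> float:
    """Median rank (MedR) of the true recipe for each image query.

    Assumes (hypothetically) that image_emb[i] and recipe_emb[i] are a
    matched pair and that no embedding is the zero vector.
    """
    # L2-normalise rows so the dot product equals cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    rec = recipe_emb / np.linalg.norm(recipe_emb, axis=1, keepdims=True)

    # sims[i, j] is the similarity between image i and recipe j.
    sims = img @ rec.T

    # Similarity of each image to its own (true) recipe, as a column.
    true_sims = np.diag(sims)[:, None]

    # Rank = 1 + number of recipes scored strictly higher than the true one.
    ranks = 1 + (sims > true_sims).sum(axis=1)
    return float(np.median(ranks))
```

Calling `median_rank(image_emb, recipe_emb)` on two arrays of shape `(N, d)` returns a single number. The candidate pool size N matters: the abstract cites pools of 1K, 10K, and 50K, and a larger pool makes rank 1 harder to reach.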

Related papers

Mitigating Cross-modal Representation Bias for Multicultural Image-to-Recipe Retrieval [33.17028372962136]
Cross-modal representations to bridge the modality gap between images and recipes tend to ignore subtle recipe-specific details. This paper proposes a novel causal approach that predicts the culinary elements potentially overlooked in images. Experiments are conducted on the standard monolingual Recipe1M dataset and a newly curated multilingual multicultural cuisine dataset.
arXiv Detail & Related papers (2025-10-23T09:43:43Z)
Retrieval Augmented Recipe Generation [96.43285670458803]
We propose a retrieval augmented large multimodal model for recipe generation. It retrieves recipes semantically related to the image from an existing datastore as a supplement. It calculates the consistency among generated recipe candidates, which use different retrieval recipes as context for generation.
arXiv Detail & Related papers (2024-11-13T15:58:50Z)
NutritionVerse: Empirical Study of Various Dietary Intake Estimation Approaches [59.38343165508926]
Accurate dietary intake estimation is critical for informing policies and programs to support healthy eating. Recent work has focused on using computer vision and machine learning to automatically estimate dietary intake from food images. We introduce NutritionVerse-Synth, the first large-scale dataset of 84,984 synthetic 2D food images with associated dietary information. We also collect a real image dataset, NutritionVerse-Real, containing 889 images of 251 dishes to evaluate realism.
arXiv Detail & Related papers (2023-09-14T13:29:41Z)
Transferring Knowledge for Food Image Segmentation using Transformers and Convolutions [65.50975507723827]
Food image segmentation is an important task that has ubiquitous applications, such as estimating the nutritional value of a plate of food. One challenge is that food items can overlap and mix, making them difficult to distinguish. Two models are trained and compared, one based on convolutional neural networks and the other on Bidirectional Encoder representation from Image Transformers (BEiT). The BEiT model outperforms the previous state-of-the-art model by achieving a mean intersection over union of 49.4 on FoodSeg103.
arXiv Detail & Related papers (2023-06-15T15:38:10Z)
A Large-Scale Benchmark for Food Image Segmentation [62.28029856051079]
We build a new food image dataset FoodSeg103 (and its extension FoodSeg154) containing 9,490 images. We annotate these images with 154 ingredient classes and each image has an average of 6 ingredient labels and pixel-wise masks. We propose a multi-modality pre-training approach called ReLeM that explicitly equips a segmentation model with rich and semantic food knowledge.
arXiv Detail & Related papers (2021-05-12T03:00:07Z)
Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning [17.42688184238741]
Cross-modal recipe retrieval has recently gained substantial attention due to the importance of food in people's lives. We propose a simplified end-to-end model based on well established and high performing encoders for text and images. Our proposed method achieves state-of-the-art performance in the cross-modal recipe retrieval task on the Recipe1M dataset.
arXiv Detail & Related papers (2021-03-24T10:17:09Z)
Cross-modal Retrieval and Synthesis (X-MRS): Closing the modality gap in shared subspace [21.33710150033949]
We propose a simple yet novel architecture for shared subspace learning, which is used to tackle the food image-to-recipe retrieval problem. Experimental analysis on the public Recipe1M dataset shows that the subspace learned via the proposed method outperforms the current state of the art. In order to demonstrate the representational power of the learned subspace, we propose a generative food image synthesis model conditioned on the embeddings of recipes.
arXiv Detail & Related papers (2020-12-02T17:27:00Z)
Picture-to-Amount (PITA): Predicting Relative Ingredient Amounts from Food Images [24.26111169033236]
We study the novel and challenging problem of predicting the relative amount of each ingredient from a food image. We propose PITA, the Picture-to-Amount deep learning architecture to solve the problem. Experiments on a dataset of recipes collected from the Internet show the model generates promising results.
arXiv Detail & Related papers (2020-10-17T06:43:18Z)
Cross-Modal Food Retrieval: Learning a Joint Embedding of Food Images and Recipes with Semantic Consistency and Attention Mechanism [70.85894675131624]
We learn an embedding of images and recipes in a common feature space, such that the corresponding image-recipe embeddings lie close to one another. We propose Semantic-Consistent and Attention-based Networks (SCAN), which regularize the embeddings of the two modalities through aligning output semantic probabilities. We show that we can outperform several state-of-the-art cross-modal retrieval strategies for food images and cooking recipes by a significant margin.
arXiv Detail & Related papers (2020-03-09T07:41:17Z)

This list is automatically generated from the titles and abstracts of the papers on this site.