Cross-Modal Food Retrieval: Learning a Joint Embedding of Food Images
and Recipes with Semantic Consistency and Attention Mechanism
- URL: http://arxiv.org/abs/2003.03955v3
- Date: Tue, 21 Sep 2021 02:17:48 GMT
- Authors: Hao Wang, Doyen Sahoo, Chenghao Liu, Ke Shu, Palakorn Achananuparp,
Ee-peng Lim, Steven C. H. Hoi
- Abstract summary: We learn an embedding of images and recipes in a common feature space, such that the corresponding image-recipe embeddings lie close to one another.
We propose Semantic-Consistent and Attention-based Networks (SCAN), which regularize the embeddings of the two modalities through aligning output semantic probabilities.
We show that we can outperform several state-of-the-art cross-modal retrieval strategies for food images and cooking recipes by a significant margin.
- Score: 70.85894675131624
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Food retrieval is an important task to perform analysis of food-related
information, where we are interested in retrieving relevant information about
the queried food item such as ingredients, cooking instructions, etc. In this
paper, we investigate cross-modal retrieval between food images and cooking
recipes. The goal is to learn an embedding of images and recipes in a common
feature space, such that the corresponding image-recipe embeddings lie close to
one another. Two major challenges in addressing this problem are 1) large
intra-variance and small inter-variance across cross-modal food data; and 2)
difficulties in obtaining discriminative recipe representations. To address
these two problems, we propose Semantic-Consistent and Attention-based Networks
(SCAN), which regularize the embeddings of the two modalities through aligning
output semantic probabilities. In addition, we exploit a self-attention
mechanism to improve the embedding of recipes. We evaluate the performance of the
proposed method on the large-scale Recipe1M dataset, and show that we can
outperform several state-of-the-art cross-modal retrieval strategies for food
images and cooking recipes by a significant margin.
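The two ideas in the abstract, aligning the output semantic probabilities of the image and recipe branches, and applying self-attention to recipe representations, can be illustrated with a minimal NumPy sketch. The symmetric-KL form of the consistency loss and the Q = K = V simplification are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def semantic_consistency_loss(img_logits, rec_logits):
    """Align the two branches' class-probability outputs via symmetric KL
    divergence (an illustrative choice; the paper's exact loss may differ)."""
    p = softmax(img_logits)
    q = softmax(rec_logits)
    kl_pq = np.sum(p * (np.log(p) - np.log(q)))
    kl_qp = np.sum(q * (np.log(q) - np.log(p)))
    return 0.5 * (kl_pq + kl_qp)

def self_attention(tokens):
    """Scaled dot-product self-attention over recipe token embeddings,
    simplified with Q = K = V = tokens (no learned projections)."""
    d = tokens.shape[-1]
    scores = softmax(tokens @ tokens.T / np.sqrt(d))
    return scores @ tokens  # (num_tokens, d) re-weighted embeddings

rng = np.random.default_rng(0)
recipe_tokens = rng.normal(size=(5, 8))  # 5 recipe words, 8-dim embeddings
attended = self_attention(recipe_tokens)
# Identical semantic outputs from both branches incur zero consistency loss.
aligned = semantic_consistency_loss(np.array([1.0, 2.0, 0.5]),
                                    np.array([1.0, 2.0, 0.5]))
print(attended.shape, aligned)
```

In practice the logits would come from shared classifiers on top of each modality's encoder, so minimizing this loss pushes the two embeddings toward consistent semantic predictions.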
Related papers
- RoDE: Linear Rectified Mixture of Diverse Experts for Food Large Multi-Modal Models [96.43285670458803]
Uni-Food is a unified food dataset that comprises over 100,000 images with various food labels.
Uni-Food is designed to provide a more holistic approach to food data analysis.
We introduce a novel Linear Rectified Mixture of Diverse Experts (RoDE) approach to address the inherent challenges of food-related multitasking.
arXiv Detail & Related papers (2024-07-17T16:49:34Z)
- From Canteen Food to Daily Meals: Generalizing Food Recognition to More Practical Scenarios [92.58097090916166]
We present two new benchmarks, namely DailyFood-172 and DailyFood-16, designed to curate food images from everyday meals.
These two datasets are used to evaluate the transferability of approaches from the well-curated food image domain to the everyday-life food image domain.
arXiv Detail & Related papers (2024-03-12T08:32:23Z)
- NutritionVerse: Empirical Study of Various Dietary Intake Estimation Approaches [59.38343165508926]
Accurate dietary intake estimation is critical for informing policies and programs to support healthy eating.
Recent work has focused on using computer vision and machine learning to automatically estimate dietary intake from food images.
We introduce NutritionVerse-Synth, the first large-scale dataset of 84,984 synthetic 2D food images with associated dietary information.
We also collect a real image dataset, NutritionVerse-Real, containing 889 images of 251 dishes to evaluate realism.
arXiv Detail & Related papers (2023-09-14T13:29:41Z)
- Transferring Knowledge for Food Image Segmentation using Transformers and Convolutions [65.50975507723827]
Food image segmentation is an important task that has ubiquitous applications, such as estimating the nutritional value of a plate of food.
One challenge is that food items can overlap and mix, making them difficult to distinguish.
Two models are trained and compared: one based on convolutional neural networks, the other on Bidirectional Encoder representation from Image Transformers (BEiT).
The BEiT model outperforms the previous state-of-the-art model by achieving a mean intersection over union of 49.4 on FoodSeg103.
arXiv Detail & Related papers (2023-06-15T15:38:10Z)
- A Large-Scale Benchmark for Food Image Segmentation [62.28029856051079]
We build a new food image dataset FoodSeg103 (and its extension FoodSeg154) containing 9,490 images.
We annotate these images with 154 ingredient classes and each image has an average of 6 ingredient labels and pixel-wise masks.
We propose a multi-modality pre-training approach called ReLeM that explicitly equips a segmentation model with rich and semantic food knowledge.
arXiv Detail & Related papers (2021-05-12T03:00:07Z)
- Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning [17.42688184238741]
Cross-modal recipe retrieval has recently gained substantial attention due to the importance of food in people's lives.
We propose a simplified end-to-end model based on well-established, high-performing encoders for text and images.
Our proposed method achieves state-of-the-art performance in the cross-modal recipe retrieval task on the Recipe1M dataset.
arXiv Detail & Related papers (2021-03-24T10:17:09Z)
- CHEF: Cross-modal Hierarchical Embeddings for Food Domain Retrieval [20.292467149387594]
We introduce a novel cross-modal learning framework to jointly model the latent representations of images and text in the food image-recipe association and retrieval tasks.
Our experiments show that by making use of efficient tree-structured Long Short-Term Memory as the text encoder in our computational cross-modal retrieval framework, we are able to identify the main ingredients and cooking actions in the recipe descriptions without explicit supervision.
arXiv Detail & Related papers (2021-02-04T11:24:34Z)
- Cross-modal Retrieval and Synthesis (X-MRS): Closing the modality gap in shared subspace [21.33710150033949]
We propose a simple yet novel architecture for shared subspace learning, which is used to tackle the food image-to-recipe retrieval problem.
Experimental analysis on the public Recipe1M dataset shows that the subspace learned via the proposed method outperforms the current state of the art.
In order to demonstrate the representational power of the learned subspace, we propose a generative food image synthesis model conditioned on the embeddings of recipes.
arXiv Detail & Related papers (2020-12-02T17:27:00Z)
- MCEN: Bridging Cross-Modal Gap between Cooking Recipes and Dish Images with Latent Variable Model [28.649961369386148]
We present Modality-Consistent Embedding Network (MCEN) that learns modality-invariant representations by projecting images and texts to the same embedding space.
Our method learns the cross-modal alignments during training but computes embeddings of different modalities independently at inference time for the sake of efficiency.
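The inference-time pattern MCEN describes, computing each modality's embedding independently so a recipe database can be encoded once and searched efficiently, can be sketched as follows. The linear "encoders" are stand-ins for illustration, not MCEN's actual networks.

```python
import numpy as np

rng = np.random.default_rng(1)
W_img = rng.normal(size=(16, 4))  # stand-in image encoder (linear projection)
W_txt = rng.normal(size=(32, 4))  # stand-in recipe encoder (linear projection)

def embed(x, W):
    """Project a raw feature vector into the shared space and L2-normalize,
    so dot products are cosine similarities."""
    z = x @ W
    return z / np.linalg.norm(z)

# Each modality is encoded independently (no cross-modal interaction at
# inference), so the recipe database is embedded once, offline.
recipe_db = [embed(rng.normal(size=32), W_txt) for _ in range(10)]
query_img = embed(rng.normal(size=16), W_img)

# Retrieval reduces to nearest-neighbor search by cosine similarity.
sims = [float(query_img @ r) for r in recipe_db]
best = int(np.argmax(sims))
print(best, sims[best])
```

Because nothing in the query path touches the recipe encoder, new image queries only cost one forward pass plus a similarity search, which is the efficiency argument the abstract makes.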
arXiv Detail & Related papers (2020-04-02T16:00:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.