MCEN: Bridging Cross-Modal Gap between Cooking Recipes and Dish Images
with Latent Variable Model
- URL: http://arxiv.org/abs/2004.01095v1
- Date: Thu, 2 Apr 2020 16:00:10 GMT
- Title: MCEN: Bridging Cross-Modal Gap between Cooking Recipes and Dish Images
with Latent Variable Model
- Authors: Han Fu, Rui Wu, Chenghao Liu, Jianling Sun
- Abstract summary: We present Modality-Consistent Embedding Network (MCEN) that learns modality-invariant representations by projecting images and texts to the same embedding space.
Our method learns the cross-modal alignments during training but computes embeddings of different modalities independently at inference time for the sake of efficiency.
- Score: 28.649961369386148
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Nowadays, driven by the increasing concern on diet and health, food computing
has attracted enormous attention from both industry and research community. One
of the most popular research topics in this domain is Food Retrieval, due to
its profound influence on health-oriented applications. In this paper, we focus
on the task of cross-modal retrieval between food images and cooking recipes.
We present Modality-Consistent Embedding Network (MCEN) that learns
modality-invariant representations by projecting images and texts to the same
embedding space. To capture the latent alignments between modalities, we
incorporate stochastic latent variables to explicitly exploit the interactions
between textual and visual features. Importantly, our method learns the
cross-modal alignments during training but computes embeddings of different
modalities independently at inference time for the sake of efficiency.
Extensive experimental results clearly demonstrate that the proposed MCEN
outperforms all existing approaches on the benchmark Recipe1M dataset and
requires less computational cost.
Related papers
- RoDE: Linear Rectified Mixture of Diverse Experts for Food Large Multi-Modal Models [96.43285670458803]
Uni-Food is a unified food dataset that comprises over 100,000 images with various food labels.
Uni-Food is designed to provide a more holistic approach to food data analysis.
We introduce a novel Linear Rectification Mixture of Diverse Experts (RoDE) approach to address the inherent challenges of food-related multitasking.
arXiv Detail & Related papers (2024-07-17T16:49:34Z) - From Canteen Food to Daily Meals: Generalizing Food Recognition to More
Practical Scenarios [92.58097090916166]
We present two new benchmarks, namely DailyFood-172 and DailyFood-16, designed to curate food images from everyday meals.
These two datasets are used to evaluate the transferability of approaches from the well-curated food image domain to the everyday-life food image domain.
arXiv Detail & Related papers (2024-03-12T08:32:23Z) - Food Image Classification and Segmentation with Attention-based Multiple
Instance Learning [51.279800092581844]
The paper presents a weakly supervised methodology for training food image classification and semantic segmentation models.
The proposed methodology is based on a multiple instance learning approach in combination with an attention-based mechanism.
We conduct experiments on two meta-classes within the FoodSeg103 data set to verify the feasibility of the proposed approach.
arXiv Detail & Related papers (2023-08-22T13:59:47Z) - Transferring Knowledge for Food Image Segmentation using Transformers
and Convolutions [65.50975507723827]
Food image segmentation is an important task that has ubiquitous applications, such as estimating the nutritional value of a plate of food.
One challenge is that food items can overlap and mix, making them difficult to distinguish.
Two models are trained and compared, one based on convolutional neural networks and the other on Bidirectional representation for Image Transformers (BEiT)
The BEiT model outperforms the previous state-of-the-art model by achieving a mean intersection over union of 49.4 on FoodSeg103.
arXiv Detail & Related papers (2023-06-15T15:38:10Z) - Learning Relation Alignment for Calibrated Cross-modal Retrieval [52.760541762871505]
We propose a novel metric, Intra-modal Self-attention Distance (ISD), to quantify the relation consistency by measuring the semantic distance between linguistic and visual relations.
We present Inter-modal Alignment on Intra-modal Self-attentions (IAIS), a regularized training method to optimize the ISD and calibrate intra-modal self-attentions mutually via inter-modal alignment.
arXiv Detail & Related papers (2021-05-28T14:25:49Z) - Cross-modal Retrieval and Synthesis (X-MRS): Closing the modality gap in
shared subspace [21.33710150033949]
We propose a simple yet novel architecture for shared subspace learning, which is used to tackle the food image-to-recipe retrieval problem.
Experimental analysis on the public Recipe1M dataset shows that the subspace learned via the proposed method outperforms the current state-of-the-arts.
In order to demonstrate the representational power of the learned subspace, we propose a generative food image synthesis model conditioned on the embeddings of recipes.
arXiv Detail & Related papers (2020-12-02T17:27:00Z) - Cross-Modal Food Retrieval: Learning a Joint Embedding of Food Images
and Recipes with Semantic Consistency and Attention Mechanism [70.85894675131624]
We learn an embedding of images and recipes in a common feature space, such that the corresponding image-recipe embeddings lie close to one another.
We propose Semantic-Consistent and Attention-based Networks (SCAN), which regularize the embeddings of the two modalities through aligning output semantic probabilities.
We show that we can outperform several state-of-the-art cross-modal retrieval strategies for food images and cooking recipes by a significant margin.
arXiv Detail & Related papers (2020-03-09T07:41:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.