MALM: Mask Augmentation based Local Matching for Food-Recipe Retrieval
- URL: http://arxiv.org/abs/2305.11327v1
- Date: Thu, 18 May 2023 22:25:50 GMT
- Title: MALM: Mask Augmentation based Local Matching for Food-Recipe Retrieval
- Authors: Bhanu Prakash Voutharoja and Peng Wang and Lei Wang and Vivienne Guan
- Abstract summary: We propose a mask-augmentation-based local matching network (MALM) for image-to-recipe retrieval.
Experimental results on the Recipe1M dataset show that our method clearly outperforms state-of-the-art (SOTA) methods.
- Score: 6.582204441933583
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-to-recipe retrieval is a challenging vision-to-language task of
significant practical value. The main challenge of the task lies in the
ultra-high redundancy of long recipes and the large variation in both food item
combination and food item appearance. The de facto approach to this task is to
learn a shared feature embedding space in which a food image is aligned more
closely with its paired recipe than with other recipes. However, such
supervised global matching is prone to supervision collapse: only the partial
information needed to distinguish training pairs is identified, while other
information that is potentially useful for generalization is lost. To mitigate
this problem, we propose a mask-augmentation-based local matching network
(MALM), in which an image-text matching module and a masked self-distillation
module mutually benefit each other to learn generalizable cross-modality
representations. On one hand, we perform local matching between the tokenized
representations of image and text to explicitly locate fine-grained
cross-modality correspondences. We include representations of masked image
patches in this process to alleviate the overfitting caused by local matching,
especially when some food items are underrepresented. On the other hand,
predicting the hidden representations of the masked patches through
self-distillation helps to learn general-purpose image representations that are
expected to generalize better. Moreover, the multi-task nature of the model
makes the masked-patch representations text-aware, which facilitates
reconstruction of the lost information. Experimental results on the Recipe1M
dataset show that our method clearly outperforms state-of-the-art (SOTA)
methods. Our code will be available at
https://github.com/MyFoodChoice/MALM_Mask_Augmentation_based_Local_Matching-_for-_Food_Recipe_Retrieval
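The abstract names two interacting objectives, local image-text token matching and masked self-distillation, but this summary gives no implementation details. The PyTorch sketch below is therefore only a hedged illustration of those two ideas; the function names, the InfoNCE-style batch loss, the smooth-L1 distillation target, and the loss weights are all assumptions, not the authors' code.

```python
# Hedged sketch of the two MALM ingredients named in the abstract.
# All shapes, loss forms, and weights are illustrative assumptions.
import torch
import torch.nn.functional as F

def local_matching_loss(img_tokens, txt_tokens, temperature=0.07):
    """Fine-grained matching: each text token is matched to its most
    similar image patch (max over patches), averaged over the text."""
    img = F.normalize(img_tokens, dim=-1)   # (B, P, D) patch tokens
    txt = F.normalize(txt_tokens, dim=-1)   # (B, T, D) word tokens
    # token-level similarity for every image-text pair in the batch
    sim = torch.einsum('bpd,ctd->bcpt', img, txt)             # (B, B, P, T)
    # max over patches, mean over words -> one score per pair
    scores = sim.max(dim=2).values.mean(dim=2) / temperature  # (B, B)
    labels = torch.arange(scores.size(0), device=scores.device)
    # symmetric InfoNCE over the batch of paired samples
    return 0.5 * (F.cross_entropy(scores, labels) +
                  F.cross_entropy(scores.t(), labels))

def masked_self_distillation_loss(student_tokens, teacher_tokens, mask):
    """Student predicts the (stop-gradient) teacher representation of
    the masked patches; mask is (B, P) with 1 where a patch is masked."""
    target = teacher_tokens.detach()          # teacher provides the targets
    diff = F.smooth_l1_loss(student_tokens, target, reduction='none')
    diff = diff.mean(dim=-1)                  # (B, P) per-patch error
    return (diff * mask).sum() / mask.sum().clamp(min=1)

def malm_loss(img_tokens, txt_tokens, student_tokens, teacher_tokens,
              mask, alpha=1.0, beta=1.0):
    # hypothetical combination; the real weighting is not in this summary
    return (alpha * local_matching_loss(img_tokens, txt_tokens) +
            beta * masked_self_distillation_loss(student_tokens,
                                                 teacher_tokens, mask))
```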
Related papers
- MaskInversion: Localized Embeddings via Optimization of Explainability Maps [49.50785637749757]
MaskInversion generates a context-aware embedding for a query image region specified by a mask at test time.
It can be used for a broad range of tasks, including open-vocabulary class retrieval, referring expression comprehension, as well as for localized captioning and image generation.
arXiv Detail & Related papers (2024-07-29T14:21:07Z)
- CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition [73.51329037954866]
We propose a robust global representation method with cross-image correlation awareness for visual place recognition.
Our method uses the attention mechanism to correlate multiple images within a batch.
Our method outperforms state-of-the-art methods by a large margin with significantly less training time.
arXiv Detail & Related papers (2024-02-29T15:05:11Z)
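The summary only says that attention correlates the images within a batch. One minimal way to realize that (the module layout and dimensions below are assumptions, not the paper's exact design) is to treat the batch of per-image descriptors as a single sequence and run self-attention over it:

```python
# Sketch of cross-image correlation via attention: the batch of
# per-image descriptors becomes one sequence, so every image can
# attend to every other image in the batch. Sizes are assumptions.
import torch
import torch.nn as nn

class CrossImageAttention(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, desc):          # desc: (B, D), one descriptor per image
        x = desc.unsqueeze(0)         # (1, B, D): batch axis becomes the sequence
        out, _ = self.attn(x, x, x)   # images attend to each other
        return self.norm(desc + out.squeeze(0))  # residual refinement, (B, D)

refine = CrossImageAttention()
refined = refine(torch.randn(16, 512))  # 16 place images correlated jointly
```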
- Improving fine-grained understanding in image-text pre-training [37.163228122323865]
We introduce SPARse Fine-grained Contrastive Alignment (SPARC), a simple method for pretraining more fine-grained multimodal representations from image-text pairs.
We show improved performance over competing approaches on both image-level tasks relying on coarse-grained information and region-level tasks requiring fine-grained information.
arXiv Detail & Related papers (2024-01-18T10:28:45Z)
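A simplified, hypothetical rendering of the fine-grained part of such pretraining: each word token receives a sparse weighting over image patches and is pulled toward the resulting "language-grouped" patch embedding. The thresholding rule and loss form below are assumptions, not SPARC's exact formulation.

```python
# Sketch of sparse fine-grained alignment between word tokens and
# image patches. Sparsification and loss are assumptions.
import torch
import torch.nn.functional as F

def fine_grained_alignment_loss(patches, words):
    # patches: (B, P, D) and words: (B, T, D) from paired image-text inputs
    p = F.normalize(patches, dim=-1)
    w = F.normalize(words, dim=-1)
    sim = torch.einsum('btd,bpd->btp', w, p)        # word-to-patch similarity
    # sparsify: keep only above-average similarities per word (assumption)
    thresh = sim.mean(dim=-1, keepdim=True)
    weights = torch.where(sim > thresh, sim, torch.zeros_like(sim))
    weights = weights / weights.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    grouped = F.normalize(torch.einsum('btp,bpd->btd', weights, p), dim=-1)
    # pull each word embedding toward its own patch grouping
    return (1 - (w * grouped).sum(dim=-1)).mean()
```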
- Text Augmented Spatial-aware Zero-shot Referring Image Segmentation [60.84423786769453]
We introduce a Text Augmented Spatial-aware (TAS) zero-shot referring image segmentation framework.
TAS incorporates a mask proposal network for instance-level mask extraction, a text-augmented visual-text matching score for mining the image-text correlation, and a spatial rectifier for mask post-processing.
The proposed method clearly outperforms state-of-the-art zero-shot referring image segmentation methods.
arXiv Detail & Related papers (2023-10-27T10:52:50Z)
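As a rough sketch of how mask proposals can be ranked by a visual-text matching score: `encode_image` below stands in for a pretrained vision-language encoder such as CLIP, and the plain cosine scoring is a placeholder assumption for the paper's fuller score (which also mines negative texts and applies the spatial rectifier):

```python
# Schematic ranking of instance mask proposals by text similarity.
# `encode_image` is a stand-in for a pretrained encoder (e.g. CLIP);
# the scoring here is a simplified assumption, not TAS's exact score.
import torch
import torch.nn.functional as F

def pick_mask(image, masks, text_emb, encode_image):
    # image: (3, H, W); masks: (M, H, W) binary proposals; text_emb: (D,)
    t = F.normalize(text_emb, dim=-1)
    scores = []
    for m in masks:
        crop = image * m.unsqueeze(0)               # keep only the masked region
        v = F.normalize(encode_image(crop), dim=-1) # (D,) region embedding
        scores.append(v @ t)                        # cosine matching score
    scores = torch.stack(scores)                    # (M,)
    return masks[scores.argmax()]                   # best-matching proposal
```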
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
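The Gumbel-Softmax trick is what lets gradients flow from the masked-image-modeling loss back into the mask generator. A minimal sketch of such a differentiable per-patch mask generator (the architecture and masking setup are assumptions, not AutoMAE's exact module):

```python
# Differentiable per-patch mask sampling with Gumbel-Softmax, the
# mechanism that lets an adversarially-trained generator be linked
# to masked image modeling. Layer sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelMaskGenerator(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.score = nn.Linear(dim, 2)   # per-patch logits: (keep, mask)

    def forward(self, patch_tokens, tau=1.0):
        logits = self.score(patch_tokens)                 # (B, P, 2)
        # hard one-hot samples with straight-through gradients
        sample = F.gumbel_softmax(logits, tau=tau, hard=True)
        return sample[..., 1]                             # (B, P), 1 = masked

gen = GumbelMaskGenerator()
mask = gen(torch.randn(4, 196, 768))  # differentiable choice of ViT patches to mask
```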
- Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning [17.42688184238741]
Cross-modal recipe retrieval has recently gained substantial attention due to the importance of food in people's lives.
We propose a simplified end-to-end model based on well established and high performing encoders for text and images.
Our proposed method achieves state-of-the-art performance in the cross-modal recipe retrieval task on the Recipe1M dataset.
arXiv Detail & Related papers (2021-03-24T10:17:09Z)
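A hedged sketch of a hierarchical transformer text encoder in this spirit: one transformer encodes the words of each sentence, and a second encodes the resulting sentence vectors into a recipe embedding. The sizes and mean-pooling choices are assumptions, not the paper's configuration.

```python
# Two-level transformer encoder: words -> sentence vectors -> recipe
# vector. Dimensions and pooling are illustrative assumptions.
import torch
import torch.nn as nn

class HierarchicalRecipeEncoder(nn.Module):
    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        word_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        sent_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.sent_enc = nn.TransformerEncoder(word_layer, layers)  # word level
        self.doc_enc = nn.TransformerEncoder(sent_layer, layers)   # sentence level

    def forward(self, word_emb):              # (B, S, W, D): S sentences, W words
        b, s, w, d = word_emb.shape
        toks = self.sent_enc(word_emb.reshape(b * s, w, d))
        sent = toks.mean(dim=1).reshape(b, s, d)  # pool words -> sentence vectors
        return self.doc_enc(sent).mean(dim=1)     # pool sentences -> recipe vector

enc = HierarchicalRecipeEncoder()
recipe_vec = enc(torch.randn(2, 10, 20, 512))  # 10 sentences of 20 tokens each
```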
- CHEF: Cross-modal Hierarchical Embeddings for Food Domain Retrieval [20.292467149387594]
We introduce a novel cross-modal learning framework to jointly model the latent representations of images and text in the food image-recipe association and retrieval tasks.
Our experiments show that by using an efficient tree-structured Long Short-Term Memory as the text encoder in our cross-modal retrieval framework, we can identify the main ingredients and cooking actions in recipe descriptions without explicit supervision.
arXiv Detail & Related papers (2021-02-04T11:24:34Z)
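For reference, the cell below is a minimal Child-Sum Tree-LSTM (Tai et al., 2015), the kind of tree-structured LSTM named above; it is a generic cell, not CHEF's code, and the wiring to actual recipe parse trees is omitted.

```python
# Minimal Child-Sum Tree-LSTM cell: a node aggregates its children's
# hidden states by summation and keeps one forget gate per child.
import torch
import torch.nn as nn

class ChildSumTreeLSTMCell(nn.Module):
    def __init__(self, in_dim, h_dim):
        super().__init__()
        self.iou = nn.Linear(in_dim + h_dim, 3 * h_dim)  # input/output/update
        self.f = nn.Linear(in_dim + h_dim, h_dim)        # per-child forget gate

    def forward(self, x, child_h, child_c):
        # x: (D,) node input; child_h, child_c: (K, H) for K children
        h_sum = child_h.sum(dim=0)                       # child-sum aggregation
        i, o, u = self.iou(torch.cat([x, h_sum])).chunk(3)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f = torch.sigmoid(self.f(torch.cat(
            [x.expand(child_h.size(0), -1), child_h], dim=1)))
        c = i * u + (f * child_c).sum(dim=0)             # gated child memories
        h = o * torch.tanh(c)
        return h, c
```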
- Seed the Views: Hierarchical Semantic Alignment for Contrastive Representation Learning [116.91819311885166]
We propose a hierarchical semantic alignment strategy that expands the views generated by a single image to cross-samples and multi-level representations.
Our method, termed CsMl, integrates multi-level visual representations across samples in a robust way.
arXiv Detail & Related papers (2020-12-04T17:26:24Z)
- Cross-Modal Food Retrieval: Learning a Joint Embedding of Food Images and Recipes with Semantic Consistency and Attention Mechanism [70.85894675131624]
We learn an embedding of images and recipes in a common feature space, such that the corresponding image-recipe embeddings lie close to one another.
We propose Semantic-Consistent and Attention-based Networks (SCAN), which regularize the embeddings of the two modalities through aligning output semantic probabilities.
We show that we can outperform several state-of-the-art cross-modal retrieval strategies for food images and cooking recipes by a significant margin.
arXiv Detail & Related papers (2020-03-09T07:41:17Z)
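A hedged rendering of "aligning output semantic probabilities": each modality's embedding predicts a distribution over semantic classes (e.g. ingredient categories), and the two predictions are pulled together. The shared classifier and the symmetric-KL form below are assumptions rather than SCAN's exact regularizer.

```python
# Sketch of a semantic-consistency regularizer: image and recipe
# embeddings predict class distributions that are aligned with a
# symmetric KL term. Classifier sharing and sizes are assumptions.
import torch
import torch.nn.functional as F

def semantic_consistency_loss(img_emb, txt_emb, classifier):
    img_logp = F.log_softmax(classifier(img_emb), dim=-1)
    txt_logp = F.log_softmax(classifier(txt_emb), dim=-1)
    # symmetric KL between the two modalities' semantic predictions
    kl_it = F.kl_div(img_logp, txt_logp.exp(), reduction='batchmean')
    kl_ti = F.kl_div(txt_logp, img_logp.exp(), reduction='batchmean')
    return 0.5 * (kl_it + kl_ti)

classifier = torch.nn.Linear(512, 1000)  # hypothetical semantic vocabulary size
loss = semantic_consistency_loss(torch.randn(8, 512),
                                 torch.randn(8, 512), classifier)
```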
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.