FMiFood: Multi-modal Contrastive Learning for Food Image Classification
- URL: http://arxiv.org/abs/2408.03922v1
- Date: Wed, 7 Aug 2024 17:29:19 GMT
- Title: FMiFood: Multi-modal Contrastive Learning for Food Image Classification
- Authors: Xinyue Pan, Jiangpeng He, Fengqing Zhu
- Abstract summary: We introduce a novel multi-modal contrastive learning framework called FMiFood to learn more discriminative features.
Specifically, we propose a flexible matching technique that improves the similarity matching between text and image embeddings.
Our method demonstrates improved performance on both the UPMC-101 and VFN datasets compared to existing methods.
- Score: 8.019925729254178
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Food image classification is the fundamental step in image-based dietary assessment, which aims to estimate participants' nutrient intake from eating occasion images. A common challenge of food images is intra-class diversity and inter-class similarity, which can significantly hinder classification performance. To address this issue, we introduce a novel multi-modal contrastive learning framework called FMiFood, which learns more discriminative features by integrating additional contextual information, such as food category text descriptions, to enhance classification accuracy. Specifically, we propose a flexible matching technique that improves the similarity matching between text and image embeddings to focus on multiple pieces of key information. Furthermore, we incorporate the classification objectives into the framework and explore the use of GPT-4 to enrich the text descriptions and provide more detailed context. Our method demonstrates improved performance on both the UPMC-101 and VFN datasets compared to existing methods.
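As a rough illustration of the flexible-matching idea, the sketch below scores each text token against its best-matching image patch and averages the result, FILIP-style, inside a standard symmetric InfoNCE loss. The function names, tensor shapes, and the max-over-patches rule are assumptions for illustration, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def flexible_matching_logits(img_tokens, txt_tokens, temperature=0.07):
    """img_tokens: (B, P, D) image patch embeddings.
    txt_tokens: (B, T, D) text token embeddings.
    Returns a (B, B) image-to-text similarity matrix."""
    img = F.normalize(img_tokens, dim=-1)
    txt = F.normalize(txt_tokens, dim=-1)
    # sim[i, j, t, p] = cosine similarity between patch p of image i
    # and token t of caption j.
    sim = torch.einsum("ipd,jtd->ijtp", img, txt)
    # Each text token is matched to its best image patch (max over p),
    # then token scores are averaged (mean over t), so several pieces
    # of key information can each find their own visual evidence.
    return sim.max(dim=-1).values.mean(dim=-1) / temperature

def contrastive_loss(img_tokens, txt_tokens):
    logits = flexible_matching_logits(img_tokens, txt_tokens)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric InfoNCE: match images to their paired texts and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In the paper this contrastive objective is combined with classification objectives; only the matching side is sketched here.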
Related papers
- OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation [43.65207396061584]
OVFoodSeg is a framework that enhances text embeddings with visual context.
The training process of OVFoodSeg is divided into two stages: the pre-training of FoodLearner and the subsequent learning phase for segmentation.
By addressing the deficiencies of previous models, OVFoodSeg demonstrates a significant improvement, achieving a 4.9% increase in mean Intersection over Union (mIoU) on the FoodSeg103 dataset.
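A generic sketch of how class-name text embeddings can be enriched with visual context via cross-attention; OVFoodSeg's actual FoodLearner is more elaborate, and the class and shape names here are illustrative only:

```python
import torch
import torch.nn as nn

class ImageInformedText(nn.Module):
    """Enrich per-class text embeddings with visual context by letting
    them attend over image feature tokens (a generic sketch)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_emb, image_feats):
        # text_emb: (B, C, D), one embedding per class name
        # image_feats: (B, N, D), flattened image feature tokens
        attended, _ = self.attn(query=text_emb, key=image_feats,
                                value=image_feats)
        # Residual update keeps the original text semantics.
        return self.norm(text_emb + attended)
```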
arXiv Detail & Related papers (2024-04-01T18:26:29Z)
- Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model [80.61157097223058]
A prevalent strategy to bolster image classification performance is through augmenting the training set with synthetic images generated by T2I models.
In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques.
We introduce an innovative inter-class data augmentation method known as Diff-Mix, which enriches the dataset by performing image translations between classes.
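One way to read the inter-class mixing: an image is translated from its source class toward a target class by a diffusion model at some edit strength, and its label is softened accordingly. The snippet below sketches only the label side, Mixup-style; the weighting rule and names are assumptions for illustration:

```python
import torch

def diff_mix_label(src_class, tgt_class, strength, num_classes):
    """Soft label for an inter-class image translation: the translated
    image keeps (1 - strength) of the source label and gains `strength`
    of the target label. The translation itself would come from an
    image-to-image diffusion pipeline (not shown)."""
    y = torch.zeros(num_classes)
    y[src_class] = 1.0 - strength
    y[tgt_class] = strength
    return y
```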
arXiv Detail & Related papers (2024-03-28T17:23:45Z)
- Muti-Stage Hierarchical Food Classification [9.013592803864086]
We propose a multi-stage hierarchical framework for food item classification by iteratively clustering and merging food items during the training process.
Our method is evaluated on the VFN-nutrient dataset and achieves promising results compared with existing work in terms of both food type and food item classification.
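A minimal sketch of the cluster-and-merge step, assuming food items are grouped by the similarity of their mean visual embeddings with k-means; the paper's actual clustering criterion and iteration schedule may differ:

```python
import numpy as np
from sklearn.cluster import KMeans

def merge_similar_classes(class_embeddings, num_clusters):
    """class_embeddings: (num_classes, D) mean feature per food item.
    Returns a mapping from fine class id to merged cluster id, which
    can define the coarser stage of a multi-stage hierarchy."""
    km = KMeans(n_clusters=num_clusters, n_init=10).fit(class_embeddings)
    return {c: int(k) for c, k in enumerate(km.labels_)}
```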
arXiv Detail & Related papers (2023-09-03T04:45:44Z)
- Food Classification using Joint Representation of Visual and Textual Data [45.94375447042821]
We propose a multimodal classification framework that uses the modified version of EfficientNet with the Mish activation function for image classification.
The proposed network and the other state-of-the-art methods are evaluated on a large open-source dataset, UPMC Food-101.
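For reference, the Mish activation mentioned above is x * tanh(softplus(x)); a minimal PyTorch version is shown below (recent PyTorch also ships torch.nn.Mish):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    """Mish activation: x * tanh(softplus(x)), a smooth, non-monotonic
    alternative to ReLU/Swish used here in place of EfficientNet's default."""
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))
```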
arXiv Detail & Related papers (2023-08-03T04:03:46Z)
- Transferring Knowledge for Food Image Segmentation using Transformers and Convolutions [65.50975507723827]
Food image segmentation is an important task that has ubiquitous applications, such as estimating the nutritional value of a plate of food.
One challenge is that food items can overlap and mix, making them difficult to distinguish.
Two models are trained and compared, one based on convolutional neural networks and the other on Bidirectional Encoder representation from Image Transformers (BEiT).
The BEiT model outperforms the previous state-of-the-art model by achieving a mean intersection over union of 49.4 on FoodSeg103.
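For context, mean Intersection over Union averages per-class IoU over the classes present; a small NumPy sketch of the metric behind the 49.4 figure:

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """pred/target: integer label maps of the same shape.
    Per-class IoU is averaged over classes that actually occur."""
    ious = []
    valid = target != ignore_index
    for c in range(num_classes):
        p = (pred == c) & valid
        t = (target == c) & valid
        union = np.logical_or(p, t).sum()
        if union > 0:
            ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))
```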
arXiv Detail & Related papers (2023-06-15T15:38:10Z)
- EAML: Ensemble Self-Attention-based Mutual Learning Network for Document Image Classification [1.1470070927586016]
We design a self-attention-based fusion module that serves as a block in our ensemble trainable network.
It allows the network to simultaneously learn discriminative features of the image and text modalities throughout the training stage.
This is the first work to leverage a mutual learning approach together with a self-attention-based fusion module for document image classification.
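A generic sketch of a self-attention-based fusion block over image and text embeddings; EAML's actual module and the accompanying mutual-learning objective are more involved, and all names here are illustrative:

```python
import torch
import torch.nn as nn

class SelfAttentionFusion(nn.Module):
    """Fuse image and text features by self-attention over the stacked
    modality tokens, then pool into a joint representation."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, img_feat, txt_feat):
        # img_feat, txt_feat: (B, D) per-modality embeddings
        tokens = torch.stack([img_feat, txt_feat], dim=1)  # (B, 2, D)
        fused, _ = self.attn(tokens, tokens, tokens)
        return fused.mean(dim=1)  # (B, D) joint feature
```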
arXiv Detail & Related papers (2023-05-11T16:05:03Z)
- Towards the Creation of a Nutrition and Food Group Based Image Database [58.429385707376554]
We propose a framework to create a nutrition and food group based image database.
We design a protocol for linking food group based food codes in the U.S. Department of Agriculture's (USDA) Food and Nutrient Database for Dietary Studies (FNDDS) to food images.
Our proposed method is used to build a nutrition and food group based image database comprising 16,114 food images.
arXiv Detail & Related papers (2022-06-05T02:41:44Z)
- Improving Dietary Assessment Via Integrated Hierarchy Food Classification [7.398060062678395]
We introduce a new food classification framework to improve the quality of predictions by integrating the information from multiple domains.
Our method is validated on the VIPER-FoodNet (VFN) food image dataset, modified to include associated energy and nutrient information.
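One plausible way to integrate hierarchy information is to add a food-type loss derived from the food-item predictions via an item-to-type lookup table; the sketch below is an assumption about the general mechanism, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def hierarchical_loss(item_logits, item_target, item_to_type, alpha=0.5):
    """item_logits: (B, num_items); item_target: (B,) item labels;
    item_to_type: LongTensor of length num_items mapping each food item
    to its food type. Type probabilities are obtained by summing item
    probabilities within each type."""
    num_types = int(item_to_type.max()) + 1
    type_probs = torch.zeros(
        item_logits.size(0), num_types, device=item_logits.device
    ).index_add_(1, item_to_type, item_logits.softmax(dim=1))
    type_target = item_to_type[item_target]
    # Item-level cross-entropy plus a type-level term on aggregated probs.
    return F.cross_entropy(item_logits, item_target) + \
        alpha * F.nll_loss(type_probs.log(), type_target)
```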
arXiv Detail & Related papers (2021-09-06T20:59:58Z)
- Multi-Label Image Classification with Contrastive Learning [57.47567461616912]
We show that a direct application of contrastive learning hardly improves performance in multi-label cases.
We propose a novel framework for multi-label classification with contrastive learning in a fully supervised setting.
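A hedged sketch of one way to adapt supervised contrastive learning to multi-label data: weight positive pairs by label overlap (Jaccard similarity) rather than exact label equality. This illustrates the general direction only; the paper's framework differs in detail:

```python
import torch
import torch.nn.functional as F

def multilabel_supcon(feats, labels, temperature=0.1):
    """feats: (B, D) embeddings; labels: (B, C) multi-hot label matrix.
    Images sharing more labels get larger positive-pair weights."""
    z = F.normalize(feats, dim=-1)
    B = z.size(0)
    self_mask = torch.eye(B, dtype=torch.bool, device=z.device)
    sim = z @ z.t() / temperature
    # Log-softmax over all other samples (self excluded from the partition).
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True)
    labels = labels.float()
    inter = labels @ labels.t()
    union = labels.sum(1, keepdim=True) + labels.sum(1) - inter
    w = (inter / union.clamp(min=1)).masked_fill(self_mask, 0)  # Jaccard
    return -(w * log_prob).sum(1).div(w.sum(1).clamp(min=1e-6)).mean()
```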
arXiv Detail & Related papers (2021-07-24T15:00:47Z)
- A Large-Scale Benchmark for Food Image Segmentation [62.28029856051079]
We build a new food image dataset FoodSeg103 (and its extension FoodSeg154) containing 9,490 images.
We annotate these images with 154 ingredient classes and each image has an average of 6 ingredient labels and pixel-wise masks.
We propose a multi-modality pre-training approach called ReLeM that explicitly equips a segmentation model with rich and semantic food knowledge.
arXiv Detail & Related papers (2021-05-12T03:00:07Z)
- Cross-Modal Food Retrieval: Learning a Joint Embedding of Food Images and Recipes with Semantic Consistency and Attention Mechanism [70.85894675131624]
We learn an embedding of images and recipes in a common feature space, such that the corresponding image-recipe embeddings lie close to one another.
We propose Semantic-Consistent and Attention-based Networks (SCAN), which regularize the embeddings of the two modalities through aligning output semantic probabilities.
We show that we can outperform several state-of-the-art cross-modal retrieval strategies for food images and cooking recipes by a significant margin.
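The semantic-consistency idea can be sketched as aligning the class-probability outputs of the image and recipe branches; a symmetric KL form is assumed here for illustration, and may differ from SCAN's exact regularizer:

```python
import torch.nn.functional as F

def semantic_consistency_loss(img_logits, rec_logits):
    """Encourage the image branch and the recipe branch to predict the
    same semantic (class) distribution for a matching pair."""
    p = F.log_softmax(img_logits, dim=1)
    q = F.log_softmax(rec_logits, dim=1)
    return 0.5 * (F.kl_div(p, q, log_target=True, reduction="batchmean") +
                  F.kl_div(q, p, log_target=True, reduction="batchmean"))
```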
arXiv Detail & Related papers (2020-03-09T07:41:17Z)