Transferring Knowledge for Food Image Segmentation using Transformers
and Convolutions
- URL: http://arxiv.org/abs/2306.09203v1
- Date: Thu, 15 Jun 2023 15:38:10 GMT
- Title: Transferring Knowledge for Food Image Segmentation using Transformers
and Convolutions
- Authors: Grant Sinha, Krish Parmar, Hilda Azimi, Amy Tai, Yuhao Chen, Alexander
Wong, Pengcheng Xi
- Abstract summary: Food image segmentation is an important task that has ubiquitous applications, such as estimating the nutritional value of a plate of food.
One challenge is that food items can overlap and mix, making them difficult to distinguish.
Two models are trained and compared, one based on convolutional neural networks and the other on Bidirectional Encoder representation for Image Transformers (BEiT).
The BEiT model outperforms the previous state-of-the-art model by achieving a mean intersection over union of 49.4 on FoodSeg103.
- Score: 65.50975507723827
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Food image segmentation is an important task that has ubiquitous
applications, such as estimating the nutritional value of a plate of food.
Although machine learning models have been used for segmentation in this
domain, food images pose several challenges. One challenge is that food items
can overlap and mix, making them difficult to distinguish. Another challenge is
the degree of inter-class similarity and intra-class variability, which is
caused by the varying preparation methods and dishes a food item may be served
in. Additionally, class imbalance is an inevitable issue in food datasets. To
address these issues, two models are trained and compared, one based on
convolutional neural networks and the other on Bidirectional Encoder
representation for Image Transformers (BEiT). The models are trained and
evaluated using the FoodSeg103 dataset, which is identified as a robust
benchmark for food image segmentation. The BEiT model outperforms the previous
state-of-the-art model by achieving a mean intersection over union of 49.4 on
FoodSeg103. This study provides insights into transferring knowledge using
convolution and Transformer-based approaches in the food image domain.
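The headline result above is reported as mean intersection over union (mIoU), the standard semantic-segmentation metric: the per-class IoU between predicted and ground-truth masks, averaged over the classes present in the evaluation set. The following is a minimal NumPy sketch of how such a score could be computed, not the authors' evaluation code; the class count of 104 (background plus 103 ingredients) and the ignore-index convention are assumptions made purely for illustration.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Compute per-class IoU and mIoU for integer-labelled segmentation masks.

    pred, target: arrays of shape (H, W) or (N, H, W) with class indices.
    Classes absent from both prediction and ground truth are skipped.
    """
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()

    # Drop pixels marked with the ignore label (e.g. unlabeled regions).
    keep = target != ignore_index
    pred, target = pred[keep], target[keep]

    ious = {}
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:  # class absent everywhere: skip it
            continue
        inter = np.logical_and(pred_c, target_c).sum()
        ious[c] = inter / union

    return ious, float(np.mean(list(ious.values())))

# Toy usage with random masks; 104 classes is an assumed label layout.
pred = np.random.randint(0, 104, size=(2, 64, 64))
target = np.random.randint(0, 104, size=(2, 64, 64))
per_class, miou = mean_iou(pred, target, num_classes=104)
print(f"mIoU over {len(per_class)} classes present: {miou:.3f}")
```

Averaging over classes rather than pixels is what makes the metric sensitive to the class-imbalance issue the abstract raises: rare ingredient classes weigh as much as frequent ones.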
Related papers
- OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation [43.65207396061584]
OVFoodSeg is a framework that enhances text embeddings with visual context.
The training process of OVFoodSeg is divided into two stages: the pre-training of FoodLearner and the subsequent learning phase for segmentation.
By addressing the deficiencies of previous models, OVFoodSeg demonstrates a significant improvement, achieving a 4.9% increase in mean Intersection over Union (mIoU) on the FoodSeg103 dataset.
arXiv Detail & Related papers (2024-04-01T18:26:29Z)
- From Canteen Food to Daily Meals: Generalizing Food Recognition to More Practical Scenarios [92.58097090916166]
We present two new benchmarks, namely DailyFood-172 and DailyFood-16, designed to curate food images from everyday meals.
These two datasets are used to evaluate the transferability of approaches from the well-curated food image domain to the everyday-life food image domain.
arXiv Detail & Related papers (2024-03-12T08:32:23Z)
- FoodFusion: A Latent Diffusion Model for Realistic Food Image Generation [69.91401809979709]
Current state-of-the-art image generation models such as Latent Diffusion Models (LDMs) have demonstrated the capacity to produce visually striking food-related images.
We introduce FoodFusion, a Latent Diffusion model engineered specifically for the faithful synthesis of realistic food images from textual descriptions.
The development of the FoodFusion model involves harnessing an extensive array of open-source food datasets, resulting in over 300,000 curated image-caption pairs.
arXiv Detail & Related papers (2023-12-06T15:07:12Z)
- Diffusion Model with Clustering-based Conditioning for Food Image Generation [22.154182296023404]
Deep learning-based techniques are commonly used to perform image analysis such as food classification, segmentation, and portion size estimation.
One potential solution is to use synthetic food images for data augmentation.
In this paper, we propose an effective clustering-based training framework, named ClusDiff, for generating high-quality and representative food images.
arXiv Detail & Related papers (2023-09-01T01:40:39Z)
- Food Image Classification and Segmentation with Attention-based Multiple Instance Learning [51.279800092581844]
The paper presents a weakly supervised methodology for training food image classification and semantic segmentation models.
The proposed methodology is based on a multiple instance learning approach in combination with an attention-based mechanism.
We conduct experiments on two meta-classes within the FoodSeg103 data set to verify the feasibility of the proposed approach.
arXiv Detail & Related papers (2023-08-22T13:59:47Z)
- Single-Stage Heavy-Tailed Food Classification [7.800379384628357]
We introduce a novel single-stage heavy-tailed food classification framework.
Our method is evaluated on two heavy-tailed food benchmark datasets, Food101-LT and VFN-LT.
arXiv Detail & Related papers (2023-07-01T00:45:35Z)
- A Large-Scale Benchmark for Food Image Segmentation [62.28029856051079]
We build a new food image dataset FoodSeg103 (and its extension FoodSeg154) containing 9,490 images.
We annotate these images with 154 ingredient classes and each image has an average of 6 ingredient labels and pixel-wise masks.
We propose a multi-modality pre-training approach called ReLeM that explicitly equips a segmentation model with rich and semantic food knowledge.
arXiv Detail & Related papers (2021-05-12T03:00:07Z)
- Cross-Modal Food Retrieval: Learning a Joint Embedding of Food Images and Recipes with Semantic Consistency and Attention Mechanism [70.85894675131624]
We learn an embedding of images and recipes in a common feature space, such that the corresponding image-recipe embeddings lie close to one another.
We propose Semantic-Consistent and Attention-based Networks (SCAN), which regularize the embeddings of the two modalities through aligning output semantic probabilities.
We show that we can outperform several state-of-the-art cross-modal retrieval strategies for food images and cooking recipes by a significant margin.
arXiv Detail & Related papers (2020-03-09T07:41:17Z)