SalientFusion: Context-Aware Compositional Zero-Shot Food Recognition
- URL: http://arxiv.org/abs/2509.03873v1
- Date: Thu, 04 Sep 2025 04:22:36 GMT
- Title: SalientFusion: Context-Aware Compositional Zero-Shot Food Recognition
- Authors: Jiajun Song, Xiaoou Liu
- Abstract summary: We propose the task of Compositional Zero-Shot Food Recognition (CZSFR). Redundant background information distracts models from learning meaningful food features. Role confusion between staple and side dishes leads to misclassification. Semantic bias in a single attribute can confuse a model's understanding.
- Score: 4.982130518684668
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Food recognition has gained significant attention, but the rapid emergence of new dishes requires methods for recognizing unseen food categories, motivating Zero-Shot Food Learning (ZSFL). We propose the task of Compositional Zero-Shot Food Recognition (CZSFR), where cuisines and ingredients naturally align with the attributes and objects of Compositional Zero-Shot Learning (CZSL). However, CZSFR faces three challenges: (1) redundant background information distracts models from learning meaningful food features, (2) role confusion between staple and side dishes leads to misclassification, and (3) semantic bias in a single attribute can confuse a model's understanding. We therefore propose SalientFusion, a context-aware CZSFR method with two components: SalientFormer, which removes background redundancy and uses depth features to resolve role confusion, and DebiasAT, which reduces semantic bias by aligning prompts with visual features. Using our proposed benchmarks, CZSFood-90 and CZSFood-164, we show that SalientFusion achieves state-of-the-art results on these benchmarks as well as on the most popular datasets for general CZSL. The code is available at https://github.com/Jiajun-RUC/SalientFusion.
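For readers unfamiliar with the CZSL setup that CZSFR builds on, the sketch below illustrates the general idea: every class is a (cuisine, ingredient) composition rendered as a text prompt, and an image is classified by cosine similarity between its embedding and the prompt embeddings. The encoders, prompt template, and label sets are hypothetical placeholders; this is not the SalientFusion implementation, which additionally removes background redundancy (SalientFormer) and debiases the prompts (DebiasAT).

```python
# Minimal CZSL-style sketch (hypothetical encoders and label sets; not the
# SalientFusion code): classify an image by matching it against composed
# (cuisine, ingredient) prompt embeddings.
import torch
import torch.nn.functional as F

# Toy stand-ins for a CLIP-like text/image encoder pair.
text_encoder = torch.nn.EmbeddingBag(5000, 512)   # token ids -> 512-d prompt embedding
image_encoder = torch.nn.Linear(2048, 512)        # pooled visual feature -> 512-d

cuisines = ["sichuan", "cantonese", "italian"]    # play the role of CZSL "attributes"
ingredients = ["tofu", "noodles", "shrimp"]       # play the role of CZSL "objects"
pairs = [(c, i) for c in cuisines for i in ingredients]

def embed_prompt(cuisine, ingredient):
    """Embed a composed prompt such as 'a photo of sichuan tofu'."""
    tokens = torch.tensor([[hash(w) % 5000 for w in f"a photo of {cuisine} {ingredient}".split()]])
    return F.normalize(text_encoder(tokens), dim=-1)

def classify(image_feature):
    """Return the (cuisine, ingredient) pair whose prompt best matches the image."""
    img = F.normalize(image_encoder(image_feature), dim=-1)          # (1, 512)
    prompt_bank = torch.cat([embed_prompt(c, i) for c, i in pairs])  # (9, 512)
    scores = img @ prompt_bank.T                                     # cosine similarities
    return pairs[scores.argmax().item()]

print(classify(torch.randn(1, 2048)))
```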
Related papers
- LLMs-based Augmentation for Domain Adaptation in Long-tailed Food Datasets [54.527878056610156]
We present a framework empowered with large language models (LLMs) to address these challenges in food recognition.
We first leverage LLMs to parse food images and generate food titles and ingredients.
Then, we project the generated texts and food images from different domains into a shared embedding space to maximize the pair similarities.
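As a rough illustration of that last step (assumption: a standard CLIP-style symmetric contrastive objective, not necessarily the loss used in the cited paper), projecting matched text and image features into a shared space and maximizing pair similarity can look like this:

```python
# Minimal sketch of a symmetric contrastive alignment loss over matched
# image/text pairs (illustrative assumption, not the cited paper's code).
import torch
import torch.nn.functional as F

def pair_alignment_loss(img_feats, txt_feats, temperature=0.07):
    """img_feats, txt_feats: (B, D) features of matched image/text pairs."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.T / temperature           # (B, B) similarity matrix
    targets = torch.arange(len(img))             # i-th image matches i-th text
    return (F.cross_entropy(logits, targets) +   # image -> text direction
            F.cross_entropy(logits.T, targets)) / 2   # text -> image direction

loss = pair_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```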
arXiv Detail & Related papers (2025-11-20T04:38:56Z) - SketchFusion: Learning Universal Sketch Features through Fusing Foundation Models [80.90808879991182]
Drawing on systematic analysis, we uncover two fundamental limitations of foundation models for sketch understanding.
We address these limitations by strategically combining SD with CLIP, whose strong semantic understanding naturally compensates for SD's spatial-frequency biases.
By dynamically injecting CLIP features into SD's denoising process and adaptively aggregating features across semantic levels, our method achieves state-of-the-art performance in sketch retrieval.
arXiv Detail & Related papers (2025-03-18T10:47:46Z) - OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation [43.65207396061584]
OVFoodSeg is a framework that enhances text embeddings with visual context.
The training process of OVFoodSeg is divided into two stages: the pre-training of FoodLearner and the subsequent learning phase for segmentation.
By addressing the deficiencies of previous models, OVFoodSeg demonstrates a significant improvement, achieving a 4.9% increase in mean Intersection over Union (mIoU) on the FoodSeg103 dataset.
arXiv Detail & Related papers (2024-04-01T18:26:29Z) - Synthesizing Knowledge-enhanced Features for Real-world Zero-shot Food Detection [37.866458336327184]
Food detection needs Zero-Shot Detection (ZSD) on novel unseen food objects to support real-world scenarios.
We first benchmark the task of Zero-Shot Food Detection (ZSFD) by introducing the FOWA dataset with rich attribute annotations.
We propose a novel framework, ZSFDet, to tackle fine-grained problems by exploiting the interaction between complex attributes.
arXiv Detail & Related papers (2024-02-14T15:32:35Z) - SeeDS: Semantic Separable Diffusion Synthesizer for Zero-shot Food Detection [38.57712277980073]
We propose the Semantic Separable Diffusion Synthesizer (SeeDS) framework for Zero-Shot Food Detection (ZSFD).
SeeDS consists of two modules: a Semantic Separable Synthesizer Module (S$^3$M) and a Region Feature Denoising Diffusion Model (RFDDM).
arXiv Detail & Related papers (2023-10-07T05:29:18Z) - Hierarchical Visual Primitive Experts for Compositional Zero-Shot Learning [52.506434446439776]
Compositional zero-shot learning (CZSL) aims to recognize compositions with prior knowledge of known primitives (attribute and object).
We propose a simple and scalable framework called Composition Transformer (CoT) to address these issues.
Our method achieves SoTA performance on several benchmarks, including MIT-States, C-GQA, and VAW-CZSL.
arXiv Detail & Related papers (2023-08-08T03:24:21Z) - Transferring Knowledge for Food Image Segmentation using Transformers and Convolutions [65.50975507723827]
Food image segmentation is an important task that has ubiquitous applications, such as estimating the nutritional value of a plate of food.
One challenge is that food items can overlap and mix, making them difficult to distinguish.
Two models are trained and compared, one based on convolutional neural networks and the other on Bidirectional Encoder representation from Image Transformers (BEiT).
The BEiT model outperforms the previous state-of-the-art model by achieving a mean intersection over union of 49.4 on FoodSeg103.
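For reference, mean Intersection over Union is computed per class and then averaged; the following is a hypothetical helper (not the cited paper's evaluation code) showing the calculation on toy label maps:

```python
# Sketch of the mean Intersection-over-Union metric quoted above.
import numpy as np

def mean_iou(pred, gt, num_classes):
    """pred, gt: integer label maps of identical shape; averages IoU over classes present in either map."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy 2x3 label maps with classes {0, 1}:
print(mean_iou(np.array([[0, 1, 1], [0, 0, 1]]),
               np.array([[0, 1, 0], [0, 1, 1]]), num_classes=2))  # -> 0.5
```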
arXiv Detail & Related papers (2023-06-15T15:38:10Z) - Cross-modal Representation Learning for Zero-shot Action Recognition [67.57406812235767]
We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR).
Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner.
Experimental results show our model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on the UCF101, HMDB51, and ActivityNet benchmark datasets.
arXiv Detail & Related papers (2022-05-03T17:39:27Z) - A Large-Scale Benchmark for Food Image Segmentation [62.28029856051079]
We build a new food image dataset, FoodSeg103 (and its extension FoodSeg154), containing 9,490 images.
We annotate these images with 154 ingredient classes; each image has an average of 6 ingredient labels and pixel-wise masks.
We propose a multi-modality pre-training approach called ReLeM that explicitly equips a segmentation model with rich, semantic food knowledge.
arXiv Detail & Related papers (2021-05-12T03:00:07Z) - Cross-Modal Food Retrieval: Learning a Joint Embedding of Food Images and Recipes with Semantic Consistency and Attention Mechanism [70.85894675131624]
We learn an embedding of images and recipes in a common feature space, such that the corresponding image-recipe embeddings lie close to one another.
We propose Semantic-Consistent and Attention-based Networks (SCAN), which regularize the embeddings of the two modalities by aligning output semantic probabilities.
We show that we can outperform several state-of-the-art cross-modal retrieval strategies for food images and cooking recipes by a significant margin.
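As a minimal illustration of retrieval in such a joint embedding space (illustrative only; SCAN's encoders and semantic-consistency regularizer are not reproduced here), a query image embedding can be matched to its nearest recipe embedding by cosine similarity:

```python
# Nearest-neighbour retrieval in a shared image-recipe embedding space (sketch).
import torch
import torch.nn.functional as F

def retrieve_recipe(query_img_emb, recipe_embs):
    """Return the index of the recipe embedding closest (by cosine) to the query image."""
    q = F.normalize(query_img_emb, dim=-1)   # (D,)
    r = F.normalize(recipe_embs, dim=-1)     # (N, D)
    return int(torch.argmax(r @ q))          # nearest neighbour in the shared space

idx = retrieve_recipe(torch.randn(512), torch.randn(1000, 512))
```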
arXiv Detail & Related papers (2020-03-09T07:41:17Z)