Synthesizing Knowledge-enhanced Features for Real-world Zero-shot Food
Detection
- URL: http://arxiv.org/abs/2402.09242v1
- Date: Wed, 14 Feb 2024 15:32:35 GMT
- Title: Synthesizing Knowledge-enhanced Features for Real-world Zero-shot Food Detection
- Authors: Pengfei Zhou, Weiqing Min, Jiajun Song, Yang Zhang, Shuqiang Jiang
- Abstract summary: Food detection needs Zero-Shot Detection (ZSD) on novel unseen food objects to support real-world scenarios.
We first benchmark the task of Zero-Shot Food Detection (ZSFD) by introducing the FOWA dataset with rich attribute annotations.
We propose a novel framework, ZSFDet, to tackle fine-grained problems by exploiting the interaction between complex attributes.
- Score: 37.866458336327184
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Food computing brings various perspectives to computer vision like
vision-based food analysis for nutrition and health. As a fundamental task in
food computing, food detection needs Zero-Shot Detection (ZSD) on novel unseen
food objects to support real-world scenarios, such as intelligent kitchens and
smart restaurants. Therefore, we first benchmark the task of Zero-Shot Food
Detection (ZSFD) by introducing the FOWA dataset with rich attribute annotations.
Unlike general ZSD, fine-grained problems in ZSFD, such as high inter-class
similarity, make synthesized features hard to separate. The complexity of food
semantic attributes further makes it difficult for current ZSD methods to
distinguish various food categories. To address these problems, we propose a
novel framework, ZSFDet, to tackle fine-grained problems by exploiting the
interaction between complex
attributes. Specifically, we model the correlation between food categories and
attributes in ZSFDet by multi-source graphs to provide prior knowledge for
distinguishing fine-grained features. Within ZSFDet, Knowledge-Enhanced Feature
Synthesizer (KEFS) learns knowledge representation from multiple sources (e.g.,
ingredients correlation from knowledge graph) via the multi-source graph
fusion. Conditioned on the fused semantic knowledge representation, the
region feature diffusion model in KEFS can generate fine-grained features for
training an effective zero-shot detector. Extensive evaluations demonstrate
the superior performance of our method ZSFDet on FOWA and the widely-used food
dataset UECFOOD-256, with significant improvements of 1.8% and 3.7% in ZSD mAP
compared with the strong baseline RRFS. Further experiments on PASCAL VOC and
MS COCO show that enhancing semantic knowledge also improves performance on
general ZSD. Code and dataset are available at
https://github.com/LanceZPF/KEFS.
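As a rough illustration of the idea behind KEFS (fuse multi-source semantic knowledge, then condition a region-feature generator on the result), the toy Python below sketches the two steps. All names, dimensions, and embeddings here are invented, and a single linear map stands in for the paper's region-feature diffusion model; see the repository above for the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM, FEAT_DIM, N_SAMPLES = 16, 32, 8

def fuse_knowledge(word_emb, attr_emb, graph_emb):
    """Fuse multiple semantic sources into one conditioning vector.
    Simple averaging stands in for the paper's learned
    multi-source graph fusion."""
    return np.mean([word_emb, attr_emb, graph_emb], axis=0)

def synthesize_features(cond, weight, n=N_SAMPLES):
    """Generate region features for an unseen class, conditioned on
    the fused knowledge vector. One linear map over [noise; cond]
    stands in for the iterative diffusion-based synthesizer."""
    noise = rng.standard_normal((n, EMBED_DIM))
    cond_batch = np.tile(cond, (n, 1))
    return np.concatenate([noise, cond_batch], axis=1) @ weight

# Hypothetical embeddings for one unseen food class.
word_emb = rng.standard_normal(EMBED_DIM)    # word-vector semantics
attr_emb = rng.standard_normal(EMBED_DIM)    # attribute annotations (FOWA-style)
graph_emb = rng.standard_normal(EMBED_DIM)   # ingredient knowledge-graph node

cond = fuse_knowledge(word_emb, attr_emb, graph_emb)
W = rng.standard_normal((2 * EMBED_DIM, FEAT_DIM)) * 0.1
features = synthesize_features(cond, W)
print(features.shape)  # (8, 32)
```

The synthesized features would then serve as training data for the unseen-class branch of a zero-shot detector, which is the role the diffusion model plays in KEFS.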
Related papers
- MetaFood3D: Large 3D Food Object Dataset with Nutrition Values [53.24500333363066]
This dataset consists of 637 meticulously labeled 3D food objects across 108 categories, featuring detailed nutrition information, weight, and food codes linked to a comprehensive nutrition database.
Experimental results demonstrate our dataset's significant potential for improving algorithm performance, highlight the challenging gap between video captures and 3D scanned data, and show the strength of the MetaFood3D dataset in high-quality data generation, simulation, and augmentation.
arXiv Detail & Related papers (2024-09-03T15:02:52Z)
- RoDE: Linear Rectified Mixture of Diverse Experts for Food Large Multi-Modal Models [96.43285670458803]
Uni-Food is a unified food dataset that comprises over 100,000 images with various food labels.
Uni-Food is designed to provide a more holistic approach to food data analysis.
We introduce a novel Linear Rectification Mixture of Diverse Experts (RoDE) approach to address the inherent challenges of food-related multitasking.
arXiv Detail & Related papers (2024-07-17T16:49:34Z)
- SeeDS: Semantic Separable Diffusion Synthesizer for Zero-shot Food Detection [38.57712277980073]
We propose the Semantic Separable Diffusion Synthesizer (SeeDS) framework for Zero-Shot Food Detection (ZSFD).
SeeDS consists of two modules: a Semantic Separable Synthesizer Module (S³M) and a Region Feature Denoising Diffusion Model (RFDDM).
arXiv Detail & Related papers (2023-10-07T05:29:18Z)
- Towards Building a Food Knowledge Graph for Internet of Food [66.57235827087092]
We review the evolution of food knowledge organization, from food classification to food knowledge graphs.
Food knowledge graphs play an important role in food search and Question Answering (QA), personalized dietary recommendation, food analysis and visualization.
Future directions for food knowledge graphs cover several fields such as multimodal food knowledge graphs and food intelligence.
arXiv Detail & Related papers (2021-07-13T06:26:53Z)
- Visual Aware Hierarchy Based Food Recognition [10.194167945992938]
We propose a new two-step food recognition system using Convolutional Neural Networks (CNNs) as the backbone architecture.
The food localization step is based on an implementation of the Faster R-CNN method to identify food regions.
In the food classification step, visually similar food categories can be clustered together automatically to generate a hierarchical structure.
arXiv Detail & Related papers (2020-12-06T20:25:31Z)
- ISIA Food-500: A Dataset for Large-Scale Food Recognition via Stacked Global-Local Attention Network [50.7720194859196]
We introduce the ISIA Food-500 dataset, with 500 categories drawn from the Wikipedia list and 399,726 images.
This dataset surpasses existing popular benchmark datasets in category coverage and data volume.
We propose a stacked global-local attention network, which consists of two sub-networks for food recognition.
arXiv Detail & Related papers (2020-08-13T02:48:27Z)
- MCEN: Bridging Cross-Modal Gap between Cooking Recipes and Dish Images with Latent Variable Model [28.649961369386148]
We present Modality-Consistent Embedding Network (MCEN) that learns modality-invariant representations by projecting images and texts to the same embedding space.
Our method learns the cross-modal alignments during training but computes embeddings of different modalities independently at inference time for the sake of efficiency.
arXiv Detail & Related papers (2020-04-02T16:00:10Z)
- Cross-Modal Food Retrieval: Learning a Joint Embedding of Food Images and Recipes with Semantic Consistency and Attention Mechanism [70.85894675131624]
We learn an embedding of images and recipes in a common feature space, such that the corresponding image-recipe embeddings lie close to one another.
We propose Semantic-Consistent and Attention-based Networks (SCAN), which regularize the embeddings of the two modalities through aligning output semantic probabilities.
We show that we can outperform several state-of-the-art cross-modal retrieval strategies for food images and cooking recipes by a significant margin.
arXiv Detail & Related papers (2020-03-09T07:41:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.