BenchSeg: A Large-Scale Dataset and Benchmark for Multi-View Food Video Segmentation
- URL: http://arxiv.org/abs/2601.07581v2
- Date: Sun, 18 Jan 2026 11:08:11 GMT
- Title: BenchSeg: A Large-Scale Dataset and Benchmark for Multi-View Food Video Segmentation
- Authors: Ahmad AlMughrabi, Guillermo Rivo, Carlos Jiménez-Farfán, Umair Haroon, Farid Al-Areqi, Hyunjun Jung, Benjamin Busam, Ricardo Marques, Petia Radeva
- Abstract summary: We introduce BenchSeg, a novel multi-view food video segmentation dataset and benchmark.
BenchSeg aggregates 55 dish scenes with 25,284 meticulously annotated frames, capturing each dish under free 360° camera motion.
We evaluate a diverse set of 20 state-of-the-art segmentation models on the existing FoodSeg103 dataset and then evaluate them on BenchSeg.
- Score: 25.750204283738054
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Food image segmentation is a critical task for dietary analysis, enabling accurate estimation of food volume and nutrients. However, current methods suffer from limited multi-view data and poor generalization to new viewpoints. We introduce BenchSeg, a novel multi-view food video segmentation dataset and benchmark. BenchSeg aggregates 55 dish scenes (from Nutrition5k, Vegetables & Fruits, MetaFood3D, and FoodKit) with 25,284 meticulously annotated frames, capturing each dish under free 360° camera motion. We evaluate a diverse set of 20 state-of-the-art segmentation models (e.g., SAM-based, transformer, CNN, and large multimodal) on the existing FoodSeg103 dataset and evaluate them (alone and combined with video-memory modules) on BenchSeg. Quantitative and qualitative results demonstrate that while standard image segmenters degrade sharply under novel viewpoints, memory-augmented methods maintain temporal consistency across frames. Our best model, a combination of SeTR-MLA and XMem2, outperforms prior work (e.g., improving over FoodMem by ~2.63% mAP), offering new insights into food segmentation and tracking for dietary analysis. In addition to frame-wise spatial accuracy, we introduce a dedicated temporal evaluation protocol that explicitly quantifies segmentation stability over time through continuity, flicker rate, and IoU drift metrics. This allows us to reveal failure modes that remain invisible under standard per-frame evaluations. We release BenchSeg to foster future research. The project page, including the dataset annotations and the food segmentation models, can be found at https://amughrabi.github.io/benchseg.
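The temporal evaluation protocol above names three stability metrics: continuity, flicker rate, and IoU drift. As a rough illustration only (the paper's exact definitions may differ, and the function names, threshold, and formulas below are assumptions rather than the official BenchSeg protocol), a per-object stability check over a sequence of binary masks might look like this:

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two binary masks of shape (H, W)."""
    union = np.logical_or(a, b).sum()
    if union == 0:                       # object absent in both frames
        return 1.0
    return float(np.logical_and(a, b).sum()) / float(union)

def temporal_stability(masks: list, flicker_thresh: float = 0.5) -> dict:
    """Illustrative frame-to-frame stability metrics for one tracked food item.

    `masks` is a chronologically ordered list of binary masks for the same
    object under camera motion. The threshold and the exact formulas are
    assumptions, not the benchmark's official definitions.
    """
    pair_ious = np.array([mask_iou(masks[t], masks[t + 1])
                          for t in range(len(masks) - 1)])
    return {
        "continuity": float(pair_ious.mean()),                        # mean frame-to-frame overlap
        "flicker_rate": float((pair_ious < flicker_thresh).mean()),   # fraction of abrupt changes
        "iou_drift": float(pair_ious[0] - pair_ious[-1]),             # degradation from start to end
    }
```

Running such a routine on the masks produced by a per-frame segmenter versus a memory-augmented tracker (e.g., an XMem2-style module) is one way to surface the flicker that frame-wise mAP alone does not reveal.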
Related papers
- Understanding Image2Video Domain Shift in Food Segmentation: An Instance-level Analysis on Apples [0.2366840032676479]
In real-world applications such as food monitoring and instance counting, segmentation outputs must be temporally consistent.
We analyze this failure through an instance segmentation and tracking perspective, focusing on apples as a representative food category.
Our results reveal that high frame-wise segmentation accuracy does not translate to stable instance identities over time.
arXiv Detail & Related papers (2026-02-09T10:43:51Z)
- LLMs-based Augmentation for Domain Adaptation in Long-tailed Food Datasets [54.527878056610156]
We present a framework empowered with large language models (LLMs) to address these challenges in food recognition.
We first leverage LLMs to parse food images to generate food titles and ingredients.
Then, we project the generated texts and food images from different domains to a shared embedding space to maximize the pair similarities.
arXiv Detail & Related papers (2025-11-20T04:38:56Z)
- EPFL-Smart-Kitchen-30: Densely annotated cooking dataset with 3D kinematics to challenge video and language models [69.44009961659668]
We introduce the EPFL-Smart-Kitchen-30 dataset, collected in a motion capture platform inside a kitchen environment.
Nine static RGB-D cameras, inertial measurement units (IMUs) and one head-mounted HoloLens2 headset were used to capture 3D hand, body, and eye movements.
The dataset is a multi-view action dataset with synchronized exocentric, egocentric, depth, IMUs, eye gaze, body and hand kinematics spanning 29.7 hours of 16 subjects cooking four different recipes.
arXiv Detail & Related papers (2025-06-02T12:46:44Z)
- MetaFood3D: 3D Food Dataset with Nutrition Values [52.16894900096017]
This dataset consists of 743 meticulously scanned and labeled 3D food objects across 131 categories.
Our MetaFood3D dataset emphasizes intra-class diversity and includes rich modalities such as textured mesh files, RGB-D videos, and segmentation masks.
arXiv Detail & Related papers (2024-09-03T15:02:52Z)
- FoodMem: Near Real-time and Precise Food Video Segmentation [4.282795945742752]
Current limitations lead to inaccurate nutritional analysis, inefficient crop management, and suboptimal food processing.
This study introduces a robust framework for high-quality, near-real-time segmentation and tracking of food items in videos.
We present FoodMem, a novel framework designed to segment food items from video sequences of 360-degree scenes.
arXiv Detail & Related papers (2024-07-16T19:15:07Z)
- VolETA: One- and Few-shot Food Volume Estimation [4.282795945742752]
We present VolETA, a sophisticated methodology for estimating food volume using 3D generative techniques.
Our approach creates a scaled 3D mesh of food objects using one or a few RGB-D images.
We achieve robust and accurate volume estimations with 10.97% MAPE using the MTF dataset.
arXiv Detail & Related papers (2024-07-01T18:47:15Z)
- Training-Free Robust Interactive Video Object Segmentation [82.05906654403684]
We propose a training-free prompt tracking framework for interactive video object segmentation (I-PT).
We jointly adopt sparse points and boxes tracking, filtering out unstable points and capturing object-wise information.
Our framework has demonstrated robust zero-shot video segmentation results on popular VOS datasets.
arXiv Detail & Related papers (2024-06-08T14:25:57Z)
- FoodLMM: A Versatile Food Assistant using Large Multi-modal Model [96.76271649854542]
Large Multi-modal Models (LMMs) have made impressive progress in many vision-language tasks.
This paper proposes FoodLMM, a versatile food assistant based on LMMs with various capabilities.
We introduce a series of novel task-specific tokens and heads, enabling the model to predict food nutritional values and multiple segmentation masks.
arXiv Detail & Related papers (2023-12-22T11:56:22Z)
- FoodSAM: Any Food Segmentation [10.467966270491228]
We propose a novel framework, called FoodSAM, to address the lack of class-specific information in SAM-generated masks.
FoodSAM integrates the coarse semantic mask with SAM-generated masks to enhance semantic segmentation quality.
FoodSAM stands as the first-ever work to achieve instance, panoptic, and promptable segmentation on food images.
arXiv Detail & Related papers (2023-08-11T04:42:10Z)
- Transferring Knowledge for Food Image Segmentation using Transformers and Convolutions [65.50975507723827]
Food image segmentation is an important task that has ubiquitous applications, such as estimating the nutritional value of a plate of food.
One challenge is that food items can overlap and mix, making them difficult to distinguish.
Two models are trained and compared, one based on convolutional neural networks and the other on Bidirectional Encoder representation from Image Transformers (BEiT).
The BEiT model outperforms the previous state-of-the-art model by achieving a mean intersection over union (mIoU) of 49.4 on FoodSeg103; a minimal mIoU sketch appears after this list.
arXiv Detail & Related papers (2023-06-15T15:38:10Z)
- A Large-Scale Benchmark for Food Image Segmentation [62.28029856051079]
We build a new food image dataset, FoodSeg103 (and its extension FoodSeg154), containing 9,490 images.
We annotate these images with 154 ingredient classes, and each image has an average of 6 ingredient labels and pixel-wise masks.
We propose a multi-modality pre-training approach called ReLeM that explicitly equips a segmentation model with rich and semantic food knowledge.
arXiv Detail & Related papers (2021-05-12T03:00:07Z)
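For the frame-wise spatial-accuracy side of these benchmarks (e.g., the 49.4 mIoU reported for the BEiT model on FoodSeg103 above), a minimal class-averaged IoU computation is sketched below. The array shapes, the handling of absent classes, and the class count mentioned in the comments are illustrative assumptions, not the benchmarks' exact evaluation protocols.

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean IoU over the classes present in either label map.

    `pred` and `gt` are integer label maps of shape (H, W); `num_classes`
    is the size of the label set (e.g., 103 ingredient classes plus
    background for FoodSeg103 -- an illustrative assumption here).
    """
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:                   # class absent in both maps: skip it
            continue
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else 0.0
```

Averaging this value over a test set gives the dataset-level mIoU commonly reported for FoodSeg103-style benchmarks, complementing the temporal-stability sketch shown earlier.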