Understanding Image2Video Domain Shift in Food Segmentation: An Instance-level Analysis on Apples
- URL: http://arxiv.org/abs/2602.08491v2
- Date: Tue, 10 Feb 2026 12:17:02 GMT
- Authors: Keonvin Park, Aditya Pal, Jin Hong Mok,
- Abstract summary: In real-world applications such as food monitoring and instance counting, segmentation outputs must be temporally consistent, yet image-trained models often break down when deployed on videos. We analyze this failure through an instance segmentation and tracking perspective, focusing on apples as a representative food category. Our results reveal that high frame-wise segmentation accuracy does not translate to stable instance identities over time.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Food segmentation models trained on static images have achieved strong performance on benchmark datasets; however, their reliability in video settings remains poorly understood. In real-world applications such as food monitoring and instance counting, segmentation outputs must be temporally consistent, yet image-trained models often break down when deployed on videos. In this work, we analyze this failure through an instance segmentation and tracking perspective, focusing on apples as a representative food category. Models are trained solely on image-level food segmentation data and evaluated on video sequences using an instance segmentation with tracking-by-matching framework, enabling object-level temporal analysis. Our results reveal that high frame-wise segmentation accuracy does not translate to stable instance identities over time. Temporal appearance variations, particularly illumination changes, specular reflections, and texture ambiguity, lead to mask flickering and identity fragmentation, resulting in significant errors in apple counting. These failures are largely overlooked by conventional image-based metrics, which substantially overestimate real-world video performance. Beyond diagnosing the problem, we examine practical remedies that do not require full video supervision, including post-hoc temporal regularization and self-supervised temporal consistency objectives. Our findings suggest that the root cause of failure lies in image-centric training objectives that ignore temporal coherence, rather than model capacity. This study highlights a critical evaluation gap in food segmentation research and motivates temporally-aware learning and evaluation protocols for video-based food analysis.
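The tracking-by-matching framework mentioned in the abstract can be sketched as a greedy IoU association step between consecutive frames. This is an illustrative toy, not the paper's implementation: the set-based mask representation, the 0.5 threshold, and the greedy (rather than Hungarian) assignment are all assumptions. It does, however, show exactly where identity fragmentation arises — whenever a flickering mask drops below the IoU threshold, a new track ID is spawned and the instance count inflates.

```python
# Toy sketch of tracking-by-matching: per-frame instance masks are linked
# across frames by greedy IoU association. Masks are sets of pixel coords.

def iou(mask_a, mask_b):
    """Intersection-over-union of two pixel-coordinate sets."""
    inter = len(mask_a & mask_b)
    union = len(mask_a | mask_b)
    return inter / union if union else 0.0

def match_instances(prev_tracks, curr_masks, iou_thresh=0.5):
    """Greedily assign current-frame masks to existing track IDs.

    prev_tracks: dict track_id -> mask (pixel set) from the previous frame.
    curr_masks:  list of masks detected in the current frame.
    Returns dict track_id -> mask. Unmatched masks start new tracks, which
    is exactly where mask flickering turns into identity fragmentation.
    """
    next_id = max(prev_tracks, default=-1) + 1
    assigned = {}
    unused = dict(prev_tracks)
    for mask in curr_masks:
        # Pick the best-overlapping unused previous track above the threshold.
        best_id, best_iou = None, iou_thresh
        for tid, pmask in unused.items():
            score = iou(mask, pmask)
            if score >= best_iou:
                best_id, best_iou = tid, score
        if best_id is None:
            # No sufficient overlap: spawn a new identity (count inflates here).
            best_id, next_id = next_id, next_id + 1
        else:
            del unused[best_id]
        assigned[best_id] = mask
    return assigned
```

Under this scheme an apple whose mask shifts slightly between frames keeps its ID, while one whose mask briefly collapses (e.g. under a specular highlight) is re-issued a fresh ID — precisely the failure mode that frame-wise metrics miss.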
Related papers
- Loss Knows Best: Detecting Annotation Errors in Videos via Loss Trajectories [11.597228102492672]
We propose a model-agnostic method for detecting annotation errors in video datasets. Our method does not require ground truth on annotation errors and is generalizable across datasets. Experiments on EgoPER and Cholec80 demonstrate strong detection performance, effectively identifying subtle inconsistencies such as mislabeling and frame disordering.
arXiv Detail & Related papers (2026-02-16T19:53:58Z)
- BenchSeg: A Large-Scale Dataset and Benchmark for Multi-View Food Video Segmentation [25.750204283738054]
We introduce BenchSeg, a novel multi-view food video segmentation dataset and benchmark. BenchSeg aggregates 55 dish scenes with 25,284 meticulously annotated frames, capturing each dish under free 360° camera motion. We train a diverse set of 20 state-of-the-art segmentation models on the existing FoodSeg103 dataset and evaluate them on BenchSeg.
arXiv Detail & Related papers (2026-01-12T14:32:51Z)
- LLMs-based Augmentation for Domain Adaptation in Long-tailed Food Datasets [54.527878056610156]
We present a framework empowered with large language models (LLMs) to address these challenges in food recognition. We first leverage LLMs to parse food images to generate food titles and ingredients. Then, we project the generated texts and food images from different domains to a shared embedding space to maximize the pair similarities.
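The shared-embedding idea in this summary — projecting text and image features into one space and maximizing pair similarities — is typically measured with cosine similarity. A minimal sketch, assuming plain Python lists as stand-in embedding vectors (the function names and the mean-of-pairs score are illustrative, not this paper's objective):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def pairwise_match_score(text_embs, image_embs):
    """Mean similarity of aligned (text, image) pairs — the quantity an
    alignment objective of this kind would push toward 1.0."""
    sims = [cosine_similarity(t, i) for t, i in zip(text_embs, image_embs)]
    return sum(sims) / len(sims) if sims else 0.0
```

In a contrastive setup, matched pairs would additionally be pushed apart from mismatched ones; the score above captures only the attraction term.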
arXiv Detail & Related papers (2025-11-20T04:38:56Z)
- Temporal Prompting Matters: Rethinking Referring Video Object Segmentation [64.82333675385802]
Referring Video Object Segmentation (RVOS) aims to segment the object referred to by the query sentence in the video. Most existing methods require end-to-end training with dense mask annotations. We propose a Temporal Prompt Generation and Selection (Tenet) framework to address the referring and video factors.
arXiv Detail & Related papers (2025-10-08T17:59:57Z)
- A smart fridge with AI-enabled food computing [0.0]
The Internet of Things (IoT) plays a crucial role in enabling seamless connectivity and intelligent home automation, particularly in food management. By integrating IoT with computer vision, the smart fridge employs an ESP32-CAM to establish a monitoring subsystem that enhances food management efficiency through real-time food detection, inventory tracking, and temperature monitoring.
arXiv Detail & Related papers (2025-09-09T05:29:00Z)
- Understanding Long Videos via LLM-Powered Entity Relation Graphs [51.13422967711056]
GraphVideoAgent is a framework that maps and monitors the evolving relationships between visual entities throughout the video sequence. Our approach demonstrates remarkable effectiveness when tested against industry benchmarks.
arXiv Detail & Related papers (2025-01-27T10:57:24Z)
- Visual Context-Aware Person Fall Detection [52.49277799455569]
We present a segmentation pipeline to semi-automatically separate individuals and objects in images.
Background objects such as beds, chairs, or wheelchairs can challenge fall detection systems, leading to false positive alarms.
We demonstrate that object-specific contextual transformations during training effectively mitigate this challenge.
arXiv Detail & Related papers (2024-04-11T19:06:36Z)
- From Canteen Food to Daily Meals: Generalizing Food Recognition to More Practical Scenarios [92.58097090916166]
We present two new benchmarks, namely DailyFood-172 and DailyFood-16, designed to curate food images from everyday meals.
These two datasets are used to evaluate the transferability of approaches from the well-curated food image domain to the everyday-life food image domain.
arXiv Detail & Related papers (2024-03-12T08:32:23Z)
- NutritionVerse: Empirical Study of Various Dietary Intake Estimation Approaches [59.38343165508926]
Accurate dietary intake estimation is critical for informing policies and programs to support healthy eating.
Recent work has focused on using computer vision and machine learning to automatically estimate dietary intake from food images.
We introduce NutritionVerse- Synth, the first large-scale dataset of 84,984 synthetic 2D food images with associated dietary information.
We also collect a real image dataset, NutritionVerse-Real, containing 889 images of 251 dishes to evaluate realism.
arXiv Detail & Related papers (2023-09-14T13:29:41Z)
- Food Image Classification and Segmentation with Attention-based Multiple Instance Learning [51.279800092581844]
The paper presents a weakly supervised methodology for training food image classification and semantic segmentation models.
The proposed methodology is based on a multiple instance learning approach in combination with an attention-based mechanism.
We conduct experiments on two meta-classes within the FoodSeg103 data set to verify the feasibility of the proposed approach.
arXiv Detail & Related papers (2023-08-22T13:59:47Z)
- Rethinking Cooking State Recognition with Vision Transformers [0.0]
Self-attention mechanism of Vision Transformer (ViT) architecture is proposed for the Cooking State Recognition task.
The proposed approach encapsulates the globally salient features from images, while also exploiting the weights learned from a larger dataset.
Our framework has an accuracy of 94.3%, which significantly outperforms the state-of-the-art.
arXiv Detail & Related papers (2022-12-16T17:06:28Z)
- A novel illumination condition varied image dataset-Food Vision Dataset (FVD) for fair and reliable consumer acceptability predictions from food [0.0]
We present a novel dataset, the Food Vision dataset (FVD), to quantify illumination effects on human and computer perception. FVD consists of 675 images captured under 3 different power and 5 different temperature settings, every alternate day over five such days.
arXiv Detail & Related papers (2022-09-14T22:46:42Z)
- Temporal Transductive Inference for Few-Shot Video Object Segmentation [27.140141181513425]
Few-shot video object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training.
Key to our approach is the use of both global and local temporal constraints.
Empirically, our model outperforms state-of-the-art meta-learning approaches in terms of mean intersection over union on YouTube-VIS by 2.8%.
arXiv Detail & Related papers (2022-03-27T14:08:30Z)
- Does Thermal data make the detection systems more reliable? [1.2891210250935146]
We propose a comprehensive detection system based on a multimodal-collaborative framework.
This framework learns from both RGB (from visual cameras) and thermal (from Infrared cameras) data.
Our empirical results show that while the improvement in accuracy is nominal, the value lies in challenging and extremely difficult edge cases.
arXiv Detail & Related papers (2021-11-09T15:04:34Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- A Robust Illumination-Invariant Camera System for Agricultural Applications [7.349727826230863]
Object detection and semantic segmentation are two of the most widely adopted deep learning algorithms in agricultural applications.
We present a high throughput robust active lighting-based camera system that generates consistent images in all lighting conditions.
On average, deep nets for object detection trained on consistent data required nearly four times less data to achieve a similar level of accuracy.
arXiv Detail & Related papers (2021-01-06T18:50:53Z)
- Coherent Loss: A Generic Framework for Stable Video Segmentation [103.78087255807482]
We investigate how a jittering artifact degrades the visual quality of video segmentation results.
We propose a Coherent Loss with a generic framework to enhance the performance of a neural network against jittering artifacts.
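As a rough illustration of the kind of temporal-consistency penalty such a loss targets — this is a generic sketch under our own simplifying assumptions (soft per-pixel foreground probabilities, no motion compensation), not the paper's actual Coherent Loss:

```python
def temporal_consistency_penalty(prob_seq):
    """Mean squared difference between consecutive frames' per-pixel
    foreground probabilities. Jitter-free predictions score 0.0; mask
    flickering between frames drives the penalty up.

    prob_seq: list of frames, each a flat list of probabilities in [0, 1].
    """
    total, count = 0.0, 0
    for prev_frame, curr_frame in zip(prob_seq, prob_seq[1:]):
        for p, c in zip(prev_frame, curr_frame):
            total += (p - c) ** 2
            count += 1
    return total / count if count else 0.0
```

A term like this, added to a frame-wise segmentation loss, penalizes exactly the jittering artifact the entry describes; real formulations typically also compensate for genuine object motion so that only spurious changes are punished.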
arXiv Detail & Related papers (2020-10-25T10:48:28Z)
- ISIA Food-500: A Dataset for Large-Scale Food Recognition via Stacked Global-Local Attention Network [50.7720194859196]
We introduce the dataset ISIA Food-500 with 500 categories from the list in Wikipedia and 399,726 images.
This dataset surpasses existing popular benchmark datasets by category coverage and data volume.
We propose a stacked global-local attention network, which consists of two sub-networks for food recognition.
arXiv Detail & Related papers (2020-08-13T02:48:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.