Implicit-Scale 3D Reconstruction for Multi-Food Volume Estimation from Monocular Images
- URL: http://arxiv.org/abs/2602.13041v1
- Date: Fri, 13 Feb 2026 15:52:39 GMT
- Title: Implicit-Scale 3D Reconstruction for Multi-Food Volume Estimation from Monocular Images
- Authors: Yuhao Chen, Gautham Vinod, Siddeshwar Raghavan, Talha Ibn Mahmud, Bruce Coburn, Jinge Ma, Fengqing Zhu, Jiangpeng He
- Abstract summary: Implicit-Scale 3D Reconstruction from Monocular Multi-Food Images is a benchmark dataset designed to advance geometry-based food portion estimation. This benchmark reframes food portion estimation as an implicit-scale 3D reconstruction problem under monocular observations.
- Score: 21.112563168240737
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Implicit-Scale 3D Reconstruction from Monocular Multi-Food Images, a benchmark dataset designed to advance geometry-based food portion estimation in realistic dining scenarios. Existing dietary assessment methods largely rely on single-image analysis or appearance-based inference, including recent vision-language models, which lack explicit geometric reasoning and are sensitive to scale ambiguity. This benchmark reframes food portion estimation as an implicit-scale 3D reconstruction problem under monocular observations. To reflect real-world conditions, explicit physical references and metric annotations are removed; instead, contextual objects such as plates and utensils are provided, requiring algorithms to infer scale from implicit cues and prior knowledge. The dataset emphasizes multi-food scenes with diverse object geometries, frequent occlusions, and complex spatial arrangements. The benchmark was adopted as a challenge at the MetaFood 2025 Workshop, where multiple teams proposed reconstruction-based solutions. Experimental results show that while strong vision-language baselines achieve competitive performance, geometry-based reconstruction methods provide both improved accuracy and greater robustness, with the top-performing approach achieving 0.21 MAPE in volume estimation and 5.7 L1 Chamfer Distance in geometric accuracy.
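The two headline metrics, MAPE for volume estimation and L1 Chamfer Distance for geometric accuracy, can be sketched in a few lines. This is an illustrative reimplementation under standard definitions, not the benchmark's official evaluation code; the function names and the brute-force nearest-neighbour search are our own.

```python
def mape(pred_volumes, true_volumes):
    """Mean Absolute Percentage Error as a fraction (0.21 means 21%)."""
    return sum(abs(p - t) / abs(t)
               for p, t in zip(pred_volumes, true_volumes)) / len(true_volumes)

def l1_chamfer(points_a, points_b):
    """Symmetric L1 (Manhattan) Chamfer Distance between two 3D point sets."""
    def one_way(src, dst):
        # For each source point, find the L1 distance to its nearest
        # neighbour in the destination set, then average.
        total = 0.0
        for p in src:
            total += min(sum(abs(pc - qc) for pc, qc in zip(p, q)) for q in dst)
        return total / len(src)
    return one_way(points_a, points_b) + one_way(points_b, points_a)

# Toy usage: a prediction of 210 mL against a ground truth of 250 mL.
print(mape([210.0], [250.0]))                              # 0.16
print(l1_chamfer([(0, 0, 0), (1, 1, 1)], [(0, 0, 0), (1, 1, 1)]))  # 0.0
```

Identical point clouds score zero Chamfer Distance, so lower is better for both metrics.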
Related papers
- Size Matters: Reconstructing Real-Scale 3D Models from Monocular Images for Food Portion Estimation [19.138014263791803]
We bridge the gap between 3D computer vision and digital health by proposing a method that recovers a true-to-scale 3D reconstructed object from a monocular image. Our approach leverages rich visual features extracted from models trained on large-scale datasets to estimate the scale of the reconstructed object.
arXiv Detail & Related papers (2026-01-27T20:53:45Z) - TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning [104.66714520975837]
We introduce a geometry-grounded benchmark designed to evaluate compositional spatial reasoning through the lens of the classic Tangram game. We propose the Tangram Construction Expression (TCE), a symbolic geometric framework that grounds tangram assemblies in exact, machine-verifiable coordinate specifications. We conduct extensive evaluation experiments on advanced open-source and proprietary models, revealing an interesting insight: MLLMs tend to prioritize matching the target silhouette while neglecting geometric constraints.
arXiv Detail & Related papers (2026-01-23T07:35:05Z) - PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning [49.66437612420291]
PoseGAM is a geometry-aware multi-view framework that directly predicts object pose from a query image and multiple template images. We construct a large-scale synthetic dataset containing more than 190k objects under diverse environmental conditions.
arXiv Detail & Related papers (2025-12-11T17:29:25Z) - Gaussian Alignment for Relative Camera Pose Estimation via Single-View Reconstruction [18.936573991468926]
GARPS is a training-free framework that casts this problem as the direct alignment of two independently reconstructed 3D scenes. It refines an initial pose from a feed-forward two-view pose estimator by optimising a differentiable GMM alignment objective. Experiments on the Real-Estate10K dataset demonstrate that GARPS outperforms both classical and state-of-the-art learning-based methods.
arXiv Detail & Related papers (2025-09-17T02:57:34Z) - GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra [33.53387523266523]
We introduce GIQ, a benchmark specifically designed to evaluate the geometric reasoning capabilities of vision and vision-language foundation models. GIQ comprises synthetic and real-world images of 224 diverse polyhedra.
arXiv Detail & Related papers (2025-06-09T20:11:21Z) - Mono3R: Exploiting Monocular Cues for Geometric 3D Reconstruction [11.220655907305515]
We introduce a monocular-guided refinement module that integrates monocular geometric priors into multi-view reconstruction frameworks. Our method achieves substantial improvements in both multi-view camera pose estimation and point cloud accuracy.
arXiv Detail & Related papers (2025-04-18T02:33:12Z) - UNOPose: Unseen Object Pose Estimation with an Unposed RGB-D Reference Image [86.7128543480229]
Unseen object pose estimation methods often rely on CAD models or multiple reference views. To simplify reference acquisition, we aim to estimate the unseen object's pose through a single unposed RGB-D reference image. We present a novel approach and benchmark, termed UNOPose, for unseen one-reference-based object pose estimation.
arXiv Detail & Related papers (2024-11-25T05:36:00Z) - MFP3D: Monocular Food Portion Estimation Leveraging 3D Point Clouds [7.357322789192671]
In this paper, we introduce a new framework for accurate food portion estimation using only a single monocular image.
The framework consists of three key modules: (1) a 3D Reconstruction Module that generates a 3D point cloud representation of the food from the 2D image, (2) a Feature Extraction Module that extracts and represents features from both the 3D point cloud and the 2D RGB image, and (3) a Portion Regression Module that employs a deep regression model to estimate the food's volume and energy content.
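Once module (1) has produced a 3D point cloud of the food, a simple way to go from geometry to a volume estimate is voxelisation: count the occupied cells of a regular grid. This is a minimal sketch of that idea under our own assumptions (coordinates in metres, a hypothetical `voxel_volume` helper), not the MFP3D pipeline, which instead feeds point-cloud and RGB features to a learned regression module.

```python
def voxel_volume(points, voxel_size=0.005):
    """Approximate the volume (m^3) spanned by a 3D point cloud by
    snapping each point to a 5 mm voxel grid and counting occupied cells."""
    # Floor-divide each coordinate by the voxel edge length to get an
    # integer grid index; a set deduplicates points in the same cell.
    occupied = {tuple(int(c // voxel_size) for c in p) for p in points}
    return len(occupied) * voxel_size ** 3

# Toy usage: two nearby points fall in one voxel; a farther point adds a second.
print(voxel_volume([(0.0, 0.0, 0.0), (0.001, 0.001, 0.0)]))  # one cell
print(voxel_volume([(0.0, 0.0, 0.0), (0.006, 0.0, 0.0)]))    # two cells
```

The estimate converges to the true enclosed volume only when the cloud densely covers the interior; in practice, surface-only reconstructions need hole filling or a convex-hull/mesh volume instead.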
arXiv Detail & Related papers (2024-11-14T22:17:27Z) - FrozenRecon: Pose-free 3D Scene Reconstruction with Frozen Depth Models [67.96827539201071]
We propose a novel test-time optimization approach for 3D scene reconstruction.
Our method achieves state-of-the-art cross-dataset reconstruction on five zero-shot testing datasets.
arXiv Detail & Related papers (2023-08-10T17:55:02Z) - Single-view 3D Mesh Reconstruction for Seen and Unseen Categories [69.29406107513621]
Single-view 3D Mesh Reconstruction is a fundamental computer vision task that aims at recovering 3D shapes from single-view RGB images.
This paper tackles Single-view 3D Mesh Reconstruction, to study the model generalization on unseen categories.
We propose an end-to-end two-stage network, GenMesh, to break the category boundaries in reconstruction.
arXiv Detail & Related papers (2022-08-04T14:13:35Z) - Single View Metrology in the Wild [94.7005246862618]
We present a novel approach to single view metrology that can recover the absolute scale of a scene represented by 3D heights of objects or camera height above the ground.
Our method relies on data-driven priors learned by a deep network specifically designed to imbibe weakly supervised constraints from the interplay of the unknown camera with 3D entities such as object heights.
We demonstrate state-of-the-art qualitative and quantitative results on several datasets as well as applications including virtual object insertion.
arXiv Detail & Related papers (2020-07-18T22:31:33Z)