Implicit-Scale 3D Reconstruction for Multi-Food Volume Estimation from Monocular Images
- URL: http://arxiv.org/abs/2602.13041v1
- Date: Fri, 13 Feb 2026 15:52:39 GMT
- Title: Implicit-Scale 3D Reconstruction for Multi-Food Volume Estimation from Monocular Images
- Authors: Yuhao Chen, Gautham Vinod, Siddeshwar Raghavan, Talha Ibn Mahmud, Bruce Coburn, Jinge Ma, Fengqing Zhu, Jiangpeng He
- Abstract summary: Implicit-Scale 3D Reconstruction from Monocular Multi-Food Images is a benchmark dataset designed to advance geometry-based food portion estimation. This benchmark reframes food portion estimation as an implicit-scale 3D reconstruction problem under monocular observations.
- Score: 21.112563168240737
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Implicit-Scale 3D Reconstruction from Monocular Multi-Food Images, a benchmark dataset designed to advance geometry-based food portion estimation in realistic dining scenarios. Existing dietary assessment methods largely rely on single-image analysis or appearance-based inference, including recent vision-language models, which lack explicit geometric reasoning and are sensitive to scale ambiguity. This benchmark reframes food portion estimation as an implicit-scale 3D reconstruction problem under monocular observations. To reflect real-world conditions, explicit physical references and metric annotations are removed; instead, contextual objects such as plates and utensils are provided, requiring algorithms to infer scale from implicit cues and prior knowledge. The dataset emphasizes multi-food scenes with diverse object geometries, frequent occlusions, and complex spatial arrangements. The benchmark was adopted as a challenge at the MetaFood 2025 Workshop, where multiple teams proposed reconstruction-based solutions. Experimental results show that while strong vision-language baselines achieve competitive performance, geometry-based reconstruction methods provide both improved accuracy and greater robustness, with the top-performing approach achieving 0.21 MAPE in volume estimation and 5.7 L1 Chamfer Distance in geometric accuracy.
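The two headline metrics, MAPE for volume estimation and L1 Chamfer Distance for geometric accuracy, can be sketched in a few lines. This is an illustrative reimplementation under standard definitions, not the benchmark's official evaluation code; the function names and the brute-force nearest-neighbour search are our own.

```python
def mape(pred_volumes, true_volumes):
    """Mean Absolute Percentage Error as a fraction (0.21 means 21%)."""
    return sum(abs(p - t) / abs(t)
               for p, t in zip(pred_volumes, true_volumes)) / len(true_volumes)

def l1_chamfer(points_a, points_b):
    """Symmetric L1 (Manhattan) Chamfer Distance between two 3D point sets."""
    def one_way(src, dst):
        # For each source point, find the L1 distance to its nearest
        # neighbour in the destination set, then average.
        total = 0.0
        for p in src:
            total += min(sum(abs(pc - qc) for pc, qc in zip(p, q)) for q in dst)
        return total / len(src)
    return one_way(points_a, points_b) + one_way(points_b, points_a)

# Toy usage: a prediction of 210 mL against a ground truth of 250 mL.
print(mape([210.0], [250.0]))                              # 0.16
print(l1_chamfer([(0, 0, 0), (1, 1, 1)], [(0, 0, 0), (1, 1, 1)]))  # 0.0
```

Identical point clouds score zero Chamfer Distance, so lower is better for both metrics.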
Related papers
- Size Matters: Reconstructing Real-Scale 3D Models from Monocular Images for Food Portion Estimation [19.138014263791803]
We bridge the gap between 3D computer vision and digital health by proposing a method that recovers a true-to-scale 3D reconstructed object from a monocular image. Our approach leverages rich visual features extracted from models trained on large-scale datasets to estimate the scale of the reconstructed object.
arXiv Detail & Related papers (2026-01-27T20:53:45Z) - TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning [104.66714520975837]
We introduce a geometry-grounded benchmark designed to evaluate compositional spatial reasoning through the lens of the classic Tangram game. We propose the Tangram Construction Expression (TCE), a symbolic geometric framework that grounds tangram assemblies in exact, machine-verifiable coordinate specifications. We conduct extensive evaluation experiments on advanced open-source and proprietary models, revealing an interesting insight: MLLMs tend to prioritize matching the target silhouette while neglecting geometric constraints.
arXiv Detail & Related papers (2026-01-23T07:35:05Z) - PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning [49.66437612420291]
PoseGAM is a geometry-aware multi-view framework that directly predicts object pose from a query image and multiple template images. We construct a large-scale synthetic dataset containing more than 190k objects under diverse environmental conditions.
arXiv Detail & Related papers (2025-12-11T17:29:25Z) - Gaussian Alignment for Relative Camera Pose Estimation via Single-View Reconstruction [18.936573991468926]
GARPS is a training-free framework that casts this problem as the direct alignment of two independently reconstructed 3D scenes. It refines an initial pose from a feed-forward two-view pose estimator by optimising a differentiable GMM alignment objective. Experiments on the Real-Estate10K dataset demonstrate that GARPS outperforms both classical and state-of-the-art learning-based methods.
arXiv Detail & Related papers (2025-09-17T02:57:34Z) - GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra [33.53387523266523]
We introduce GIQ, a benchmark specifically designed to evaluate the geometric reasoning capabilities of vision and vision-language foundation models. GIQ comprises synthetic and real-world images of 224 diverse polyhedra.
arXiv Detail & Related papers (2025-06-09T20:11:21Z) - Mono3R: Exploiting Monocular Cues for Geometric 3D Reconstruction [11.220655907305515]
We introduce a monocular-guided refinement module that integrates monocular geometric priors into multi-view reconstruction frameworks. Our method achieves substantial improvements in both multi-view camera pose estimation and point cloud accuracy.
arXiv Detail & Related papers (2025-04-18T02:33:12Z) - UNOPose: Unseen Object Pose Estimation with an Unposed RGB-D Reference Image [86.7128543480229]
Unseen object pose estimation methods often rely on CAD models or multiple reference views. To simplify reference acquisition, we aim to estimate the unseen object's pose through a single unposed RGB-D reference image. We present a novel approach and benchmark, termed UNOPose, for unseen one-reference-based object pose estimation.
arXiv Detail & Related papers (2024-11-25T05:36:00Z) - MFP3D: Monocular Food Portion Estimation Leveraging 3D Point Clouds [7.357322789192671]
In this paper, we introduce a new framework for accurate food portion estimation using only a single monocular image.
The framework consists of three key modules: (1) a 3D Reconstruction Module that generates a 3D point cloud representation of the food from the 2D image, (2) a Feature Extraction Module that extracts and represents features from both the 3D point cloud and the 2D RGB image, and (3) a Portion Regression Module that employs a deep regression model to estimate the food's volume and energy content.
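Once module (1) has produced a 3D point cloud of the food, a simple way to go from geometry to a volume estimate is voxelisation: count the occupied cells of a regular grid. This is a minimal sketch of that idea under our own assumptions (coordinates in metres, a hypothetical `voxel_volume` helper), not the MFP3D pipeline, which instead feeds point-cloud and RGB features to a learned regression module.

```python
def voxel_volume(points, voxel_size=0.005):
    """Approximate the volume (m^3) spanned by a 3D point cloud by
    snapping each point to a 5 mm voxel grid and counting occupied cells."""
    # Floor-divide each coordinate by the voxel edge length to get an
    # integer grid index; a set deduplicates points in the same cell.
    occupied = {tuple(int(c // voxel_size) for c in p) for p in points}
    return len(occupied) * voxel_size ** 3

# Toy usage: two nearby points fall in one voxel; a farther point adds a second.
print(voxel_volume([(0.0, 0.0, 0.0), (0.001, 0.001, 0.0)]))  # one cell
print(voxel_volume([(0.0, 0.0, 0.0), (0.006, 0.0, 0.0)]))    # two cells
```

The estimate converges to the true enclosed volume only when the cloud densely covers the interior; in practice, surface-only reconstructions need hole filling or a convex-hull/mesh volume instead.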
arXiv Detail & Related papers (2024-11-14T22:17:27Z) - FrozenRecon: Pose-free 3D Scene Reconstruction with Frozen Depth Models [67.96827539201071]
We propose a novel test-time optimization approach for 3D scene reconstruction.
Our method achieves state-of-the-art cross-dataset reconstruction on five zero-shot testing datasets.
arXiv Detail & Related papers (2023-08-10T17:55:02Z) - Single-view 3D Mesh Reconstruction for Seen and Unseen Categories [69.29406107513621]
Single-view 3D Mesh Reconstruction is a fundamental computer vision task that aims at recovering 3D shapes from single-view RGB images.
This paper tackles Single-view 3D Mesh Reconstruction, to study the model generalization on unseen categories.
We propose an end-to-end two-stage network, GenMesh, to break the category boundaries in reconstruction.
arXiv Detail & Related papers (2022-08-04T14:13:35Z) - Single View Metrology in the Wild [94.7005246862618]
We present a novel approach to single view metrology that can recover the absolute scale of a scene represented by 3D heights of objects or camera height above the ground.
Our method relies on data-driven priors learned by a deep network specifically designed to imbibe weakly supervised constraints from the interplay of the unknown camera with 3D entities such as object heights.
We demonstrate state-of-the-art qualitative and quantitative results on several datasets as well as applications including virtual object insertion.
arXiv Detail & Related papers (2020-07-18T22:31:33Z)