PortionNet: Distilling 3D Geometric Knowledge for Food Nutrition Estimation
- URL: http://arxiv.org/abs/2512.22304v1
- Date: Fri, 26 Dec 2025 04:50:57 GMT
- Title: PortionNet: Distilling 3D Geometric Knowledge for Food Nutrition Estimation
- Authors: Darrin Bright, Rakshith Raj, Kanchan Keisham,
- Abstract summary: PortionNet is a novel framework that learns geometric features from point clouds during training while requiring only RGB images at inference. Our approach employs a dual-mode training strategy where a lightweight adapter network mimics point cloud representations, enabling pseudo-3D reasoning without any specialized hardware requirements. PortionNet achieves state-of-the-art performance on MetaFood3D, outperforming all previous methods in both volume and energy estimation.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate food nutrition estimation from single images is challenging due to the loss of 3D information. While depth-based methods provide reliable geometry, they remain inaccessible on most smartphones because of depth-sensor requirements. To overcome this challenge, we propose PortionNet, a novel cross-modal knowledge distillation framework that learns geometric features from point clouds during training while requiring only RGB images at inference. Our approach employs a dual-mode training strategy where a lightweight adapter network mimics point cloud representations, enabling pseudo-3D reasoning without any specialized hardware requirements. PortionNet achieves state-of-the-art performance on MetaFood3D, outperforming all previous methods in both volume and energy estimation. Cross-dataset evaluation on SimpleFood45 further demonstrates strong generalization in energy estimation.
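The abstract's core idea, a lightweight adapter trained to mimic point-cloud features so that inference needs only RGB, can be illustrated with a minimal feature-mimicking sketch. This is not the authors' code: the adapter, feature dimensions, and MSE objective are illustrative assumptions about how such cross-modal distillation is commonly set up.

```python
import numpy as np

rng = np.random.default_rng(0)

def adapter(rgb_feat, W, b):
    # Hypothetical lightweight adapter: a linear projection mapping
    # RGB features into the point-cloud embedding space.
    return rgb_feat @ W + b

def distill_loss(rgb_feat, pc_feat, W, b):
    # Feature-mimicking objective: MSE between the adapted RGB features
    # and the (frozen) point-cloud teacher's features. At inference time
    # only the RGB branch and adapter are needed.
    pseudo_3d = adapter(rgb_feat, W, b)
    return float(np.mean((pseudo_3d - pc_feat) ** 2))

# Toy batch: 4 samples, 128-d RGB features, 64-d point-cloud features.
rgb = rng.standard_normal((4, 128))
pc = rng.standard_normal((4, 64))
W = rng.standard_normal((128, 64)) * 0.01
b = np.zeros(64)
loss = distill_loss(rgb, pc, W, b)
```

In a full system this distillation term would be added to the task loss (volume and energy regression) during the point-cloud-supervised training mode.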
Related papers
- VLM6D: VLM based 6Dof Pose Estimation based on RGB-D Images [7.044221981512693]
VLM6D is a novel dual-stream architecture that leverages the strengths of visual and geometric data from RGB-D input for robust and precise pose estimation. Comprehensive experiments demonstrate that VLM6D achieves new state-of-the-art performance on the challenging Occluded LineMOD benchmark.
arXiv Detail & Related papers (2025-10-31T05:26:41Z)
- VolE: A Point-cloud Framework for Food 3D Reconstruction and Volume Estimation [4.621139625109643]
We present VolE, a novel framework that leverages mobile device-driven 3D reconstruction to estimate food volume. VolE captures images and camera locations in free motion to generate precise 3D models, thanks to AR-capable mobile devices. Our experiments demonstrate that VolE outperforms existing volume estimation techniques across multiple datasets, achieving 2.22% MAPE.
arXiv Detail & Related papers (2025-05-15T12:03:05Z)
- BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence [11.91274849875519]
We introduce a novel image-centric 3D perception model, BIP3D, to overcome the limitations of point-centric methods. We leverage pre-trained 2D vision foundation models to enhance semantic understanding, and introduce a spatial enhancer module to improve spatial understanding. In our experiments, BIP3D outperforms current state-of-the-art results on the EmbodiedScan benchmark, achieving improvements of 5.69% in the 3D detection task and 15.25% in the 3D visual grounding task.
arXiv Detail & Related papers (2024-11-22T11:35:42Z)
- MFP3D: Monocular Food Portion Estimation Leveraging 3D Point Clouds [7.357322789192671]
In this paper, we introduce a new framework for accurate food estimation using only a single monocular image.
The framework consists of three key modules: (1) a 3D Reconstruction Module that generates a 3D point cloud representation of the food from the 2D image, (2) a Feature Extraction Module that extracts and represents features from both the 3D point cloud and the 2D RGB image, and (3) a Portion Regression Module that employs a deep regression model to estimate the food's volume and energy content.
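The three-module pipeline described above can be sketched as a simple function composition. This is a hypothetical skeleton, not the MFP3D implementation: the placeholder reconstructor, descriptors, and regressor stand in for learned networks, and all shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def reconstruct_point_cloud(image):
    # (1) 3D Reconstruction Module: lift the 2D image to a point cloud.
    # Placeholder: random points standing in for a learned reconstructor.
    return rng.standard_normal((1024, 3))

def extract_features(image, points):
    # (2) Feature Extraction Module: encode both modalities and fuse them.
    img_feat = image.mean(axis=(0, 1))  # toy global RGB descriptor (3-d)
    geo_feat = points.mean(axis=0)      # toy global geometric descriptor (3-d)
    return np.concatenate([img_feat, geo_feat])

def regress_portion(features, w):
    # (3) Portion Regression Module: map fused features to [volume, energy].
    return features @ w

image = rng.random((32, 32, 3))
points = reconstruct_point_cloud(image)
feats = extract_features(image, points)
w = rng.standard_normal((feats.shape[0], 2)) * 0.1
volume, energy = regress_portion(feats, w)
```

The key design point is that each module's output is the next module's input, so the point cloud only exists as an intermediate representation derived from the single monocular image.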
arXiv Detail & Related papers (2024-11-14T22:17:27Z)
- MetaFood3D: 3D Food Dataset with Nutrition Values [52.16894900096017]
This dataset consists of 743 meticulously scanned and labeled 3D food objects across 131 categories. Our MetaFood3D dataset emphasizes intra-class diversity and includes rich modalities such as textured mesh files, RGB-D videos, and segmentation masks.
arXiv Detail & Related papers (2024-09-03T15:02:52Z)
- GEOcc: Geometrically Enhanced 3D Occupancy Network with Implicit-Explicit Depth Fusion and Contextual Self-Supervision [49.839374549646884]
This paper presents GEOcc, a Geometric-Enhanced Occupancy network tailored for vision-only surround-view perception. Our approach achieves state-of-the-art performance on the Occ3D-nuScenes dataset while requiring the lowest image resolution and the lightest image backbone.
arXiv Detail & Related papers (2024-05-17T07:31:20Z)
- NutritionVerse-Thin: An Optimized Strategy for Enabling Improved Rendering of 3D Thin Food Models [66.77685168785152]
We present an optimized strategy for enabling improved rendering of thin 3D food models.
Our method generates the 3D model mesh via a proposed thin-object-optimized differentiable reconstruction method.
While simple, we find that this technique can be employed for quick and highly consistent capturing of thin 3D objects.
arXiv Detail & Related papers (2023-04-12T05:34:32Z)
- Geometric-aware Pretraining for Vision-centric 3D Object Detection [77.7979088689944]
We propose a novel geometric-aware pretraining framework called GAPretrain.
GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors.
We achieve 46.2 mAP and 55.5 NDS on the nuScenes val set using the BEVFormer method, with a gain of 2.7 and 2.1 points, respectively.
arXiv Detail & Related papers (2023-04-06T14:33:05Z)
- PointMCD: Boosting Deep Point Cloud Encoders via Multi-view Cross-modal Distillation for 3D Shape Recognition [55.38462937452363]
We propose a unified multi-view cross-modal distillation architecture, including a pretrained deep image encoder as the teacher and a deep point encoder as the student.
By pair-wise aligning multi-view visual and geometric descriptors, we can obtain more powerful deep point encoders without exhaustive and complicated network modifications.
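The pair-wise alignment idea can be sketched with a cosine-based objective between the image teacher's per-view descriptors and the point encoder's descriptor. This is an illustrative assumption about the alignment, not PointMCD's actual loss; the view count and feature dimension are made up.

```python
import numpy as np

rng = np.random.default_rng(2)

def cosine_align_loss(teacher_views, student_feat):
    # Pair-wise alignment: for each rendered view, penalize the angular
    # gap between the image teacher's descriptor and the point encoder's
    # (student) descriptor. The loss is 0 when they are perfectly aligned.
    losses = []
    for t in teacher_views:
        cos = t @ student_feat / (np.linalg.norm(t) * np.linalg.norm(student_feat))
        losses.append(1.0 - cos)
    return float(np.mean(losses))

views = rng.standard_normal((6, 256))   # 6 rendered views, 256-d teacher descriptors
student = rng.standard_normal(256)      # point encoder (student) descriptor
loss = cosine_align_loss(views, student)
```

Minimizing this term pulls the point encoder's representation toward the pretrained image teacher's, which is the cross-modal distillation effect the abstract describes.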
arXiv Detail & Related papers (2022-07-07T07:23:20Z)
- GDRNPP: A Geometry-guided and Fully Learning-based Object Pose Estimator [51.89441403642665]
6D pose estimation of rigid objects is a long-standing and challenging task in computer vision. Recently, the emergence of deep learning has revealed the potential of Convolutional Neural Networks (CNNs) to predict reliable 6D poses. This paper introduces a fully learning-based object pose estimator.
arXiv Detail & Related papers (2021-02-24T09:11:31Z)
- Partially Supervised Multi-Task Network for Single-View Dietary Assessment [1.8907108368038217]
We propose a network architecture that jointly performs geometric understanding (i.e., depth prediction and 3D plane estimation) and semantic prediction on a single food image.
For the training of the network, only monocular videos with semantic ground truth are required, while the depth map and 3D plane ground truth are no longer needed.
Experimental results on two separate food image databases demonstrate that our method performs robustly on texture-less scenarios.
arXiv Detail & Related papers (2020-07-15T08:53:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.