VSRD++: Autolabeling for 3D Object Detection via Instance-Aware Volumetric Silhouette Rendering
- URL: http://arxiv.org/abs/2512.01178v1
- Date: Mon, 01 Dec 2025 01:28:35 GMT
- Title: VSRD++: Autolabeling for 3D Object Detection via Instance-Aware Volumetric Silhouette Rendering
- Authors: Zihua Liu, Hiroki Sakuma, Masatoshi Okutomi
- Abstract summary: VSRD++ is a novel weakly supervised framework for monocular 3D object detection. It eliminates the reliance on 3D annotations and leverages neural-field-based volumetric rendering. In the monocular 3D object detection phase, the optimized 3D bounding boxes serve as pseudo labels.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Monocular 3D object detection is a fundamental yet challenging task in 3D scene understanding. Existing approaches heavily depend on supervised learning with extensive 3D annotations, which are often acquired from LiDAR point clouds through labor-intensive labeling processes. To tackle this problem, we propose VSRD++, a novel weakly supervised framework for monocular 3D object detection that eliminates the reliance on 3D annotations and leverages neural-field-based volumetric rendering with weak 2D supervision. VSRD++ consists of a two-stage pipeline: multi-view 3D autolabeling and subsequent monocular 3D detector training. In the multi-view autolabeling stage, object surfaces are represented as signed distance fields (SDFs) and rendered as instance masks via the proposed instance-aware volumetric silhouette rendering. To optimize 3D bounding boxes, we decompose each instance's SDF into a cuboid SDF and a residual distance field (RDF) that captures deviations from the cuboid. To address the geometric inconsistency commonly observed when volume rendering methods are applied to dynamic objects, we model dynamic objects by incorporating velocity into the bounding box attributes and assigning a confidence to each pseudo-label. Moreover, we employ a 3D attribute initialization module to initialize the dynamic bounding box parameters. In the monocular 3D object detection phase, the optimized 3D bounding boxes serve as pseudo labels for training monocular 3D object detectors. Extensive experiments on the KITTI-360 dataset demonstrate that VSRD++ significantly outperforms existing weakly supervised approaches for monocular 3D object detection on both static and dynamic scenes. Code is available at https://github.com/Magicboomliu/VSRD_plus_plus
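To make the abstract's core machinery concrete: the cuboid part of the decomposed SDF is the standard signed distance to an oriented box, and a silhouette value for one pixel can be obtained by alpha-compositing densities derived from SDF samples along its ray. The sketch below is a minimal NumPy illustration under stated assumptions: the function names are ours, `rdf` is a placeholder for the learned residual network, and the Laplace-CDF density is a VolSDF-style stand-in rather than VSRD++'s exact rendering formulation.

```python
import numpy as np

def cuboid_sdf(points, center, half_extents, rotation):
    """Signed distance from points (N, 3) to an oriented cuboid
    (negative inside, positive outside -- the standard box SDF)."""
    local = (points - center) @ rotation           # express points in the box frame
    q = np.abs(local) - half_extents               # per-axis distance to the faces
    outside = np.linalg.norm(np.maximum(q, 0.0), axis=-1)
    inside = np.minimum(np.max(q, axis=-1), 0.0)
    return outside + inside

def instance_sdf(points, center, half_extents, rotation, rdf):
    # Decomposition described in the abstract: cuboid SDF plus a residual
    # distance field capturing deviations from the cuboid. Here `rdf` is
    # just a callable placeholder for the learned residual network.
    return cuboid_sdf(points, center, half_extents, rotation) + rdf(points)

def silhouette_along_ray(sdf_vals, deltas, beta=0.1):
    """Render one ray's silhouette value by alpha-compositing densities
    derived from SDF samples. The Laplace-CDF density below is a
    VolSDF-style stand-in, not necessarily the paper's exact choice."""
    density = np.where(sdf_vals <= 0,
                       (1.0 - 0.5 * np.exp(sdf_vals / beta)) / beta,
                       0.5 * np.exp(-sdf_vals / beta) / beta)
    alpha = 1.0 - np.exp(-density * deltas)        # per-sample opacity
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha[:-1])))
    return float(np.sum(trans * alpha))            # expected opacity in [0, 1]

# Axis-aligned unit cube at the origin (identity rotation):
pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
print(cuboid_sdf(pts, np.zeros(3), np.full(3, 0.5), np.eye(3)))  # → [-0.5  0.5]
```

In the autolabeling stage, such rendered silhouettes would be compared against 2D instance masks, so the weak 2D supervision can back-propagate to the box parameters and the residual field.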
Related papers
- LabelAny3D: Label Any Object 3D in the Wild
COCO3D is a new benchmark for open-vocabulary monocular 3D detection, derived from the MS-COCO dataset. We introduce LabelAny3D, an analysis-by-synthesis framework that reconstructs holistic 3D scenes from 2D images to efficiently produce high-quality 3D bounding box annotations.
arXiv Detail & Related papers (2026-01-04T22:03:45Z)
- Sparse Multiview Open-Vocabulary 3D Detection
3D object detection has traditionally been solved by training to detect a fixed set of categories. In this work, we investigate open-vocabulary 3D object detection in the challenging yet practical sparse-view setting. Our approach is training-free, relying on pre-trained, off-the-shelf 2D foundation models instead of employing computationally expensive 3D feature fusion.
arXiv Detail & Related papers (2025-09-19T12:22:24Z)
- 3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection
We introduce the first end-to-end 3D Monocular Open-set Object Detector (3D-MOOD). We lift open-set 2D detection into 3D space through our designed 3D bounding box head, and condition the object queries with a geometry prior to improve the generalization of 3D estimation across diverse scenes.
arXiv Detail & Related papers (2025-07-31T13:56:41Z)
- PLOT: Pseudo-Labeling via Video Object Tracking for Scalable Monocular 3D Object Detection
Monocular 3D object detection (M3OD) has long faced challenges due to data scarcity caused by high annotation costs and inherent 2D-to-3D ambiguity. We propose a novel pseudo-labeling framework that uses only video data and is more robust to occlusion, without requiring a multi-view setup, additional sensors, camera poses, or domain-specific training.
arXiv Detail & Related papers (2025-07-03T07:46:39Z)
- SP3D: Boosting Sparsely-Supervised 3D Object Detection via Accurate Cross-Modal Semantic Prompts
Sparsely-supervised 3D object detection has gained great attention, achieving performance close to fully-supervised 3D detectors. We propose a boosting strategy, termed SP3D, to boost the 3D detector with robust feature discrimination capability under sparse annotation settings. Experiments have validated that SP3D can enhance the performance of sparsely supervised detectors by a large margin under meager labeling conditions.
arXiv Detail & Related papers (2025-03-09T06:08:04Z)
- V-MIND: Building Versatile Monocular Indoor 3D Detector with Diverse 2D Annotations
V-MIND (Versatile Monocular INdoor Detector) enhances the performance of indoor 3D detectors across a diverse set of object classes. We generate 3D training data by converting large-scale 2D images into 3D point clouds and subsequently deriving pseudo 3D bounding boxes. V-MIND achieves state-of-the-art object detection performance across a wide range of classes on the Omni3D indoor dataset.
arXiv Detail & Related papers (2024-12-16T03:28:00Z)
- Training an Open-Vocabulary Monocular 3D Object Detection Model without 3D Data
We propose a novel open-vocabulary monocular 3D object detection framework, dubbed OVM3D-Det.
OVM3D-Det does not require high-precision LiDAR or 3D sensor data for either input or generating 3D bounding boxes.
It employs open-vocabulary 2D models and pseudo-LiDAR to automatically label 3D objects in RGB images, fostering the learning of open-vocabulary monocular 3D detectors.
arXiv Detail & Related papers (2024-11-23T21:37:21Z)
- VSRD: Instance-Aware Volumetric Silhouette Rendering for Weakly Supervised 3D Object Detection
Monocular 3D object detection poses a significant challenge in 3D scene understanding.
Existing methods heavily rely on supervised learning using abundant 3D labels.
We propose a novel weakly supervised 3D object detection framework named VSRD.
arXiv Detail & Related papers (2024-03-29T20:43:55Z)
- Weakly Supervised 3D Object Detection via Multi-Level Visual Guidance
We propose a framework to study how to leverage constraints between 2D and 3D domains without requiring any 3D labels.
First, we design a feature-level constraint to align LiDAR and image features based on object-aware regions.
Second, the output-level constraint is developed to enforce the overlap between 2D and projected 3D box estimations.
Third, the training-level constraint is utilized by producing accurate and consistent 3D pseudo-labels that align with the visual data.
arXiv Detail & Related papers (2023-12-12T18:57:25Z)
- Tracking Objects with 3D Representation from Videos
With 3D object representation learning from pseudo 3D object labels in monocular videos, we propose a new 2D Multiple Object Tracking (MOT) paradigm, called P3DTrack.
arXiv Detail & Related papers (2023-06-08T17:58:45Z)
- OA-DET3D: Embedding Object Awareness as a General Plug-in for Multi-Camera 3D Object Detection
We introduce OA-DET3D, a plug-in module that improves 3D object detection. OA-DET3D boosts the representation of objects by leveraging object-centric depth information and foreground pseudo points. We conduct extensive experiments on the nuScenes dataset and Argoverse 2 dataset to validate the merits of OA-DET3D.
arXiv Detail & Related papers (2023-01-13T06:02:31Z)
- CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection
We propose the Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z)
- FGR: Frustum-Aware Geometric Reasoning for Weakly Supervised 3D Vehicle Detection
We propose frustum-aware geometric reasoning (FGR) to detect vehicles in point clouds without any 3D annotations.
Our method consists of two stages: coarse 3D segmentation and 3D bounding box estimation.
It is able to accurately detect objects in 3D space with only 2D bounding boxes and sparse point clouds.
arXiv Detail & Related papers (2021-05-17T07:29:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.