Related papers: MVAT: Multi-View Aware Teacher for Weakly Supervised 3D Object Detection

MVAT: Multi-View Aware Teacher for Weakly Supervised 3D Object Detection

URL: http://arxiv.org/abs/2509.07507v1
Date: Tue, 09 Sep 2025 08:40:54 GMT
Title: MVAT: Multi-View Aware Teacher for Weakly Supervised 3D Object Detection
Authors: Saad Lahlali, Alexandre Fournier Montgieux, Nicolas Granger, Hervé Le Borgne, Quoc Cuong Pham,
Abstract summary: Annotating 3D data remains a costly bottleneck for 3D object detection.<n>We propose MVAT, a novel framework that leverages temporal multi-view present in sequential data to address these challenges.<n>Our approach aggregates object-centric point clouds across time to build 3D object representations as dense and complete as possible.
Score: 42.38502124189271
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Annotating 3D data remains a costly bottleneck for 3D object detection, motivating the development of weakly supervised annotation methods that rely on more accessible 2D box annotations. However, relying solely on 2D boxes introduces projection ambiguities since a single 2D box can correspond to multiple valid 3D poses. Furthermore, partial object visibility under a single viewpoint setting makes accurate 3D box estimation difficult. We propose MVAT, a novel framework that leverages temporal multi-view present in sequential data to address these challenges. Our approach aggregates object-centric point clouds across time to build 3D object representations as dense and complete as possible. A Teacher-Student distillation paradigm is employed: The Teacher network learns from single viewpoints but targets are derived from temporally aggregated static objects. Then the Teacher generates high quality pseudo-labels that the Student learns to predict from a single viewpoint for both static and moving objects. The whole framework incorporates a multi-view 2D projection loss to enforce consistency between predicted 3D boxes and all available 2D annotations. Experiments on the nuScenes and Waymo Open datasets demonstrate that MVAT achieves state-of-the-art performance for weakly supervised 3D object detection, significantly narrowing the gap with fully supervised methods without requiring any 3D box annotations. % \footnote{Code available upon acceptance} Our code is available in our public repository (\href{https://github.com/CEA-LIST/MVAT}{code}).

Related papers

VSRD++: Autolabeling for 3D Object Detection via Instance-Aware Volumetric Silhouette Rendering [18.77072205559739]
VSRD++ is a novel weakly supervised framework for monocular 3D object detection.<n>It eliminates the reliance on 3D annotations and leverages neural-field-based volumetric rendering.<n>In the monocular 3D object detection phase, the optimized 3D bounding boxes serve as pseudo labels.
arXiv Detail & Related papers (2025-12-01T01:28:35Z)
PLOT: Pseudo-Labeling via Video Object Tracking for Scalable Monocular 3D Object Detection [35.524943073010675]
Monocular 3D object detection (M3OD) has long faced challenges due to data scarcity caused by high annotation costs and inherent 2D-to-3D ambiguity.<n>We propose a novel pseudo-labeling framework that uses only video data and is more robust to occlusion, without requiring a multi-view setup, additional sensors, camera poses, or domain-specific training.
arXiv Detail & Related papers (2025-07-03T07:46:39Z)
General Geometry-aware Weakly Supervised 3D Object Detection [62.26729317523975]
A unified framework is developed for learning 3D object detectors from RGB images and associated 2D boxes. Experiments on KITTI and SUN-RGBD datasets demonstrate that our method yields surprisingly high-quality 3D bounding boxes with only 2D annotation.
arXiv Detail & Related papers (2024-07-18T17:52:08Z)
Weakly Supervised 3D Object Detection via Multi-Level Visual Guidance [72.6809373191638]
We propose a framework to study how to leverage constraints between 2D and 3D domains without requiring any 3D labels. Specifically, we design a feature-level constraint to align LiDAR and image features based on object-aware regions. Second, the output-level constraint is developed to enforce the overlap between 2D and projected 3D box estimations. Third, the training-level constraint is utilized by producing accurate and consistent 3D pseudo-labels that align with the visual data.
arXiv Detail & Related papers (2023-12-12T18:57:25Z)
Weakly Supervised 3D Object Detection with Multi-Stage Generalization [62.96670547848691]
We introduce BA$2$-Det, encompassing pseudo label generation and multi-stage generalization. We develop three stages of generalization: progressing from complete to partial, static to dynamic, and close to distant. BA$2$-Det can achieve a 20% relative improvement on the KITTI dataset.
arXiv Detail & Related papers (2023-06-08T17:58:57Z)
Tracking Objects with 3D Representation from Videos [57.641129788552675]
We propose a new 2D Multiple Object Tracking paradigm, called P3DTrack. With 3D object representation learning from Pseudo 3D object labels in monocular videos, we propose a new 2D MOT paradigm, called P3DTrack.
arXiv Detail & Related papers (2023-06-08T17:58:45Z)
Weakly Supervised Monocular 3D Object Detection using Multi-View Projection and Direction Consistency [78.76508318592552]
Monocular 3D object detection has become a mainstream approach in automatic driving for its easy application. Most current methods still rely on 3D point cloud data for labeling the ground truths used in the training phase. We propose a new weakly supervised monocular 3D objection detection method, which can train the model with only 2D labels marked on images.
arXiv Detail & Related papers (2023-03-15T15:14:00Z)
Weakly Supervised 3D Object Detection from Point Clouds [27.70180601788613]
3D object detection aims to detect and localize the 3D bounding boxes of objects belonging to specific classes. Existing 3D object detectors rely on annotated 3D bounding boxes during training, while these annotations could be expensive to obtain and only accessible in limited scenarios. We propose VS3D, a framework for weakly supervised 3D object detection from point clouds without using any ground truth 3D bounding box for training.
arXiv Detail & Related papers (2020-07-28T03:30:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.