CA-W3D: Leveraging Context-Aware Knowledge for Weakly Supervised Monocular 3D Detection
- URL: http://arxiv.org/abs/2503.04154v2
- Date: Sun, 03 Aug 2025 03:19:51 GMT
- Title: CA-W3D: Leveraging Context-Aware Knowledge for Weakly Supervised Monocular 3D Detection
- Authors: Chupeng Liu, Runkai Zhao, Weidong Cai
- Abstract summary: We propose a Context-Aware Weak Supervision for Monocular 3D object detection, namely CA-W3D, to address this limitation in a two-stage training paradigm. Specifically, we first introduce a pre-training stage employing Region-wise Object Contrastive Matching (ROCM), which aligns regional object embeddings derived from a trainable monocular 3D encoder and a frozen open-vocabulary 2D visual grounding model. In the second stage, we incorporate a pseudo-label training process with a Dual-to-One Distillation (D2OD) mechanism, which effectively transfers contextual priors into the monocular encoder while preserving spatial fidelity and maintaining computational efficiency during inference.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Weakly supervised monocular 3D detection, while less annotation-intensive, often struggles to capture the global context required for reliable 3D reasoning. Conventional label-efficient methods focus on object-centric features, neglecting contextual semantic relationships that are critical in complex scenes. In this work, we propose a Context-Aware Weak Supervision for Monocular 3D object detection, namely CA-W3D, to address this limitation in a two-stage training paradigm. Specifically, we first introduce a pre-training stage employing Region-wise Object Contrastive Matching (ROCM), which aligns regional object embeddings derived from a trainable monocular 3D encoder and a frozen open-vocabulary 2D visual grounding model. This alignment encourages the monocular encoder to discriminate scene-specific attributes and acquire richer contextual knowledge. In the second stage, we incorporate a pseudo-label training process with a Dual-to-One Distillation (D2OD) mechanism, which effectively transfers contextual priors into the monocular encoder while preserving spatial fidelity and maintaining computational efficiency during inference. Extensive experiments conducted on the public KITTI benchmark demonstrate the effectiveness of our approach, surpassing the SoTA method over all metrics, highlighting the importance of contextual-aware knowledge in weakly-supervised monocular 3D detection.
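The pre-training stage described above aligns per-region student embeddings (from the trainable monocular 3D encoder) with per-region teacher embeddings (from the frozen 2D grounding model). A minimal sketch of such region-wise contrastive matching is a symmetric InfoNCE loss over matched region pairs; the function name, the temperature value, and the plain-list embedding format below are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of Region-wise Object Contrastive Matching (ROCM):
# embeddings of the same object region from the trainable (student) and
# frozen (teacher) encoders form positive pairs; all other cross-region
# pairs are negatives. Symmetric InfoNCE pulls positives together.
import math


def rocm_contrastive_loss(student, teacher, temperature=0.07):
    """student, teacher: equal-length lists of embedding vectors,
    one per matched object region; returns a scalar InfoNCE loss."""

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def normalize(v):
        norm = math.sqrt(dot(v, v)) or 1.0  # guard against zero vectors
        return [x / norm for x in v]

    s = [normalize(v) for v in student]
    t = [normalize(v) for v in teacher]
    n = len(s)

    # cosine-similarity logits: student region i vs teacher region j
    logits = [[dot(s[i], t[j]) / temperature for j in range(n)]
              for i in range(n)]

    def cross_entropy(row, target):
        # numerically stable -log softmax(row)[target]
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        return log_z - row[target]

    # matched pairs (i, i) are the positives in both directions
    loss_s2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2s = sum(cross_entropy(cols[j], j) for j in range(n)) / n
    return 0.5 * (loss_s2t + loss_t2s)
```

Because the teacher is frozen, only the student encoder receives gradients in practice, which is how the contextual knowledge of the 2D grounding model is transferred without being needed at inference time.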
Related papers
- VOIC: Visible-Occluded Decoupling for Monocular 3D Semantic Scene Completion [6.144392125326462]
Camera-based 3D Semantic Scene Completion is a critical task for autonomous driving and robotic scene understanding.
Existing methods typically focus on end-to-end 2D-to-3D feature lifting and voxel completion.
We propose a novel dual-decoder framework that explicitly decouples SSC into visible-region semantic perception and occluded-region scene completion.
arXiv Detail & Related papers (2025-12-22T02:05:45Z)
- IDEAL-M3D: Instance Diversity-Enriched Active Learning for Monocular 3D Detection [42.50500002758336]
IDEAL-M3D is the first instance-level active learning pipeline for monocular 3D detection.
We induce diversity with heterogeneous backbones and task-agnostic features.
We achieve similar or better AP3D on the KITTI validation and test sets compared to training the same detector on the whole dataset.
arXiv Detail & Related papers (2025-11-24T16:49:20Z)
- Semi-supervised 3D Semantic Scene Completion with 2D Vision Foundation Model Guidance [8.07701188057789]
We introduce a novel semi-supervised framework to alleviate the dependency on densely annotated data.
Our approach leverages 2D foundation models to generate essential 3D scene geometric and semantic cues.
Our method achieves up to 85% of the fully-supervised performance using only 10% labeled data.
arXiv Detail & Related papers (2024-08-21T12:13:18Z)
- 3D Weakly Supervised Semantic Segmentation with 2D Vision-Language Guidance [68.8825501902835]
3DSS-VLG is a weakly supervised approach for 3D Semantic Segmentation with 2D Vision-Language Guidance.
To the best of our knowledge, this is the first work to investigate 3D weakly supervised semantic segmentation by using the textual semantic information of text category labels.
arXiv Detail & Related papers (2024-07-13T09:39:11Z)
- Scalable Vision-Based 3D Object Detection and Monocular Depth Estimation for Autonomous Driving [5.347428263669927]
This dissertation is a multifaceted contribution to the advancement of vision-based 3D perception technologies.
In the first segment, the thesis introduces structural enhancements to both monocular and stereo 3D object detection algorithms.
The second segment is devoted to data-driven strategies and their real-world applications in 3D vision detection.
arXiv Detail & Related papers (2024-03-04T13:42:54Z)
- A Review and A Robust Framework of Data-Efficient 3D Scene Parsing with Traditional/Learned 3D Descriptors [10.497309421830671]
Existing state-of-the-art 3D point cloud understanding methods perform well only in a fully supervised manner.
This work presents a general and simple framework to tackle point cloud understanding when labels are limited.
arXiv Detail & Related papers (2023-12-03T02:51:54Z)
- Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment [55.11291053011696]
This work presents a framework for dealing with 3D scene understanding when the labeled scenes are quite limited.
To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy.
In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark.
arXiv Detail & Related papers (2023-12-01T15:47:04Z)
- ODM3D: Alleviating Foreground Sparsity for Semi-Supervised Monocular 3D Object Detection [15.204935788297226]
The ODM3D framework entails cross-modal knowledge distillation at various levels to inject LiDAR-domain knowledge into a monocular detector during training.
By identifying foreground sparsity as the main culprit behind existing methods' suboptimal training, we exploit the precise localisation information embedded in LiDAR points.
Our method ranks 1st in both KITTI validation and test benchmarks, significantly surpassing all existing monocular methods, supervised or semi-supervised.
arXiv Detail & Related papers (2023-10-28T07:12:09Z)
- NeurOCS: Neural NOCS Supervision for Monocular 3D Object Localization [80.3424839706698]
We present NeurOCS, a framework that uses instance masks and 3D boxes as input to learn 3D object shapes by means of differentiable rendering.
Our approach rests on insights in learning a category-level shape prior directly from real driving scenes.
We make critical design choices to learn object coordinates more effectively from an object-centric view.
arXiv Detail & Related papers (2023-05-28T16:18:41Z)
- The Devil is in the Task: Exploiting Reciprocal Appearance-Localization Features for Monocular 3D Object Detection [62.1185839286255]
Low-cost monocular 3D object detection plays a fundamental role in autonomous driving.
We introduce a Dynamic Feature Reflecting Network, named DFR-Net.
We rank 1st among all the monocular 3D object detectors in the KITTI test set.
arXiv Detail & Related papers (2021-12-28T07:31:18Z)
- 3D Spatial Recognition without Spatially Labeled 3D [127.6254240158249]
We introduce WyPR, a Weakly-supervised framework for Point cloud Recognition.
We show that WyPR can detect and segment objects in point cloud data without access to any spatial labels at training time.
arXiv Detail & Related papers (2021-05-13T17:58:07Z)
- M3DSSD: Monocular 3D Single Stage Object Detector [82.25793227026443]
We propose a Monocular 3D Single Stage object Detector (M3DSSD) with feature alignment and asymmetric non-local attention.
The proposed M3DSSD achieves significantly better performance than the monocular 3D object detection methods on the KITTI dataset.
arXiv Detail & Related papers (2021-03-24T13:09:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.