Related papers: MonoVQD: Monocular 3D Object Detection with Variational Query Denoising and Self-Distillation

URL: http://arxiv.org/abs/2506.14835v1
Date: Sat, 14 Jun 2025 14:49:12 GMT
Title: MonoVQD: Monocular 3D Object Detection with Variational Query Denoising and Self-Distillation
Authors: Kiet Dang Vu, Trung Thai Tran, Duc Dung Nguyen
Abstract summary: We introduce MonoVQD, a novel framework designed to advance DETR-based monocular 3D detection. The Mask Separated Self-Attention mechanism enables the integration of the denoising process into a DETR architecture. We present the Variational Query Denoising technique to address the gradient vanishing problem of conventional denoising methods. Rigorous experimentation demonstrates that MonoVQD achieves superior performance on the challenging KITTI monocular benchmark.
Score: 0.6144680854063939
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Precisely localizing 3D objects from a single image constitutes a central challenge in monocular 3D detection. While DETR-like architectures offer a powerful paradigm, their direct application in this domain encounters inherent limitations, preventing optimal performance. Our work addresses these challenges by introducing MonoVQD, a novel framework designed to fundamentally advance DETR-based monocular 3D detection. We propose three main contributions. First, we propose the Mask Separated Self-Attention mechanism that enables the integration of the denoising process into a DETR architecture. This improves the stability of Hungarian matching to achieve a consistent optimization objective. Second, we present the Variational Query Denoising technique to address the gradient vanishing problem of conventional denoising methods, which severely restricts the efficiency of the denoising process. This explicitly introduces stochastic properties to mitigate this fundamental limitation and unlock substantial performance gains. Finally, we introduce a sophisticated self-distillation strategy, leveraging insights from later decoder layers to synergistically improve query quality in earlier layers, thereby amplifying the iterative refinement process. Rigorous experimentation demonstrates that MonoVQD achieves superior performance on the challenging KITTI monocular benchmark. Highlighting its broad applicability, MonoVQD's core components seamlessly integrate into other architectures, delivering significant performance gains even in multi-view 3D detection scenarios on the nuScenes dataset and underscoring its robust generalization capabilities.

Related papers

One-Shot Refiner: Boosting Feed-forward Novel View Synthesis via One-Step Diffusion [57.824020826432815]
We present a novel framework for high-fidelity novel view synthesis (NVS) from sparse images. We design a Dual-Domain Detail Perception Module, which enables handling high-resolution images without being limited by the ViT backbone. We develop a feature-guided diffusion network, which can preserve high-frequency details during the restoration process.
arXiv Detail & Related papers (2026-01-20T17:11:55Z)
Mono3DV: Monocular 3D Object Detection with 3D-Aware Bipartite Matching and Variational Query DeNoising [0.6423989407081764]
Mono3DV is a novel Transformer-based framework for 3D object detection. We develop a 3D-Aware Bipartite Matching strategy that directly incorporates 3D geometric information into the matching cost. Second, it is important to stabilize the Bipartite Matching to resolve the instability occurring when integrating 3D attributes.
arXiv Detail & Related papers (2026-01-03T02:06:28Z)
RobustSplat++: Decoupling Densification, Dynamics, and Illumination for In-the-Wild 3DGS [85.90134051583368]
3D Gaussian Splatting (3DGS) has gained significant attention for its real-time, photo-realistic rendering in novel-view synthesis and 3D modeling. Existing methods struggle with accurately modeling in-the-wild scenes affected by transient objects and illuminations. We propose RobustSplat++, a robust solution based on several critical designs.
arXiv Detail & Related papers (2025-12-04T14:05:09Z)
Difficulty-Aware Label-Guided Denoising for Monocular 3D Object Detection [17.487124484503322]
We propose MonoDLGD, a novel Difficulty-Aware Label-Guided Denoising framework. MonoDLGD adaptively perturbs and reconstructs ground-truth labels based on detection uncertainty. Experiments on the KITTI benchmark demonstrate that MonoDLGD achieves state-of-the-art performance across all difficulty levels.
arXiv Detail & Related papers (2025-11-17T10:02:18Z)
High-Quality Proposal Encoding and Cascade Denoising for Imaginary Supervised Object Detection [20.075203668387136]
Existing object detection methods suffer from simplistic prompts, poor image quality, and weak supervision. We propose Cascade HQP-DETR to address these limitations. First, we introduce a high-quality data pipeline using LLaMA-3, Flux, and Grounding DINO to generate the FluxVOC and FluxCOCO datasets. Second, our High-Quality Proposal guided query encoding initializes object queries with image-specific priors from SAM-generated proposals. Third, our cascade denoising algorithm dynamically adjusts training weights through progressively increasing IoU thresholds across decoder layers.
arXiv Detail & Related papers (2025-11-11T09:19:56Z)
CLoD-GS: Continuous Level-of-Detail via 3D Gaussian Splatting [7.764273859026904]
We introduce CLoD-GS, a framework that integrates a continuous LoD mechanism directly into a 3DGS representation. CLoD-GS achieves smooth, quality-scalable rendering from a single model, delivering high-fidelity results across a range of performance targets.
arXiv Detail & Related papers (2025-10-11T03:48:11Z)
RobustSplat: Decoupling Densification and Dynamics for Transient-Free 3DGS [79.15416002879239]
3D Gaussian Splatting has gained significant attention for its real-time, photo-realistic rendering in novel-view synthesis and 3D modeling. Existing methods struggle with accurately modeling scenes affected by transient objects, leading to artifacts in the rendered images. We propose RobustSplat, a robust solution based on two critical designs.
arXiv Detail & Related papers (2025-06-03T11:13:48Z)
Scalable Benchmarking and Robust Learning for Noise-Free Ego-Motion and 3D Reconstruction from Noisy Video [30.89206445146674]
We aim to redefine robust ego-motion estimation and photorealistic 3D reconstruction by addressing a critical limitation: reliance on noise-free data. We tackle three core challenges: scalable data generation, comprehensive robustness, and model enhancement. We create Robust-Ego3D, a benchmark rigorously designed to expose noise-induced performance degradation.
arXiv Detail & Related papers (2025-01-24T08:25:48Z)
Feature Attenuation of Defective Representation Can Resolve Incomplete Masking on Anomaly Detection [1.0358639819750703]
In unsupervised anomaly detection (UAD) research, it is necessary to develop a computationally efficient and scalable solution. We revisit the reconstruction-by-inpainting approach and improve it by analyzing its strengths and weaknesses. We propose Feature Attenuation of Defective Representation (FADeR), which employs only two layers to attenuate feature information of anomaly reconstruction.
arXiv Detail & Related papers (2024-07-05T15:44:53Z)
Unsupervised Domain Adaptation for Monocular 3D Object Detection via Self-Training [57.25828870799331]
We propose STMono3D, a new self-teaching framework for unsupervised domain adaptation on Mono3D. We develop a teacher-student paradigm to generate adaptive pseudo labels on the target domain. STMono3D achieves remarkable performance on all evaluated datasets and even surpasses fully supervised results on the KITTI 3D object detection dataset.
arXiv Detail & Related papers (2022-04-25T12:23:07Z)
Stereo Neural Vernier Caliper [57.187088191829886]
We propose a new object-centric framework for learning-based stereo 3D object detection. We tackle a problem of how to predict a refined update given an initial 3D cuboid guess. Our approach achieves state-of-the-art performance on the KITTI benchmark.
arXiv Detail & Related papers (2022-03-21T14:36:07Z)
SGM3D: Stereo Guided Monocular 3D Object Detection [62.11858392862551]
We propose a stereo-guided monocular 3D object detection network, termed SGM3D. We exploit robust 3D features extracted from stereo images to enhance the features learned from the monocular image. Our method can be integrated into many other monocular approaches to boost performance without introducing any extra computational cost.
arXiv Detail & Related papers (2021-12-03T13:57:14Z)
Secrets of 3D Implicit Object Shape Reconstruction in the Wild [92.5554695397653]
Reconstructing high-fidelity 3D objects from sparse, partial observation is crucial for various applications in computer vision, robotics, and graphics. Recent neural implicit modeling methods show promising results on synthetic or dense datasets. However, they perform poorly on real-world data that is sparse and noisy. This paper analyzes the root cause of such deficient performance of a popular neural implicit model.
arXiv Detail & Related papers (2021-01-18T03:24:48Z)
Reinforced Axial Refinement Network for Monocular 3D Object Detection [160.34246529816085]
Monocular 3D object detection aims to extract the 3D position and properties of objects from a 2D input image. Conventional approaches sample 3D bounding boxes from the space and infer the relationship between the target object and each of them. However, the probability of effective samples is relatively small in the 3D space. We propose to start with an initial prediction and refine it gradually towards the ground truth, with only one 3D parameter changed in each step. This requires designing a policy which gets a reward after several steps, and thus we adopt reinforcement learning to optimize it.
arXiv Detail & Related papers (2020-08-31T17:10:48Z)

This list is automatically generated from the titles and abstracts of the papers on this site.