Related papers: StereoDETR: Stereo-based Transformer for 3D Object Detection

StereoDETR: Stereo-based Transformer for 3D Object Detection

URL: http://arxiv.org/abs/2511.18788v1
Date: Mon, 24 Nov 2025 05:38:31 GMT
Title: StereoDETR: Stereo-based Transformer for 3D Object Detection
Authors: Shiyi Mu, Zichong Gu, Zhiqi Ai, Anqi Liu, Yilin Gao, Shugong Xu,
Abstract summary: We propose StereoDETR, an efficient stereo 3D object detection framework based on DETR.<n>It achieves twice the accuracy of monocular approaches, yet its inference speed is only half as fast.<n>It also achieves competitive accuracy on the public KITTI benchmark, setting new state-of-the-art results on pedestrian and cyclist subsets.
Score: 29.652689845108046
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Compared to monocular 3D object detection, stereo-based 3D methods offer significantly higher accuracy but still suffer from high computational overhead and latency. The state-of-the-art stereo 3D detection method achieves twice the accuracy of monocular approaches, yet its inference speed is only half as fast. In this paper, we propose StereoDETR, an efficient stereo 3D object detection framework based on DETR. StereoDETR consists of two branches: a monocular DETR branch and a stereo branch. The DETR branch is built upon 2D DETR with additional channels for predicting object scale, orientation, and sampling points. The stereo branch leverages low-cost multi-scale disparity features to predict object-level depth maps. These two branches are coupled solely through a differentiable depth sampling strategy. To handle occlusion, we introduce a constrained supervision strategy for sampling points without requiring extra annotations. StereoDETR achieves real-time inference and is the first stereo-based method to surpass monocular approaches in speed. It also achieves competitive accuracy on the public KITTI benchmark, setting new state-of-the-art results on pedestrian and cyclist subsets. The code is available at https://github.com/shiyi-mu/StereoDETR-OPEN.

Related papers

StereoMV2D: A Sparse Temporal Stereo-Enhanced Framework for Robust Multi-View 3D Object Detection [31.8104389684728]
We propose StereoMV2D, a unified framework that integrates temporal stereo modeling into the 2D detection-guided multi-view 3D detector.<n>By exploiting cross-temporal disparities of the same object across adjacent frames, StereoMV2D enhances depth perception and refines the query priors.<n>Experiments on the nuScenes and Argoverse 2 datasets demonstrate that StereoMV2D achieves superior detection performance without incurring significant computational overhead.
arXiv Detail & Related papers (2025-12-19T14:25:46Z)
SVDM: Single-View Diffusion Model for Pseudo-Stereo 3D Object Detection [0.0]
A recently proposed framework for monocular 3D detection based on Pseudo-Stereo has received considerable attention in the community. In this work, we propose an end-to-end, efficient pseudo-stereo 3D detection framework by introducing a Single-View Diffusion Model. SVDM allows the entire pseudo-stereo 3D detection pipeline to be trained end-to-end and can benefit from the training of stereo detectors.
arXiv Detail & Related papers (2023-07-05T13:10:37Z)
DSGN++: Exploiting Visual-Spatial Relation forStereo-based 3D Detectors [60.88824519770208]
Camera-based 3D object detectors are welcome due to their wider deployment and lower price than LiDAR sensors. We revisit the prior stereo modeling DSGN about the stereo volume constructions for representing both 3D geometry and semantics. We propose our approach, DSGN++, aiming for improving information flow throughout the 2D-to-3D pipeline.
arXiv Detail & Related papers (2022-04-06T18:43:54Z)
MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection [57.969536140562674]
We introduce the first DETR framework for Monocular DEtection with a depth-guided TRansformer, named MonoDETR.<n>We formulate 3D object candidates as learnable queries and propose a depth-guided decoder to conduct object-scene depth interactions.<n>On KITTI benchmark with monocular images as input, MonoDETR achieves state-of-the-art performance and requires no extra dense depth annotations.
arXiv Detail & Related papers (2022-03-24T19:28:54Z)
LIGA-Stereo: Learning LiDAR Geometry Aware Representations for Stereo-based 3D Detector [80.7563981951707]
We propose LIGA-Stereo to learn stereo-based 3D detectors under the guidance of high-level geometry-aware representations of LiDAR-based detection models. Compared with the state-of-the-art stereo detector, our method has improved the 3D detection performance of cars, pedestrians, cyclists by 10.44%, 5.69%, 5.97% mAP respectively.
arXiv Detail & Related papers (2021-08-18T17:24:40Z)
Progressive Coordinate Transforms for Monocular 3D Object Detection [52.00071336733109]
We propose a novel and lightweight approach, dubbed em Progressive Coordinate Transforms (PCT) to facilitate learning coordinate representations. In this paper, we propose a novel and lightweight approach, dubbed em Progressive Coordinate Transforms (PCT) to facilitate learning coordinate representations.
arXiv Detail & Related papers (2021-08-12T15:22:33Z)
Shape Prior Non-Uniform Sampling Guided Real-time Stereo 3D Object Detection [59.765645791588454]
Recently introduced RTS3D builds an efficient 4D Feature-Consistency Embedding space for the intermediate representation of object without depth supervision. We propose a shape prior non-uniform sampling strategy that performs dense sampling in outer region and sparse sampling in inner region. Our proposed method has 2.57% improvement on AP3d almost without extra network parameters.
arXiv Detail & Related papers (2021-06-18T09:14:55Z)
M3DSSD: Monocular 3D Single Stage Object Detector [82.25793227026443]
We propose a Monocular 3D Single Stage object Detector (M3DSSD) with feature alignment and asymmetric non-local attention. The proposed M3DSSD achieves significantly better performance than the monocular 3D object detection methods on the KITTI dataset.
arXiv Detail & Related papers (2021-03-24T13:09:11Z)
YOLOStereo3D: A Step Back to 2D for Efficient Stereo 3D Detection [6.5702792909006735]
YOLOStereo3D is trained on one single GPU and runs at more than ten fps. It demonstrates performance comparable to state-of-the-art stereo 3D detection frameworks without usage of LiDAR data.
arXiv Detail & Related papers (2021-03-17T03:43:54Z)
Stereo Frustums: A Siamese Pipeline for 3D Object Detection [20.443003989363916]
The paper proposes a light-weighted stereo frustums matching module for 3D objection detection. The proposed framework takes advantage of a high-performance 2D detector and a point cloud segmentation network to regress 3D bounding boxes for autonomous driving vehicles.
arXiv Detail & Related papers (2020-10-27T20:46:17Z)
SMOKE: Single-Stage Monocular 3D Object Detection via Keypoint Estimation [3.1542695050861544]
Estimating 3D orientation and translation of objects is essential for infrastructure-less autonomous navigation and driving. We propose a novel 3D object detection method, named SMOKE, that combines a single keypoint estimate with regressed 3D variables. Despite of its structural simplicity, our proposed SMOKE network outperforms all existing monocular 3D detection methods on the KITTI dataset.
arXiv Detail & Related papers (2020-02-24T08:15:36Z)
DSGN: Deep Stereo Geometry Network for 3D Object Detection [79.16397166985706]
There is a large performance gap between image-based and LiDAR-based 3D object detectors. Our method, called Deep Stereo Geometry Network (DSGN), significantly reduces this gap. For the first time, we provide a simple and effective one-stage stereo-based 3D detection pipeline.
arXiv Detail & Related papers (2020-01-10T11:44:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.