Breaking Down Monocular Ambiguity: Exploiting Temporal Evolution for 3D Lane Detection
- URL: http://arxiv.org/abs/2504.20525v3
- Date: Wed, 05 Nov 2025 02:18:03 GMT
- Title: Breaking Down Monocular Ambiguity: Exploiting Temporal Evolution for 3D Lane Detection
- Authors: Huan Zheng, Wencheng Han, Tianyi Yan, Cheng-zhong Xu, Jianbing Shen
- Abstract summary: Monocular 3D lane detection aims to estimate the 3D position of lanes from frontal-view (FV) images. Existing methods are constrained by the inherent ambiguity of single-frame input. We propose to unlock the rich information embedded in the temporal evolution of the scene as the vehicle moves.
- Score: 79.98605061363999
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Monocular 3D lane detection aims to estimate the 3D position of lanes from frontal-view (FV) images. However, existing methods are fundamentally constrained by the inherent ambiguity of single-frame input, which leads to inaccurate geometric predictions and poor lane integrity, especially for distant lanes. To overcome this, we propose to unlock the rich information embedded in the temporal evolution of the scene as the vehicle moves. Our proposed Geometry-aware Temporal Aggregation Network (GTA-Net) systematically leverages the temporal information from complementary perspectives. First, Temporal Geometry Enhancement Module (TGEM) learns geometric consistency across consecutive frames, effectively recovering depth information from motion to build a reliable 3D scene representation. Second, to enhance lane integrity, Temporal Instance-aware Query Generation (TIQG) module aggregates instance cues from past and present frames. Crucially, for lanes that are ambiguous in the current view, TIQG innovatively synthesizes a pseudo future perspective to generate queries that reveal lanes which would otherwise be missed. The experiments demonstrate that GTA-Net achieves new SoTA results, significantly outperforming existing monocular 3D lane detection solutions.
Related papers
- SparseLaneSTP: Leveraging Spatio-Temporal Priors with Sparse Transformers for 3D Lane Detection [42.86570387250456]
3D lane detection has emerged as a critical challenge in autonomous driving. We present SparseLaneSTP, a novel method that integrates both geometric properties of the lane structure and temporal information into a sparse lane transformer. It introduces a new lane-specific-temporal attention mechanism, a continuous lane representation tailored for sparse architectures as well as temporal regularization.
arXiv Detail & Related papers (2026-01-08T14:16:11Z) - FastVGGT: Training-Free Acceleration of Visual Geometry Transformer [83.67766078575782]
VGGT is a state-of-the-art feed-forward visual geometry model. We propose FastVGGT, which leverages token merging in the 3D domain through a training-free mechanism for accelerating VGGT. With 1000 input images, FastVGGT achieves a 4x speedup over VGGT while mitigating error accumulation in long-sequence scenarios.
arXiv Detail & Related papers (2025-09-02T17:54:21Z) - MonoDGP: Monocular 3D Object Detection with Decoupled-Query and Geometry-Error Priors [24.753860375872215]
This paper presents a Transformer-based monocular 3D object detection method called MonoDGP. It adopts perspective-invariant geometry errors to modify the projection formula. Our method demonstrates state-of-the-art performance on the KITTI benchmark without extra data.
arXiv Detail & Related papers (2024-10-25T14:31:43Z) - LaneCPP: Continuous 3D Lane Detection using Physical Priors [45.52331418900137]
LaneCPP is a continuous 3D lane detection model that leverages physical prior knowledge about the lane structure and road geometry.
We show the benefits of our contributions and prove the meaningfulness of using priors to make 3D lane detection more robust.
arXiv Detail & Related papers (2024-06-12T16:31:06Z) - Invisible Stitch: Generating Smooth 3D Scenes with Depth Inpainting [75.7154104065613]
We introduce a novel depth completion model, trained via teacher distillation and self-training to learn the 3D fusion process.
We also introduce a new benchmarking scheme for scene generation methods that is based on ground truth geometry.
arXiv Detail & Related papers (2024-04-30T17:59:40Z) - Learning Monocular Depth in Dynamic Environment via Context-aware Temporal Attention [9.837958401514141]
We present CTA-Depth, a Context-aware Temporal Attention guided network for multi-frame monocular Depth estimation.
Our approach achieves significant improvements over state-of-the-art approaches on three benchmark datasets.
arXiv Detail & Related papers (2023-05-12T11:48:32Z) - OPA-3D: Occlusion-Aware Pixel-Wise Aggregation for Monocular 3D Object Detection [51.153003057515754]
OPA-3D is a single-stage, end-to-end, Occlusion-Aware Pixel-Wise Aggregation network.
It jointly estimates dense scene depth with depth-bounding box residuals and object bounding boxes.
It outperforms state-of-the-art methods on the main Car category.
arXiv Detail & Related papers (2022-11-02T14:19:13Z) - Reconstruct from Top View: A 3D Lane Detection Approach based on Geometry Structure Prior [19.1954119672487]
We propose an advanced approach to monocular 3D lane detection that leverages the geometric structure underlying the process of 2D-to-3D lane reconstruction.
We first analyze the geometry between the 3D lane and its 2D representation on the ground and propose to impose explicit supervision based on the structure prior.
Second, to reduce the structure loss in 2D lane representation, we directly extract top view lane information from front view images.
arXiv Detail & Related papers (2022-06-21T04:03:03Z) - PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark [109.03773439461615]
PersFormer is an end-to-end monocular 3D lane detector with a novel Transformer-based spatial feature transformation module.
We release one of the first large-scale real-world 3D lane datasets, called OpenLane, with high-quality annotation and scenario diversity.
arXiv Detail & Related papers (2022-03-21T16:12:53Z) - Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z) - Cylindrical and Asymmetrical 3D Convolution Networks for LiDAR-based Perception [122.53774221136193]
State-of-the-art methods for driving-scene LiDAR-based perception often project the point clouds to 2D space and then process them via 2D convolution.
A natural remedy is to utilize the 3D voxelization and 3D convolution network.
We propose a new framework for the outdoor LiDAR segmentation, where cylindrical partition and asymmetrical 3D convolution networks are designed to explore the 3D geometric pattern.
arXiv Detail & Related papers (2021-09-12T06:25:11Z) - Learning Geometry-Guided Depth via Projective Modeling for Monocular 3D Object Detection [70.71934539556916]
We learn geometry-guided depth estimation with projective modeling to advance monocular 3D object detection.
Specifically, a principled geometry formula with projective modeling of 2D and 3D depth predictions in the monocular 3D object detection network is devised.
Our method remarkably improves the detection performance of the state-of-the-art monocular-based method without extra data by 2.80% on the moderate test setting.
arXiv Detail & Related papers (2021-07-29T12:30:39Z) - 3D Object Detection for Autonomous Driving: A Survey [14.772968858398043]
3D object detection serves as the core basis of perception systems for autonomous driving.
Despite existing efforts, 3D object detection on point clouds is still in its infancy.
Recent state-of-the-art detection methods with their pros and cons are presented.
arXiv Detail & Related papers (2021-06-21T03:17:20Z) - Geometry-aware data augmentation for monocular 3D object detection [18.67567745336633]
This paper focuses on monocular 3D object detection, one of the essential modules in autonomous driving systems.
A key challenge is that the depth recovery problem is ill-posed in monocular data.
We conduct a thorough analysis to reveal how existing methods fail to robustly estimate depth when different geometry shifts occur.
We convert the aforementioned manipulations into four corresponding 3D-aware data augmentation techniques.
arXiv Detail & Related papers (2021-04-12T23:12:48Z) - Road Curb Detection and Localization with Monocular Forward-view Vehicle Camera [74.45649274085447]
We propose a robust method for estimating road curb 3D parameters using a calibrated monocular camera equipped with a fisheye lens.
Our approach is able to estimate the vehicle to curb distance in real time with mean accuracy of more than 90%.
arXiv Detail & Related papers (2020-02-28T00:24:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.