DVLO4D: Deep Visual-Lidar Odometry with Sparse Spatial-temporal Fusion
- URL: http://arxiv.org/abs/2509.06023v1
- Date: Sun, 07 Sep 2025 11:43:11 GMT
- Title: DVLO4D: Deep Visual-Lidar Odometry with Sparse Spatial-temporal Fusion
- Authors: Mengmeng Liu, Michael Ying Yang, Jiuming Liu, Yunpeng Zhang, Jiangtao Li, Sander Oude Elberink, George Vosselman, Hao Cheng
- Abstract summary: We introduce DVLO4D, a novel visual-LiDAR odometry framework that leverages sparse spatial-temporal fusion to enhance accuracy and robustness. Our method is highly efficient, with an inference time of 82 ms, making it suitable for real-time deployment.
- Score: 28.146811420532455
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual-LiDAR odometry is a critical component for autonomous system localization, yet achieving high accuracy and strong robustness remains a challenge. Traditional approaches commonly struggle with sensor misalignment, fail to fully leverage temporal information, and require extensive manual tuning to handle diverse sensor configurations. To address these problems, we introduce DVLO4D, a novel visual-LiDAR odometry framework that leverages sparse spatial-temporal fusion to enhance accuracy and robustness. Our approach proposes three key innovations: (1) Sparse Query Fusion, which utilizes sparse LiDAR queries for effective multi-modal data fusion; (2) a Temporal Interaction and Update module that integrates temporally predicted positions with current frame data, providing better initialization values for pose estimation and enhancing the model's robustness against accumulated errors; and (3) a Temporal Clip Training strategy combined with a Collective Average Loss mechanism that aggregates losses across multiple frames, enabling global optimization and reducing scale drift over long sequences. Extensive experiments on the KITTI and Argoverse odometry datasets demonstrate the superiority of our proposed DVLO4D, which achieves state-of-the-art performance in terms of both pose accuracy and robustness. Additionally, our method is highly efficient, with an inference time of 82 ms, showing potential for real-time deployment.
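Of the three innovations, the Temporal Clip Training strategy with Collective Average Loss is concrete enough to sketch. The snippet below is a minimal, hypothetical PyTorch illustration, not the authors' released code: the DummyOdometry model, the pose_loss weighting, and the 4-frame clip length are all assumptions; only the idea of averaging per-frame pose losses over a clip and backpropagating once comes from the abstract.
```python
# Hedged sketch of Temporal Clip Training with a Collective Average Loss.
# All names here are illustrative assumptions, not the paper's actual API.
import torch

class DummyOdometry(torch.nn.Module):
    """Stand-in for the visual-LiDAR odometry network (assumed interface)."""
    def __init__(self, feat_dim: int = 16):
        super().__init__()
        self.encoder = torch.nn.Linear(feat_dim, feat_dim)
        self.head = torch.nn.Linear(feat_dim, 6)  # 3 translation + 3 rotation

    def forward(self, frame, prev_state):
        feat = torch.relu(self.encoder(frame))
        if prev_state is not None:
            # Crude stand-in for the Temporal Interaction and Update module:
            # reuse the previous frame's state as an initialization signal.
            feat = feat + prev_state
        return self.head(feat), feat

def pose_loss(pred, gt, w_t: float = 1.0, w_r: float = 1.0):
    """Weighted translation + rotation error (an assumed loss form)."""
    return w_t * torch.norm(pred[:3] - gt[:3]) + w_r * torch.norm(pred[3:] - gt[3:])

def collective_average_loss(model, clip, gt_poses):
    """Average per-frame pose losses over a temporal clip, so a single
    backward pass optimizes the whole sequence jointly rather than
    frame by frame."""
    losses, state = [], None
    for frame, gt in zip(clip, gt_poses):
        pred, state = model(frame, state)
        losses.append(pose_loss(pred, gt))
    return torch.stack(losses).mean()

# Usage: one optimizer step per 4-frame clip instead of per frame.
model = DummyOdometry()
clip = [torch.randn(16) for _ in range(4)]
gt_poses = [torch.randn(6) for _ in range(4)]
collective_average_loss(model, clip, gt_poses).backward()
```
Averaging rather than summing keeps the loss magnitude independent of clip length, which is one plausible reading of why the mechanism is called a Collective Average Loss and how it supplies a sequence-level training signal against scale drift.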
Related papers
- Depth-Guided Metric-Aware Temporal Consistency for Monocular Video Human Mesh Recovery [28.9705779052755]
We propose a comprehensive framework that achieves metric-aware temporal consistency through three synergistic components. A Depth-Guided Multi-Scale Fusion module adaptively integrates geometric priors with RGB features via confidence-aware gating. A Motion-Depth Aligned Refinement (MoDAR) module enforces temporal coherence through cross-modal attention between motion dynamics and geometric cues.
arXiv Detail & Related papers (2026-02-04T06:41:03Z) - Forward Consistency Learning with Gated Context Aggregation for Video Anomaly Detection [17.79982215633934]
Video anomaly detection (VAD) aims to measure deviations from normal patterns for various events in real-time surveillance systems. Most existing VAD methods rely on large-scale models to pursue extreme accuracy, limiting their feasibility on resource-limited edge devices. We introduce FoGA, a lightweight VAD model that performs Forward consistency learning with Gated context aggregation.
arXiv Detail & Related papers (2026-01-26T04:35:31Z) - Reloc-VGGT: Visual Re-localization with Geometry Grounded Transformer [40.778996326009185]
We present the first visual localization framework that performs multi-view spatial integration through an early-fusion mechanism. Our framework is built upon the VGGT backbone, which encodes multi-view 3D geometry. We propose a novel sparse mask attention strategy that reduces computational cost by avoiding the quadratic complexity of global attention.
arXiv Detail & Related papers (2025-12-26T06:12:17Z) - Machine Learning for Scientific Visualization: Ensemble Data Analysis [0.0]
This dissertation explores deep learning techniques to improve scientific visualization. We introduce autoencoder-supervised dimensionality reduction for scientific ensembles. Next, we present FLINT, a deep learning model for expressive high-quality flow estimation. Finally, we introduce HyperFLINT, a hypernetwork-based approach to estimate flow fields and interpolate data.
arXiv Detail & Related papers (2025-11-28T15:45:54Z) - HAD: Hierarchical Asymmetric Distillation to Bridge Spatio-Temporal Gaps in Event-Based Object Tracking [80.07224739976911]
RGB cameras excel at capturing rich texture with high resolution, whereas event cameras offer exceptional temporal resolution and dynamic range.
arXiv Detail & Related papers (2025-10-22T13:15:13Z) - Visual Odometry with Transformers [68.453547770334]
We introduce Visual odometry Transformer (VoT), which processes sequences of monocular frames by extracting features. Unlike prior methods, VoT directly predicts camera motion without estimating dense geometry and relies solely on camera poses for supervision. VoT scales effectively with larger datasets, benefits substantially from stronger pre-trained backbones, generalizes across diverse camera motions and calibration settings, and outperforms traditional methods while running more than 3 times faster.
arXiv Detail & Related papers (2025-10-02T17:00:14Z) - Temporal and Rotational Calibration for Event-Centric Multi-Sensor Systems [24.110040599070796]
Event cameras generate asynchronous signals in response to pixel-level brightness changes. We propose a motion-based temporal and rotational calibration framework tailored for event-centric multi-sensor systems.
arXiv Detail & Related papers (2025-08-18T01:53:27Z) - SCENT: Robust Spatiotemporal Learning for Continuous Scientific Data via Scalable Conditioned Neural Fields [11.872753517172555]
We present SCENT, a novel framework for scalable and continuity-informed spatiotemporal modeling. SCENT unifies representation, reconstruction, and forecasting within a single architecture. We validate SCENT through extensive simulations and real-world experiments, demonstrating state-of-the-art performance.
arXiv Detail & Related papers (2025-04-16T17:17:31Z) - SuperFlow++: Enhanced Spatiotemporal Consistency for Cross-Modal Data Pretraining [62.433137130087445]
SuperFlow++ is a novel framework that integrates pretraining and downstream tasks using consecutive camera pairs. We show that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions. With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving.
arXiv Detail & Related papers (2025-03-25T17:59:57Z) - Universal Online Temporal Calibration for Optimization-based Visual-Inertial Navigation Systems [13.416013522770905]
We propose a universal online temporal calibration strategy for optimization-based visual-inertial navigation systems. We use the time offset td as a state parameter in the optimization residual model to align the IMU state to the corresponding image timestamp. Our approach provides more accurate time offset estimation and faster convergence, particularly in the presence of noisy sensor data.
arXiv Detail & Related papers (2025-01-03T12:41:25Z) - Temporal Feature Matters: A Framework for Diffusion Model Quantization [105.3033493564844]
Diffusion models rely on the time-step for multi-round denoising. We introduce a novel quantization framework that includes three strategies. This framework preserves most of the temporal information and ensures high-quality end-to-end generation.
arXiv Detail & Related papers (2024-07-28T17:46:15Z) - StreamLTS: Query-based Temporal-Spatial LiDAR Fusion for Cooperative Object Detection [0.552480439325792]
We propose Time-Aligned COoperative Object Detection (TA-COOD), for which we adapt the widely used OPV2V and DairV2X datasets.
Experiment results confirm the superior efficiency of our fully sparse framework compared to the state-of-the-art dense models.
arXiv Detail & Related papers (2024-07-04T10:56:10Z) - Enhanced Spatio-Temporal Context for Temporally Consistent Robust 3D Human Motion Recovery from Monocular Videos [5.258814754543826]
We propose a novel method for temporally consistent motion estimation from a monocular video.
Instead of using generic ResNet-like features, our method uses a body-aware feature representation and an independent per-frame pose.
Our method attains significantly lower acceleration error and outperforms the existing state-of-the-art methods.
arXiv Detail & Related papers (2023-11-20T10:53:59Z) - Ret3D: Rethinking Object Relations for Efficient 3D Object Detection in Driving Scenes [82.4186966781934]
We introduce a simple, efficient, and effective two-stage detector, termed Ret3D.
At the core of Ret3D is the utilization of novel intra-frame and inter-frame relation modules.
With negligible extra overhead, Ret3D achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-08-18T03:48:58Z) - Self-Supervised Multi-Frame Monocular Scene Flow [61.588808225321735]
We introduce a multi-frame monocular scene flow network based on self-supervised learning.
We observe state-of-the-art accuracy among monocular scene flow methods based on self-supervised learning.
arXiv Detail & Related papers (2021-05-05T17:49:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.