OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness
- URL: http://arxiv.org/abs/2602.19035v1
- Date: Sun, 22 Feb 2026 04:18:29 GMT
- Title: OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness
- Authors: Phuc D. A. Nguyen, Anh N. Nhu, Ming C. Lin
- Abstract summary: We introduce OpenVO, a novel framework for Open-world Visual Odometry (VO) with temporal awareness under limited input conditions. OpenVO effectively estimates real-world-scale ego-motion from monocular dashcam footage with varying observation rates and uncalibrated cameras.
- Score: 20.241033894831357
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce OpenVO, a novel framework for Open-world Visual Odometry (VO) with temporal awareness under limited input conditions. OpenVO effectively estimates real-world-scale ego-motion from monocular dashcam footage with varying observation rates and uncalibrated cameras, enabling robust trajectory dataset construction from rare driving events recorded by dashcams. Existing VO methods are trained on a fixed observation frequency (e.g., 10Hz or 12Hz), completely overlooking temporal dynamics information. Many prior methods also require calibrated cameras with known intrinsic parameters. Consequently, their performance degrades when (1) deployed under unseen observation frequencies or (2) applied to uncalibrated cameras. These limitations significantly restrict their generalizability to many downstream tasks, such as extracting trajectories from dashcam footage. To address these challenges, OpenVO (1) explicitly encodes temporal dynamics information within a two-frame pose regression framework and (2) leverages 3D geometric priors derived from foundation models. We validate our method on three major autonomous-driving benchmarks - KITTI, nuScenes, and Argoverse 2 - achieving more than 20% performance improvement over state-of-the-art approaches. Under varying observation rate settings, our method is significantly more robust, achieving 46%-92% lower errors across all metrics. These results demonstrate the versatility of OpenVO for real-world 3D reconstruction and diverse downstream applications.
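To illustrate why the observation frequency matters for two-frame pose regression, here is a minimal sketch. It is not taken from the paper: the function name, the constant-speed assumption, and the example speed are all hypothetical. It computes how far a vehicle moves between two consecutive frames when the same motion is sampled at different frame rates.

```python
def inter_frame_displacement(speed_mps: float, rate_hz: float) -> float:
    """Distance travelled between two consecutive frames at constant speed.

    Illustrative helper (an assumption, not part of OpenVO).
    """
    if rate_hz <= 0:
        raise ValueError("rate_hz must be positive")
    dt = 1.0 / rate_hz  # time between consecutive frames, in seconds
    return speed_mps * dt


if __name__ == "__main__":
    speed = 12.0  # hypothetical vehicle speed in m/s
    for rate in (12.0, 10.0, 4.0, 2.0):
        d = inter_frame_displacement(speed, rate)
        print(f"{rate:>4.0f} Hz -> {d:.2f} m per frame pair")
```

The same physical speed yields a different per-pair translation at each rate, so a regressor trained only on pairs from one fixed rate has no direct cue for the elapsed time when the rate changes. This is one plausible reading of why the abstract stresses encoding temporal information explicitly.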
Related papers
- Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation [49.12018869332346]
InfCam is a camera-controlled video-to-video generation framework with high pose fidelity. The framework integrates two key components: (1) infinite homography warping, which encodes 3D camera rotations directly within the 2D latent space of a video diffusion model.
arXiv Detail & Related papers (2025-12-18T20:03:05Z)
- DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images [36.562825380568384]
We introduce Driving Gaussian Grounded Transformer (DGGT), a unified framework for pose-free dynamic scene reconstruction. Our approach jointly predicts per-frame 3D Gaussian maps and camera parameters, and disentangles dynamics with a lightweight dynamic head. A diffusion-based rendering refinement further reduces motion/interpolation artifacts and improves novel-view quality under sparse inputs.
arXiv Detail & Related papers (2025-12-02T18:29:18Z)
- Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence [70.2803680525165]
We introduce Open-o3 Video, a non-agent framework that integrates explicit evidence into video reasoning. The model highlights key objects and bounding boxes alongside its answers, allowing reasoning to be grounded in concrete visual observations. On the V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance, raising mAM by 14.4% and mL timestamp by 24.2%.
arXiv Detail & Related papers (2025-10-23T14:05:56Z)
- Lightweight Multi-Frame Integration for Robust YOLO Object Detection in Videos [11.532574301455854]
We propose a highly effective strategy for multi-frame video object detection. Our method improves robustness, especially for lightweight models. We contribute the BOAT360 benchmark dataset to support future research in multi-frame video object detection in challenging real-world scenarios.
arXiv Detail & Related papers (2025-06-25T15:49:07Z)
- DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer [56.98400572837792]
DiVE produces high-fidelity, temporally coherent, and cross-view consistent multi-view videos. These innovations collectively achieve a 2.62x speedup with minimal quality degradation.
arXiv Detail & Related papers (2025-04-28T09:20:50Z)
- SuperFlow++: Enhanced Spatiotemporal Consistency for Cross-Modal Data Pretraining [62.433137130087445]
SuperFlow++ is a novel framework that integrates pretraining and downstream tasks using consecutive camera pairs. We show that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions. With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving.
arXiv Detail & Related papers (2025-03-25T17:59:57Z)
- DATAP-SfM: Dynamic-Aware Tracking Any Point for Robust Structure from Motion in the Wild [85.03973683867797]
This paper proposes a concise, elegant, and robust pipeline to estimate smooth camera trajectories and obtain dense point clouds for casual videos in the wild.
We show that the proposed method achieves state-of-the-art performance in terms of camera pose estimation even in complex dynamic challenge scenes.
arXiv Detail & Related papers (2024-11-20T13:01:16Z)
- Application of 2D Homography for High Resolution Traffic Data Collection using CCTV Cameras [9.946460710450319]
This study implements a three-stage video analytics framework for extracting high-resolution traffic data from CCTV cameras.
The key components of the framework include object recognition, perspective transformation, and vehicle trajectory reconstruction.
The results of the study showed about +/- 4.5% error rate for directional traffic counts, less than 10% MSE for speed bias between camera estimates.
arXiv Detail & Related papers (2024-01-14T07:33:14Z)
- XVO: Generalized Visual Odometry via Cross-Modal Self-Training [11.70220331540621]
XVO is a semi-supervised learning method for training generalized monocular Visual Odometry (VO) models.
In contrast to standard monocular VO approaches which often study a known calibration within a single dataset, XVO efficiently learns to recover relative pose with real-world scale.
We optimize the motion estimation model via self-training from large amounts of unconstrained and heterogeneous dash camera videos available on YouTube.
arXiv Detail & Related papers (2023-09-28T18:09:40Z)
- Benchmarking the Robustness of LiDAR-Camera Fusion for 3D Object Detection [58.81316192862618]
Two critical sensors for 3D perception in autonomous driving are the camera and the LiDAR.
Fusing these two modalities can significantly boost the performance of 3D perception models.
We benchmark the state-of-the-art fusion methods for the first time.
arXiv Detail & Related papers (2022-05-30T09:35:37Z)
- Feature-based Event Stereo Visual Odometry [2.7298989068857487]
We propose a novel stereo visual odometry method for event cameras based on feature detection and matching with careful feature management.
We evaluate the performance of the proposed method on two publicly available datasets: MVSEC sequences captured by an indoor flying drone and DSEC outdoor driving sequences.
arXiv Detail & Related papers (2021-07-10T22:36:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.