Keyframe-Based Feed-Forward Visual Odometry
- URL: http://arxiv.org/abs/2601.16020v1
- Date: Thu, 22 Jan 2026 14:45:42 GMT
- Title: Keyframe-Based Feed-Forward Visual Odometry
- Authors: Weichen Dai, Wenhan Su, Da Kong, Yuhang Ming, Wanzeng Kong
- Abstract summary: Current foundation model based methods typically process raw image sequences indiscriminately. We propose a novel feed-forward VO method that employs reinforcement learning to derive an adaptive keyframe policy in a data-driven manner. Experimental results demonstrate that the proposed method achieves consistent and substantial improvements over state-of-the-art feed-forward VO methods.
- Score: 13.646685343885556
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The emergence of visual foundation models has revolutionized visual odometry~(VO) and SLAM, enabling pose estimation and dense reconstruction within a single feed-forward network. However, unlike traditional pipelines that leverage keyframe methods to enhance efficiency and accuracy, current foundation model based methods, such as VGGT-Long, typically process raw image sequences indiscriminately. This leads to computational redundancy and degraded performance caused by low inter-frame parallax, which provides limited contextual stereo information. Integrating traditional geometric heuristics into these methods is non-trivial, as their performance depends on high-dimensional latent representations rather than explicit geometric metrics. To bridge this gap, we propose a novel keyframe-based feed-forward VO. Instead of relying on hand-crafted rules, our approach employs reinforcement learning to derive an adaptive keyframe policy in a data-driven manner, aligning selection with the intrinsic characteristics of the underlying foundation model. We train our agent on the TartanAir dataset and conduct extensive evaluations across several real-world datasets. Experimental results demonstrate that the proposed method achieves consistent and substantial improvements over state-of-the-art feed-forward VO methods.
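The abstract describes a learned policy that decides, per incoming frame, whether to promote it to a keyframe instead of using hand-crafted parallax thresholds. The paper's policy is trained with reinforcement learning on foundation-model latents; the sketch below is only a minimal stand-in, assuming a hypothetical scalar logistic policy `keyframe_policy` over a frame-dissimilarity feature, with hand-picked weights rather than RL-trained ones:

```python
import numpy as np

def keyframe_policy(delta, w=4.0, b=-2.0):
    """Probability of promoting the current frame to a keyframe,
    given a scalar dissimilarity `delta` to the last keyframe.
    (Illustrative logistic policy; the paper learns this with RL.)"""
    return 1.0 / (1.0 + np.exp(-(w * delta + b)))

def select_keyframes(frame_feats, threshold=0.5):
    """Greedy rollout: keep a frame when the policy's probability
    exceeds `threshold`. The first frame is always a keyframe."""
    keyframes = [0]
    last = frame_feats[0]
    for i in range(1, len(frame_feats)):
        delta = float(np.linalg.norm(frame_feats[i] - last))
        if keyframe_policy(delta) > threshold:
            keyframes.append(i)
            last = frame_feats[i]
    return keyframes
```

With these toy weights the rollout skips near-duplicate frames (low parallax) and keeps frames whose features have drifted, which is the behavior the paper optimizes for in a data-driven way.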
Related papers
- StepVAR: Structure-Texture Guided Pruning for Visual Autoregressive Models [98.72926158261937]
We propose a training-free token pruning framework for Visual AutoRegressive models. We employ a lightweight high-pass filter to capture local texture details, while leveraging Principal Component Analysis (PCA) to preserve global structural information. To maintain valid next-scale prediction under sparse tokens, we introduce a nearest neighbor feature propagation strategy.
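The StepVAR summary pairs a high-pass texture score with a PCA-based structure score to rank tokens for pruning. The sketch below is a loose illustration of that idea, not StepVAR's actual filters: the hypothetical `prune_tokens` uses deviation from the mean token as a crude high-pass proxy and energy in the top PCA components (via SVD) as the structure term:

```python
import numpy as np

def prune_tokens(tokens, keep_ratio=0.5, n_components=4):
    """tokens: (N, D) array of visual tokens. Scores each token by
    combining a local high-frequency proxy (deviation from the mean
    token) with its energy in the top PCA components (global
    structure), then keeps the top-scoring fraction."""
    centered = tokens - tokens.mean(axis=0)
    # high-pass proxy: per-token deviation magnitude
    hp = np.linalg.norm(centered, axis=1)
    # PCA via SVD on the centered tokens
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:n_components].T          # (N, k) projections
    structure = np.linalg.norm(proj, axis=1)
    score = hp / (hp.max() + 1e-8) + structure / (structure.max() + 1e-8)
    k = max(1, int(keep_ratio * len(tokens)))
    keep = np.argsort(score)[::-1][:k]
    return np.sort(keep)
```

Tokens carrying distinctive texture or strong structural signal score high and survive pruning; near-redundant tokens are dropped, mirroring the training-free ranking idea in the summary.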
arXiv Detail & Related papers (2026-03-02T11:35:05Z)
- Toward Generalizable Deblurring: Leveraging Massive Blur Priors with Linear Attention for Real-World Scenarios [9.82847623835017]
We propose Blur Pattern Pretraining (BPP), which acquires blur priors from simulation datasets and transfers them through joint fine-tuning on real data. We further introduce Motion and Semantic Guidance (MoSeG) to strengthen blur priors under severe degradation, and integrate both into GLOWDeblur, a Generalizable reaL-wOrld lightWeight Deblur model that combines a convolution-based pre-reconstruction and domain alignment module with a lightweight diffusion backbone.
arXiv Detail & Related papers (2026-01-10T11:01:31Z)
- FoundationSLAM: Unleashing the Power of Depth Foundation Models for End-to-End Dense Visual SLAM [50.9765003472032]
FoundationSLAM is a learning-based monocular dense SLAM system for accurate and robust tracking and mapping. Our core idea is to bridge flow estimation with reasoning by leveraging the guidance from foundation depth models.
arXiv Detail & Related papers (2025-12-31T17:57:45Z)
- Visual Autoregressive Modelling for Monocular Depth Estimation [69.01449528371916]
We propose a monocular depth estimation method based on visual autoregressive (VAR) priors. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism. We report state-of-the-art performance in indoor benchmarks under constrained training conditions, and strong performance when applied to outdoor datasets.
arXiv Detail & Related papers (2025-12-27T17:08:03Z)
- Manifold Decoders: A Framework for Generative Modeling from Nonlinear Embeddings [0.0]
We introduce a systematic framework for constructing neural decoder architectures for prominent NLDR methods. We extend this framework by implementing a diffusion-based generative process that operates directly within these learned manifold spaces. Our findings reveal a fundamental trade-off: while the decoders successfully reconstruct data, their quality is surpassed by end-to-end optimized autoencoders.
arXiv Detail & Related papers (2025-10-15T14:50:51Z)
- Distance Weighted Trans Network for Image Completion [52.318730994423106]
We propose a new architecture that relies on Distance-based Weighted Transformer (DWT) to better understand the relationships between an image's components.
CNNs are used to augment the local texture information of coarse priors.
DWT blocks are used to recover certain coarse textures and coherent visual structures.
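The Distance-based Weighted Transformer summary suggests attention that accounts for how far apart an image's components are. As a loose illustration only (the paper's actual DWT block is not specified here), the hypothetical `distance_weighted_attention` below penalizes attention logits by the Euclidean distance between token positions, so nearby components attend to each other more strongly:

```python
import numpy as np

def distance_weighted_attention(q, k, v, coords, alpha=1.0):
    """Single-head attention whose logits are penalized by the
    Euclidean distance between token positions `coords` (N, 2),
    biasing each token toward spatially nearby tokens."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    # pairwise spatial distances between token coordinates
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    logits = logits - alpha * dist
    # numerically stable softmax over each row
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

As `alpha` grows, attention collapses toward each token's spatial neighborhood; `alpha = 0` recovers plain scaled dot-product attention.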
arXiv Detail & Related papers (2023-10-11T12:46:11Z)
- Enhancing Surface Neural Implicits with Curvature-Guided Sampling and Uncertainty-Augmented Representations [37.42624848693373]
We introduce a method that directly digests depth images for the task of high-fidelity 3D reconstruction.
A simple sampling strategy is proposed to generate highly effective training data.
Despite its simplicity, our method outperforms a range of both classical and learning-based baselines.
arXiv Detail & Related papers (2023-06-03T12:23:17Z)
- Universal Domain Adaptation from Foundation Models: A Baseline Study [58.51162198585434]
We make empirical studies of state-of-the-art UniDA methods using foundation models.
We introduce CLIP distillation, a parameter-free method specifically designed to distill target knowledge from CLIP models.
Although simple, our method outperforms previous approaches in most benchmark tasks.
arXiv Detail & Related papers (2023-05-18T16:28:29Z)
- Robust Visual Odometry Using Position-Aware Flow and Geometric Bundle Adjustment [16.04240592057438]
A novel optical flow network (PANet) built on a position-aware mechanism is proposed first.
Then, a novel system that jointly estimates depth, optical flow, and ego-motion without a dedicated network for learning ego-motion is proposed.
Experiments show that the proposed system outperforms other state-of-the-art methods in terms of depth, flow, and VO estimation.
arXiv Detail & Related papers (2021-11-22T12:05:27Z)
- Deep Keypoint-Based Camera Pose Estimation with Geometric Constraints [80.60538408386016]
Estimating relative camera poses from consecutive frames is a fundamental problem in visual odometry.
We propose an end-to-end trainable framework consisting of learnable modules for detection, feature extraction, matching and outlier rejection.
arXiv Detail & Related papers (2020-07-29T21:41:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.