XVO: Generalized Visual Odometry via Cross-Modal Self-Training
- URL: http://arxiv.org/abs/2309.16772v3
- Date: Sun, 8 Oct 2023 16:32:45 GMT
- Title: XVO: Generalized Visual Odometry via Cross-Modal Self-Training
- Authors: Lei Lai and Zhongkai Shangguan and Jimuyang Zhang and Eshed Ohn-Bar
- Abstract summary: XVO is a semi-supervised learning method for training generalized monocular Visual Odometry (VO) models.
In contrast to standard monocular VO approaches which often study a known calibration within a single dataset, XVO efficiently learns to recover relative pose with real-world scale.
We optimize the motion estimation model via self-training from large amounts of unconstrained and heterogeneous dash camera videos available on YouTube.
- Score: 11.70220331540621
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose XVO, a semi-supervised learning method for training generalized
monocular Visual Odometry (VO) models with robust off-the-shelf operation across
diverse datasets and settings. In contrast to standard monocular VO approaches
which often study a known calibration within a single dataset, XVO efficiently
learns to recover relative pose with real-world scale from visual scene
semantics, i.e., without relying on any known camera parameters. We optimize
the motion estimation model via self-training from large amounts of
unconstrained and heterogeneous dash camera videos available on YouTube. Our
key contribution is twofold. First, we empirically demonstrate the benefits of
semi-supervised training for learning a general-purpose direct VO regression
network. Second, we demonstrate multi-modal supervision, including
segmentation, flow, depth, and audio auxiliary prediction tasks, to facilitate
generalized representations for the VO task. Specifically, we find that the audio
prediction task significantly enhances the semi-supervised learning process
while alleviating the impact of noisy pseudo-labels, particularly in highly
dynamic and out-of-domain video data. Our proposed teacher network achieves
state-of-the-art performance on the commonly used KITTI benchmark despite no
multi-frame optimization or knowledge of camera parameters. Combined with the
proposed semi-supervised step, XVO demonstrates off-the-shelf knowledge
transfer across diverse conditions on KITTI, nuScenes, and Argoverse without
fine-tuning.
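To make the two ideas in the abstract concrete, namely the shared-encoder pose regressor with auxiliary prediction heads and the teacher-student self-training step, a minimal sketch follows. Everything in it (module shapes, the coarse auxiliary codes, uniform loss weighting, and the `self_training_step` helper) is an illustrative assumption, not the authors' released implementation.

```python
# Illustrative sketch only: a direct VO regressor with auxiliary heads
# (segmentation, flow, depth, audio) over a shared encoder, plus one
# teacher-student self-training step on unlabeled video.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskVO(nn.Module):
    def __init__(self, feat_dim=256, aux_dim=64):
        super().__init__()
        # Shared encoder over a stacked pair of consecutive RGB frames.
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Direct pose regression: translation with real-world scale + rotation.
        self.pose_head = nn.Linear(feat_dim, 6)
        # Auxiliary heads (coarse codes here; dense decoders in practice)
        # regularize the shared representation.
        self.aux_heads = nn.ModuleDict({
            name: nn.Linear(feat_dim, aux_dim)
            for name in ("seg", "flow", "depth", "audio")
        })

    def forward(self, frame_pair):
        z = self.encoder(frame_pair)
        out = {name: head(z) for name, head in self.aux_heads.items()}
        out["pose"] = self.pose_head(z)
        return out

def self_training_step(teacher, student, opt, frame_pair, aux_targets):
    """One step: the teacher pseudo-labels pose on unconstrained video,
    and the student fits it plus auxiliary targets (e.g. an audio code),
    which the abstract reports stabilizes learning under noisy labels."""
    with torch.no_grad():
        pseudo_pose = teacher(frame_pair)["pose"]
    out = student(frame_pair)
    loss = F.mse_loss(out["pose"], pseudo_pose)
    for name, target in aux_targets.items():
        loss = loss + F.mse_loss(out[name], target)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

teacher, student = MultiTaskVO(), MultiTaskVO()
opt = torch.optim.Adam(student.parameters(), lr=1e-4)
frames = torch.rand(2, 6, 128, 256)      # batch of stacked frame pairs
audio_code = torch.rand(2, 64)           # placeholder audio target
self_training_step(teacher, student, opt, frames, {"audio": audio_code})
```

In the paper's setting the teacher is first trained with supervision and pseudo-labels come from heterogeneous YouTube dash-cam video; the sketch only shows the data flow, and a real system would use dense decoders for the auxiliary tasks and filter noisy pseudo-labels.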
Related papers
- Self-supervised Pretraining and Finetuning for Monocular Depth and Visual Odometry [7.067145619709089]
We show that our self-supervised models can reach state-of-the-art performance 'without bells and whistles'.
For all datasets, our method outperforms state-of-the-art methods, in particular for the depth prediction task.
arXiv Detail & Related papers (2024-06-16T17:24:20Z)
- Learning from One Continuous Video Stream [70.30084026960819]
We introduce a framework for online learning from a single continuous video stream.
This poses great challenges given the high correlation between consecutive video frames.
We employ pixel-to-pixel modelling as a practical and flexible way to switch between pre-training and single-stream evaluation.
arXiv Detail & Related papers (2023-12-01T14:03:30Z)
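The pixel-to-pixel idea in the entry above can be pictured as a tiny online loop that predicts each next frame from the current one. The model, optimizer, and MSE objective below are placeholder assumptions, not the paper's setup.

```python
# Illustrative online loop: learn from a single stream by predicting each
# next frame from the current one (a pixel-to-pixel objective).
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

def stream_step(prev_frame, next_frame):
    """One online update; consecutive frames are highly correlated,
    which is exactly the difficulty the entry highlights."""
    loss = F.mse_loss(model(prev_frame), next_frame)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Simulated stream: each frame arrives once, in order, with no shuffling.
frames = torch.rand(10, 1, 3, 64, 64)
for t in range(9):
    stream_step(frames[t], frames[t + 1])
```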
- E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer [5.7254320553764]
E-ViLM is able to learn expressive representations from a Video-Language corpus and generalize well to a wide range of Video-Language tasks.
Our model reaches $39.3$% Top-$1$ accuracy on the MSRVTT benchmark, retaining $91.4$% of the accuracy of state-of-the-art larger VL architecture.
arXiv Detail & Related papers (2023-11-28T22:57:17Z)
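A minimal sketch of masked video modeling against vector-quantized targets, the mechanism named in the E-ViLM entry above; the random stand-in codebook, tokenizer, transformer size, and mask ratio are all assumptions.

```python
# Illustrative masked video modeling with vector-quantized targets: a
# stand-in tokenizer maps patch features to discrete codes, and masked
# positions are predicted with cross-entropy.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim, n_tok = 1024, 192, 64           # codebook size, width, tokens/clip
codebook = torch.randn(vocab, dim)          # stand-in for a learned codebook

def vq_tokenize(patch_feats):
    # nearest codebook entry per patch (a placeholder "semantic" tokenizer)
    d = torch.cdist(patch_feats,
                    codebook.unsqueeze(0).expand(patch_feats.size(0), -1, -1))
    return d.argmin(dim=-1)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)
to_logits = nn.Linear(dim, vocab)

patches = torch.randn(2, n_tok, dim)        # pretend video patch embeddings
targets = vq_tokenize(patches)              # discrete token targets
mask = torch.rand(2, n_tok) < 0.5           # mask half the positions
masked_in = patches.masked_fill(mask.unsqueeze(-1), 0.0)
logits = to_logits(encoder(masked_in))
loss = F.cross_entropy(logits[mask], targets[mask])
```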
- Unmasking Deepfakes: Masked Autoencoding Spatiotemporal Transformers for Enhanced Video Forgery Detection [19.432851794777754]
We present a novel approach for the detection of deepfake videos using a pair of vision transformers pre-trained by a self-supervised masked autoencoding setup.
Our method consists of two distinct components, one of which focuses on learning spatial information from individual RGB frames of the video, while the other learns temporal consistency information from optical flow fields generated from consecutive frames.
arXiv Detail & Related papers (2023-06-12T05:49:23Z)
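The two-component design in the entry above can be sketched as late fusion of an RGB branch and an optical-flow branch; the tiny stand-in backbones below are assumptions in place of the paper's masked-autoencoder-pretrained vision transformers.

```python
# Illustrative two-branch detector: one branch sees RGB frames (spatial
# cues), the other sees optical-flow fields (temporal consistency), with
# late fusion into a real/fake classifier.
import torch
import torch.nn as nn

def tiny_branch(in_ch):
    # placeholder for a pretrained ViT branch: patchify, pool, embed
    return nn.Sequential(
        nn.Conv2d(in_ch, 64, 16, stride=16),    # 16x16 patch embedding
        nn.Flatten(2),                          # (B, 64, n_patches)
        nn.AdaptiveAvgPool1d(1), nn.Flatten(),  # (B, 64) pooled embedding
    )

rgb_branch = tiny_branch(3)    # individual RGB frames
flow_branch = tiny_branch(2)   # (dx, dy) flow between consecutive frames
classifier = nn.Linear(64 + 64, 2)

frames = torch.rand(4, 3, 224, 224)
flows = torch.rand(4, 2, 224, 224)
logits = classifier(torch.cat([rgb_branch(frames), flow_branch(flows)], dim=1))
```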
- Bilevel Fast Scene Adaptation for Low-Light Image Enhancement [50.639332885989255]
Enhancing images in low-light scenes is a challenging but widely studied task in computer vision.
The main obstacle lies in modeling the distribution discrepancy across different scenes.
We introduce the bilevel paradigm to model the above latent correspondence.
A bilevel learning framework is constructed to endow the scene-irrelevant generality of the encoder towards diverse scenes.
arXiv Detail & Related papers (2023-06-02T08:16:21Z)
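A first-order sketch of the bilevel idea from the entry above: an inner loop adapts scene-specific parameters to one scene, and an outer loop updates the shared, scene-irrelevant encoder. The modules, losses, and first-order shortcut are assumptions.

```python
# Illustrative first-order bilevel step for scene adaptation.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Conv2d(3, 16, 3, padding=1)   # scene-irrelevant (outer level)
decoder = nn.Conv2d(16, 3, 3, padding=1)   # scene-specific (inner level)
outer_opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)

def bilevel_step(scene_low, scene_normal, inner_steps=3):
    # Inner level: fast adaptation of the decoder to this scene.
    inner_opt = torch.optim.SGD(decoder.parameters(), lr=1e-3)
    for _ in range(inner_steps):
        inner_opt.zero_grad()
        F.l1_loss(decoder(encoder(scene_low)), scene_normal).backward()
        inner_opt.step()
    # Outer level: update the encoder against the adapted decoder.
    outer_opt.zero_grad()
    F.l1_loss(decoder(encoder(scene_low)), scene_normal).backward()
    outer_opt.step()

low, normal = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
bilevel_step(low, normal)
```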
- Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z)
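The context/dynamics separation described above can be sketched as a context encoder run once per video plus a context-free latent transition model; all modules below are generic placeholders for the idea, not ContextWM itself.

```python
# Illustrative context/dynamics split: appearance is summarized once,
# while the dynamics model rolls forward from (state, action) alone;
# the decoder re-attaches the context.
import torch
import torch.nn as nn

ctx_enc = nn.Sequential(                      # context: static appearance
    nn.Conv2d(3, 8, 4, stride=4), nn.Flatten(), nn.LazyLinear(32),
)
dynamics = nn.GRUCell(input_size=4, hidden_size=32)  # context-free transitions
decoder = nn.Linear(32 + 32, 64 * 64 * 3)     # conditions on both parts

frame0 = torch.rand(1, 3, 64, 64)
context = ctx_enc(frame0)                     # computed once for the video
state = torch.zeros(1, 32)
for _ in range(5):
    action = torch.rand(1, 4)
    state = dynamics(action, state)           # dynamics never see appearance
    recon = decoder(torch.cat([state, context], dim=1))
```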
- X-Learner: Learning Cross Sources and Tasks for Universal Visual Representation [71.51719469058666]
We propose a representation learning framework called X-Learner.
X-Learner learns a universal feature representation across multiple vision tasks supervised by various sources.
X-Learner achieves strong performance on different tasks without extra annotations, modalities, or computational cost.
arXiv Detail & Related papers (2022-03-16T17:23:26Z)
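The cross-source, cross-task supervision described above amounts to one shared backbone trained through task-specific heads; the toy tasks, sizes, and joint loss below are assumptions, not X-Learner's actual design.

```python
# Illustrative multi-source, multi-task training over a shared backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
heads = nn.ModuleDict({
    "classification": nn.Linear(32, 1000),
    "retrieval": nn.Linear(32, 128),   # embedding head for a second source
})

def multi_source_loss(batches):
    """batches: {task_name: (images, targets)} from different sources."""
    loss = torch.zeros(())
    for task, (x, y) in batches.items():
        out = heads[task](backbone(x))
        if task == "classification":
            loss = loss + F.cross_entropy(out, y)
        else:
            loss = loss + F.mse_loss(out, y)
    return loss

batches = {
    "classification": (torch.rand(2, 3, 32, 32), torch.randint(0, 1000, (2,))),
    "retrieval": (torch.rand(2, 3, 32, 32), torch.rand(2, 128)),
}
loss = multi_source_loss(batches)
```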
- Towards Scale Consistent Monocular Visual Odometry by Learning from the Virtual World [83.36195426897768]
We propose VRVO, a novel framework for retrieving the absolute scale from virtual data.
We first train a scale-aware disparity network using both monocular real images and stereo virtual data.
The resulting scale-consistent disparities are then integrated with a direct VO system.
arXiv Detail & Related papers (2022-03-11T01:51:54Z)
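One common way to use such scale-consistent disparities is to align the up-to-scale depth implied by the direct VO system with the network's metric depth. The median-ratio alignment below is a widely used heuristic assumed here for illustration; it is not taken from the paper.

```python
# Illustrative scale retrieval via median-ratio alignment.
import torch

def recover_scale(vo_depth_up_to_scale, metric_depth):
    """One global scale factor mapping VO depth to metric units."""
    ratio = metric_depth / vo_depth_up_to_scale.clamp(min=1e-6)
    return ratio.median()

vo_depth = torch.rand(1, 64, 64) * 2          # arbitrary-scale depth map
metric = vo_depth * 5.3                       # pretend network output (meters)
s = recover_scale(vo_depth, metric)           # recovers ~5.3
scaled_t = s * torch.tensor([0.1, 0.0, 1.0])  # rescale the VO translation
```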
- One to Many: Adaptive Instrument Segmentation via Meta Learning and Dynamic Online Adaptation in Robotic Surgical Video [71.43912903508765]
MDAL is a dynamic online adaptive learning scheme for instrument segmentation in robot-assisted surgery.
It learns general knowledge of instruments and a fast adaptation ability through a video-specific meta-learning paradigm.
It outperforms other state-of-the-art methods on two datasets.
arXiv Detail & Related papers (2021-03-24T05:02:18Z)
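The meta-learning plus fast-adaptation recipe described above can be sketched as a MAML-style inner/outer loop over per-video support and query frames; the toy segmentation model and first-order update are assumptions, not MDAL's implementation.

```python
# Illustrative MAML-style step for per-video adaptation: the inner loop
# adapts quickly on a video's first labeled frames (support), then the
# outer loop refines the shared initialization on held-out frames (query).
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Conv2d(3, 2, 3, padding=1)       # toy instrument/background seg
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def seg_loss(frames, masks):
    return F.cross_entropy(model(frames), masks)

def meta_step(support, query):
    """support/query: (frames, masks) pairs from one surgical video."""
    # Inner loop: fast per-video adaptation.
    inner_opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    inner_opt.zero_grad()
    seg_loss(*support).backward()
    inner_opt.step()
    # Outer loop (first-order shortcut): judge the adapted model on
    # held-out frames and update the shared parameters.
    meta_opt.zero_grad()
    seg_loss(*query).backward()
    meta_opt.step()

support = (torch.rand(2, 3, 32, 32), torch.randint(0, 2, (2, 32, 32)))
query = (torch.rand(2, 3, 32, 32), torch.randint(0, 2, (2, 32, 32)))
meta_step(support, query)
```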