Transformer-based model for monocular visual odometry: a video
understanding approach
- URL: http://arxiv.org/abs/2305.06121v2
- Date: Tue, 12 Sep 2023 19:07:51 GMT
- Title: Transformer-based model for monocular visual odometry: a video
understanding approach
- Authors: André O. Françani and Marcos R. O. A. Maximo
- Abstract summary: We treat monocular visual odometry as a video understanding task to estimate the 6-DoF camera pose.
We contribute by presenting the TSformer-VO model based on spatio-temporal self-attention mechanisms to extract features from clips and estimate the motions in an end-to-end manner.
Our approach achieved competitive state-of-the-art performance compared with geometry-based and deep learning-based methods on the KITTI visual odometry dataset.
- Score: 0.9790236766474201
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Estimating the camera's pose given images of a single camera is a traditional
task in mobile robots and autonomous vehicles. This problem is called monocular
visual odometry and it often relies on geometric approaches that require
considerable engineering effort for a specific scenario. Deep learning methods
have been shown to generalize well after proper training on a large amount of
available data. Transformer-based architectures have dominated the
state-of-the-art in natural language processing and computer vision tasks, such
as image and video understanding. In this work, we treat monocular visual
odometry as a video understanding task to estimate the 6-DoF camera pose. We
contribute by presenting the TSformer-VO model based on
spatio-temporal self-attention mechanisms to extract features from clips and
estimate the motions in an end-to-end manner. Our approach achieved competitive
state-of-the-art performance compared with geometry-based and deep
learning-based methods on the KITTI visual odometry dataset, outperforming the
DeepVO implementation highly accepted in the visual odometry community.
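The core idea in the abstract is divided spatio-temporal self-attention over a clip of patch tokens, followed by a head that regresses 6-DoF motions. Below is a toy NumPy sketch of that attention pattern in the TimeSformer style; all shapes, weight matrices, and the pose head are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # x: (tokens, d) -> scaled dot-product self-attention
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return scores @ v

def divided_space_time_attention(tokens, Wt, Ws):
    # tokens: (T, N, d) -- T frames, N patches per frame, embedding dim d
    T, N, d = tokens.shape
    # temporal attention: each spatial location attends across frames
    out = np.empty_like(tokens)
    for n in range(N):
        out[:, n] = self_attention(tokens[:, n], *Wt)
    # spatial attention: each frame's patches attend within the frame
    out2 = np.empty_like(out)
    for t in range(T):
        out2[t] = self_attention(out[t], *Ws)
    return out2

rng = np.random.default_rng(0)
T, N, d = 4, 9, 16  # 4-frame clip, 3x3 patch grid, toy embedding size
tokens = rng.normal(size=(T, N, d))          # stand-in for patch embeddings
Wt = [rng.normal(scale=0.1, size=(d, d)) for _ in range(3)]  # temporal Q/K/V
Ws = [rng.normal(scale=0.1, size=(d, d)) for _ in range(3)]  # spatial Q/K/V
feats = divided_space_time_attention(tokens, Wt, Ws)

# hypothetical head: one 6-DoF motion (3 translation + 3 rotation parameters)
# per consecutive frame pair, pooled over patches
W_head = rng.normal(scale=0.1, size=(d, 6))
poses = feats.mean(axis=1)[:-1] @ W_head     # shape (T-1, 6)
print(poses.shape)  # (3, 6)
```

In the real model these blocks would be stacked with residual connections, layer normalization, and multiple heads, and trained end-to-end against ground-truth motions; the sketch only shows why temporal attention lets every patch aggregate motion cues from the other frames in the clip before the pose is regressed.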
Related papers
- Self-supervised Pretraining and Finetuning for Monocular Depth and Visual Odometry [7.067145619709089]
We show that our self-supervised models can reach state-of-the-art performance 'without bells and whistles'.
For all datasets, our method outperforms state-of-the-art methods, in particular for the depth prediction task.
arXiv Detail & Related papers (2024-06-16T17:24:20Z) - Learning depth from monocular video sequences [0.0]
We propose a novel training loss which enables us to include more images for supervision during the training process.
We also design a novel network architecture for single image estimation.
arXiv Detail & Related papers (2023-10-26T05:00:41Z) - State of the Art in Dense Monocular Non-Rigid 3D Reconstruction [100.9586977875698]
3D reconstruction of deformable (or non-rigid) scenes from a set of monocular 2D image observations is a long-standing and actively researched area of computer vision and graphics.
This survey focuses on state-of-the-art methods for dense non-rigid 3D reconstruction of various deformable objects and composite scenes from monocular videos or sets of monocular views.
arXiv Detail & Related papers (2022-10-27T17:59:53Z) - Visual Odometry with Neuromorphic Resonator Networks [9.903137966539898]
Visual Odometry (VO) is a method to estimate self-motion of a mobile robot using visual sensors.
Neuromorphic hardware offers low-power solutions to many vision and AI problems.
We present a modular neuromorphic algorithm that achieves state-of-the-art performance on two-dimensional VO tasks.
arXiv Detail & Related papers (2022-09-05T14:57:03Z) - RelPose: Predicting Probabilistic Relative Rotation for Single Objects
in the Wild [73.1276968007689]
We describe a data-driven method for inferring the camera viewpoints given multiple images of an arbitrary object.
We show that our approach outperforms state-of-the-art SfM and SLAM methods given sparse images on both seen and unseen categories.
arXiv Detail & Related papers (2022-08-11T17:59:59Z) - DONet: Learning Category-Level 6D Object Pose and Size Estimation from
Depth Observation [53.55300278592281]
We propose a method of Category-level 6D Object Pose and Size Estimation (COPSE) from a single depth image.
Our framework makes inferences based on the rich geometric information of the object in the depth channel alone.
Our framework competes with state-of-the-art approaches that require labeled real-world images.
arXiv Detail & Related papers (2021-06-27T10:41:50Z) - DF-VO: What Should Be Learnt for Visual Odometry? [33.379888882093965]
We design a simple yet robust Visual Odometry system by integrating multi-view geometry and deep learning on Depth and optical Flow.
Comprehensive ablation studies show the effectiveness of the proposed method, and extensive evaluation results show the state-of-the-art performance of our system.
arXiv Detail & Related papers (2021-03-01T11:50:39Z) - Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection
Consistency [114.02182755620784]
We present an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion and depth in a monocular camera setup without supervision.
Our framework is shown to outperform the state-of-the-art depth and motion estimation methods.
arXiv Detail & Related papers (2021-02-04T14:26:42Z) - Wide-angle Image Rectification: A Survey [86.36118799330802]
Wide-angle images contain distortions that violate the assumptions underlying pinhole camera models.
Image rectification, which aims to correct these distortions, can solve these problems.
We present a detailed description and discussion of the camera models used in different approaches.
Next, we review both traditional geometry-based image rectification methods and deep learning-based methods.
arXiv Detail & Related papers (2020-10-30T17:28:40Z) - Neural Ray Surfaces for Self-Supervised Learning of Depth and Ego-motion [51.19260542887099]
We show that self-supervision can be used to learn accurate depth and ego-motion estimation without prior knowledge of the camera model.
Inspired by the geometric model of Grossberg and Nayar, we introduce Neural Ray Surfaces (NRS), convolutional networks that represent pixel-wise projection rays.
We demonstrate the use of NRS for self-supervised learning of visual odometry and depth estimation from raw videos obtained using a wide variety of camera systems.
arXiv Detail & Related papers (2020-08-15T02:29:13Z) - A Geometric Perspective on Visual Imitation Learning [8.904045267033258]
We consider the problem of visual imitation learning without human supervision.
We propose VGS-IL (Visual Geometric Skill Learning), which infers globally consistent geometric feature association rules from human video frames.
arXiv Detail & Related papers (2020-03-05T16:57:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.