Transformer-Based Model for Monocular Visual Odometry: A Video Understanding Approach
- URL: http://arxiv.org/abs/2305.06121v3
- Date: Mon, 20 Jan 2025 19:22:24 GMT
- Title: Transformer-Based Model for Monocular Visual Odometry: A Video Understanding Approach
- Authors: André O. Françani, Marcos R. O. A. Maximo
- Abstract summary: Estimating the camera's pose given images from a single camera is a traditional task in mobile robots.
Deep learning methods have been shown to be generalizable after proper training and with a large amount of available data.
We present the TSformer-VO model, which uses spatio-temporal self-attention mechanisms to extract features from clips and estimate the motions in an end-to-end manner.
- Abstract: Estimating the camera's pose given images from a single camera is a traditional task in mobile robots and autonomous vehicles. This problem is called monocular visual odometry and often relies on geometric approaches that require considerable engineering effort for a specific scenario. Deep learning methods have been shown to be generalizable after proper training and with a large amount of available data. Transformer-based architectures have dominated the state-of-the-art in natural language processing and computer vision tasks, such as image and video understanding. In this work, we treat monocular visual odometry as a video understanding task to estimate the 6 degrees of freedom of a camera's pose. We contribute by presenting the TSformer-VO model, which uses spatio-temporal self-attention mechanisms to extract features from clips and estimate the motions in an end-to-end manner. Our approach achieved competitive state-of-the-art performance compared with geometry-based and deep learning-based methods on the KITTI visual odometry dataset, outperforming the widely adopted DeepVO implementation. The code is publicly available at https://github.com/aofrancani/TSformer-VO.
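As a rough illustration of the pipeline the abstract describes, the sketch below feeds a short clip through patch embedding and a transformer encoder, then regresses the 6-DoF motion between consecutive frames. This is not the authors' implementation (that is in the linked repository): a vanilla transformer encoder stands in for TimeSformer-style factorized spatio-temporal attention, and all dimensions, depths, and layer choices are illustrative assumptions.

```python
# Minimal sketch of the general idea (not the TSformer-VO implementation):
# a transformer over patch tokens from a two-frame clip, with a head that
# regresses the 6-DoF motion between consecutive frames. All sizes here
# are illustrative assumptions.
import torch
import torch.nn as nn

class SpatioTemporalVO(nn.Module):
    def __init__(self, img_size=224, patch=16, frames=2, dim=384, depth=6, heads=6):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # Split each frame into non-overlapping patches and embed them.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, frames * n_patches, dim))
        # A standard encoder stands in for factorized spatio-temporal attention.
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Head regresses (frames - 1) relative motions: 3 translation + 3 rotation.
        self.head = nn.Linear(dim, 6 * (frames - 1))

    def forward(self, clip):                          # clip: (B, T, 3, H, W)
        b, t, c, h, w = clip.shape
        x = self.patch_embed(clip.flatten(0, 1))      # (B*T, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)              # (B*T, N, dim)
        x = x.reshape(b, -1, x.shape[-1]) + self.pos_embed
        x = self.encoder(x).mean(dim=1)               # pool over all tokens
        return self.head(x).view(b, t - 1, 6)         # per-step 6-DoF motion

clip = torch.randn(1, 2, 3, 224, 224)
print(SpatioTemporalVO()(clip).shape)  # torch.Size([1, 1, 6])
```

In the actual task, the six outputs per step would be supervised against the ground-truth translation and rotation between frames, e.g. from the KITTI odometry poses.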
Related papers
- Self-supervised Pretraining and Finetuning for Monocular Depth and Visual Odometry [7.067145619709089]
We show that our self-supervised models can reach state-of-the-art performance 'without bells and whistles'.
For all datasets, our method outperforms state-of-the-art methods, in particular for the depth prediction task.
arXiv Detail & Related papers (2024-06-16T17:24:20Z)
- XVO: Generalized Visual Odometry via Cross-Modal Self-Training [11.70220331540621]
XVO is a semi-supervised learning method for training generalized monocular Visual Odometry (VO) models.
In contrast to standard monocular VO approaches, which often study a known calibration within a single dataset, XVO efficiently learns to recover relative pose with real-world scale.
We optimize the motion estimation model via self-training from large amounts of unconstrained and heterogeneous dash camera videos available on YouTube.
arXiv Detail & Related papers (2023-09-28T18:09:40Z)
- State of the Art in Dense Monocular Non-Rigid 3D Reconstruction [100.9586977875698]
3D reconstruction of deformable (or non-rigid) scenes from a set of monocular 2D image observations is a long-standing and actively researched area of computer vision and graphics.
This survey focuses on state-of-the-art methods for dense non-rigid 3D reconstruction of various deformable objects and composite scenes from monocular videos or sets of monocular views.
arXiv Detail & Related papers (2022-10-27T17:59:53Z)
- Visual Odometry with Neuromorphic Resonator Networks [9.903137966539898]
Visual Odometry (VO) is a method to estimate self-motion of a mobile robot using visual sensors.
Neuromorphic hardware offers low-power solutions to many vision and AI problems.
We present a modular neuromorphic algorithm that achieves state-of-the-art performance on two-dimensional VO tasks.
arXiv Detail & Related papers (2022-09-05T14:57:03Z)
- RelPose: Predicting Probabilistic Relative Rotation for Single Objects in the Wild [73.1276968007689]
We describe a data-driven method for inferring the camera viewpoints given multiple images of an arbitrary object.
We show that our approach outperforms state-of-the-art SfM and SLAM methods given sparse images on both seen and unseen categories.
arXiv Detail & Related papers (2022-08-11T17:59:59Z)
- DONet: Learning Category-Level 6D Object Pose and Size Estimation from Depth Observation [53.55300278592281]
We propose a method of Category-level 6D Object Pose and Size Estimation (COPSE) from a single depth image.
Our framework makes inferences based on the rich geometric information of the object in the depth channel alone.
Our framework competes with state-of-the-art approaches that require labeled real-world images.
arXiv Detail & Related papers (2021-06-27T10:41:50Z)
- DF-VO: What Should Be Learnt for Visual Odometry? [33.379888882093965]
We design a simple yet robust Visual Odometry system by integrating multi-view geometry with deep learning on depth and optical flow.
Comprehensive ablation studies show the effectiveness of the proposed method, and extensive evaluation results show the state-of-the-art performance of our system.
arXiv Detail & Related papers (2021-03-01T11:50:39Z)
- Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency [114.02182755620784]
We present an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion and depth in a monocular camera setup without supervision.
Our framework is shown to outperform the state-of-the-art depth and motion estimation methods.
arXiv Detail & Related papers (2021-02-04T14:26:42Z)
- Wide-angle Image Rectification: A Survey [86.36118799330802]
Wide-angle images contain distortions that violate the assumptions underlying pinhole camera models.
Image rectification, which aims to correct these distortions, can solve these problems.
We present a detailed description and discussion of the camera models used in different approaches.
Next, we review both traditional geometry-based image rectification methods and deep learning-based methods.
arXiv Detail & Related papers (2020-10-30T17:28:40Z)
- Neural Ray Surfaces for Self-Supervised Learning of Depth and Ego-motion [51.19260542887099]
We show that self-supervision can be used to learn accurate depth and ego-motion estimation without prior knowledge of the camera model.
Inspired by the geometric model of Grossberg and Nayar, we introduce Neural Ray Surfaces (NRS), convolutional networks that represent pixel-wise projection rays.
We demonstrate the use of NRS for self-supervised learning of visual odometry and depth estimation from raw videos obtained using a wide variety of camera systems.
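The core idea, learning a per-pixel projection ray instead of assuming a fixed pinhole model, can be sketched as follows. This toy example is not the NRS architecture from the paper; the network size and the unprojection step are illustrative assumptions.

```python
# Toy sketch of learning pixel-wise projection rays (inspired by, but not
# reproducing, NRS): a small convolutional network predicts a unit ray per
# pixel, replacing a fixed pinhole model when lifting a depth map to 3D.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RaySurfaceNet(nn.Module):
    """Maps an RGB image to a per-pixel 3D ray direction."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),   # 3 channels = ray (x, y, z)
        )

    def forward(self, img):                   # img: (B, 3, H, W)
        return F.normalize(self.net(img), dim=1)  # unit-norm rays per pixel

def unproject(depth, rays):
    """Lift a depth map to 3D points: point = depth * ray, per pixel."""
    return depth.unsqueeze(1) * rays          # (B, 3, H, W)

img = torch.randn(1, 3, 64, 64)
depth = torch.rand(1, 64, 64)
print(unproject(depth, RaySurfaceNet()(img)).shape)  # torch.Size([1, 3, 64, 64])
```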
arXiv Detail & Related papers (2020-08-15T02:29:13Z)
- A Geometric Perspective on Visual Imitation Learning [8.904045267033258]
We consider the problem of visual imitation learning without human supervision.
We propose VGS-IL (Visual Geometric Skill Learning), which infers globally consistent geometric feature association rules from human video frames.
arXiv Detail & Related papers (2020-03-05T16:57:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.