Learning Monocular Visual Odometry via Self-Supervised Long-Term Modeling
- URL: http://arxiv.org/abs/2007.10983v1
- Date: Tue, 21 Jul 2020 17:59:01 GMT
- Title: Learning Monocular Visual Odometry via Self-Supervised Long-Term Modeling
- Authors: Yuliang Zou, Pan Ji, Quoc-Huy Tran, Jia-Bin Huang, Manmohan Chandraker
- Abstract summary: Monocular visual odometry (VO) suffers severely from error accumulation during frame-to-frame pose estimation.
We present a self-supervised learning method for VO with special consideration for consistency over longer sequences.
We train the networks with purely self-supervised losses, including a cycle consistency loss that mimics the loop closure module in geometric VO.
- Score: 106.15327903038705
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Monocular visual odometry (VO) suffers severely from error accumulation
during frame-to-frame pose estimation. In this paper, we present a
self-supervised learning method for VO with special consideration for
consistency over longer sequences. To this end, we model the long-term
dependency in pose prediction using a pose network that features a two-layer
convolutional LSTM module. We train the networks with purely self-supervised
losses, including a cycle consistency loss that mimics the loop closure module
in geometric VO. Inspired by prior geometric systems, we allow the networks to
see beyond a small temporal window during training, through a novel loss that
incorporates temporally distant (e.g., O(100)) frames. Given GPU memory
constraints, we propose a stage-wise training mechanism, where the first stage
operates in a local time window and the second stage refines the poses with a
"global" loss given the first stage features. We demonstrate competitive
results on several standard VO datasets, including KITTI and TUM RGB-D.
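As a rough illustration of the cycle-consistency idea above, the sketch below composes predicted frame-to-frame poses forward through a window and back again, then penalizes any drift from the identity, mimicking a loop-closure check. The function names, 4x4 SE(3) parameterization, and the translation-plus-rotation penalty are our own simplifications for illustration, not the paper's actual loss or network.

```python
import numpy as np

def se3_inv(T):
    """Invert a 4x4 rigid-body transform."""
    R, t = T[:3, :3], T[:3, 3]
    Tinv = np.eye(4)
    Tinv[:3, :3] = R.T
    Tinv[:3, 3] = -R.T @ t
    return Tinv

def cycle_consistency_loss(fwd, bwd):
    """Compose relative poses forward through a window and back again.
    A drift-free estimator returns to the identity, so the residual
    translation and rotation act as a loop-closure-style penalty.
    fwd[i]: 4x4 transform taking frame-i coordinates to frame-(i+1)
    bwd[i]: 4x4 transform taking frame-(i+1) coordinates to frame-i
    """
    loop = np.eye(4)
    for T in fwd:               # walk forward through the window
        loop = T @ loop
    for T in reversed(bwd):     # walk back to the starting frame
        loop = T @ loop
    t_err = np.linalg.norm(loop[:3, 3])                # translation drift
    cos_r = (np.trace(loop[:3, :3]) - 1.0) / 2.0
    r_err = np.arccos(np.clip(cos_r, -1.0, 1.0))       # rotation drift (rad)
    return t_err + r_err

# toy check: a perfectly consistent forward/backward pair gives ~zero loss
T01 = np.eye(4)
T01[:3, 3] = [0.1, 0.0, 1.0]
print(cycle_consistency_loss([T01], [se3_inv(T01)]))   # ~0.0
```

In the paper, constraints of this kind are combined with the usual photometric self-supervision, and the stage-wise training lets them span temporally distant frames (on the order of 100) rather than a short snippet.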
Related papers
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modeling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters (see the balancing sketch after this list).
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
- Global-to-Local Modeling for Video-based 3D Human Pose and Shape Estimation [53.04781510348416]
Video-based 3D human pose and shape estimation is evaluated by intra-frame accuracy and inter-frame smoothness.
We propose to structurally decouple the modeling of long-term and short-term correlations in an end-to-end framework, the Global-to-Local Transformer (GLoT).
Our GLoT surpasses previous state-of-the-art methods with the lowest model parameters on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M.
arXiv Detail & Related papers (2023-03-26T14:57:49Z)
- Large Scale Time-Series Representation Learning via Simultaneous Low and High Frequency Feature Bootstrapping [7.0064929761691745]
We propose a non-contrastive self-supervised learning approach that efficiently captures low- and high-frequency time-varying features.
Our method takes raw time series data as input and creates two different augmented views for two branches of the model.
To demonstrate the robustness of our model we performed extensive experiments and ablation studies on five real-world time-series datasets.
arXiv Detail & Related papers (2022-04-24T14:39:47Z)
- Decoupled Multi-task Learning with Cyclical Self-Regulation for Face Parsing [71.19528222206088]
We propose a novel Decoupled Multi-task Learning with Cyclical Self-Regulation (DML-CSR) approach for face parsing.
Specifically, DML-CSR designs a multi-task model which comprises face parsing, binary edge, and category edge detection.
Our method achieves the new state-of-the-art performance on the Helen, CelebA-HQ, and LapaMask datasets.
arXiv Detail & Related papers (2022-03-28T02:12:30Z)
- RAUM-VO: Rotational Adjusted Unsupervised Monocular Visual Odometry [0.0]
We present RAUM-VO, an approach based on a model-free epipolar constraint for frame-to-frame motion estimation.
RAUM-VO shows a considerable accuracy improvement compared to other unsupervised pose networks on the KITTI dataset.
arXiv Detail & Related papers (2022-03-14T15:03:24Z)
- Imputing Missing Observations with Time Sliced Synthetic Minority Oversampling Technique [0.3973560285628012]
We present a simple yet novel time series imputation technique with the goal of constructing an irregular time series that is uniform across every sample in a data set.
We fix a grid defined by the midpoints of non-overlapping bins (dubbed "slices") of observation times and ensure that each sample has values for all of the features at that given time.
This allows one both to impute fully missing observations, enabling uniform time series classification across the entire data set, and, in special cases, to impute individually missing features (a minimal gridding sketch follows this list).
arXiv Detail & Related papers (2022-01-14T19:23:24Z)
- Unsupervised Monocular Depth Learning with Integrated Intrinsics and Spatio-Temporal Constraints [61.46323213702369]
This work presents an unsupervised learning framework that is able to predict at-scale depth maps and egomotion.
Our results demonstrate strong performance when compared to the current state-of-the-art on multiple sequences of the KITTI driving dataset.
arXiv Detail & Related papers (2020-11-02T22:26:58Z)
- Convolutional Tensor-Train LSTM for Spatio-temporal Learning [116.24172387469994]
We propose a higher-order LSTM model that can efficiently learn long-term correlations in the video sequence.
This is accomplished through a novel tensor train module that performs prediction by combining convolutional features across time.
Our results achieve state-of-the-art performance in a wide range of applications and datasets.
arXiv Detail & Related papers (2020-02-21T05:00:01Z)
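The SIGMA entry above only states that tube features are spread evenly over a limited number of learnable clusters. A common way to enforce such balanced assignments is Sinkhorn(-Knopp) normalization of the feature-to-prototype similarity matrix, which the title alludes to; the sketch below shows only that balancing step, with our own names and hyperparameters, and is not taken from the SIGMA implementation.

```python
import numpy as np

def sinkhorn_assign(scores, n_iters=3, eps=0.05):
    """Balance soft cluster assignments so every cluster receives roughly
    the same total mass, by alternately normalizing the columns (clusters)
    and rows (samples) of exp(scores / eps).
    scores: (N, K) similarities between N features and K cluster prototypes.
    Returns an (N, K) soft-assignment matrix whose rows sum to 1.
    """
    Q = np.exp(scores / eps)
    N, K = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=0, keepdims=True)   # each cluster column sums to 1
        Q /= K                              # ... i.e. gets mass 1/K overall
        Q /= Q.sum(axis=1, keepdims=True)   # each sample row sums to 1
        Q /= N                              # ... i.e. gets mass 1/N overall
    return Q * N                            # rescale so rows sum to 1

# toy usage: scores heavily skewed toward cluster 0 still end up spread out
rng = np.random.default_rng(0)
scores = rng.normal(size=(16, 4)) + np.array([3.0, 0.0, 0.0, 0.0])
Q = sinkhorn_assign(scores)
print(Q.sum(axis=0))   # column totals are roughly equal (about N / K each)
```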
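The Time Sliced SMOTE entry above describes snapping each sample's irregular observation times onto a grid defined by the midpoints of non-overlapping bins. The sketch below covers only that gridding step, under assumed conventions (averaging observations inside a slice, NaN for empty slices); the SMOTE-style oversampling used to fill the remaining gaps is not shown.

```python
import numpy as np

def time_slice_grid(times, values, t_min, t_max, n_slices):
    """Snap one sample's irregular observations onto a uniform grid given by
    the midpoints of non-overlapping time bins ("slices").
    times:  (M,) observation times for this sample
    values: (M, F) feature values at those times
    Returns the (n_slices,) midpoints and an (n_slices, F) array in which
    empty slices are NaN, to be imputed afterwards.
    """
    edges = np.linspace(t_min, t_max, n_slices + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    grid = np.full((n_slices, values.shape[1]), np.nan)
    # bin index of each observation (clip so t_max falls in the last slice)
    idx = np.clip(np.digitize(times, edges) - 1, 0, n_slices - 1)
    for b in range(n_slices):
        in_bin = idx == b
        if in_bin.any():
            grid[b] = values[in_bin].mean(axis=0)   # average within the slice
    return mids, grid

# toy usage: five irregular observations of two features onto four slices
t = np.array([0.10, 0.15, 0.60, 0.62, 0.90])
v = np.stack([10.0 * t, t ** 2], axis=1)
mids, grid = time_slice_grid(t, v, 0.0, 1.0, 4)
print(mids)
print(grid)   # the slice with no observations stays NaN until imputed
```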