Benchmarking Visual-Inertial Deep Multimodal Fusion for Relative Pose
Regression and Odometry-aided Absolute Pose Regression
- URL: http://arxiv.org/abs/2208.00919v3
- Date: Fri, 4 Aug 2023 08:36:02 GMT
- Title: Benchmarking Visual-Inertial Deep Multimodal Fusion for Relative Pose
Regression and Odometry-aided Absolute Pose Regression
- Authors: Felix Ott and Nisha Lakshmana Raichur and David Rügamer and Tobias
Feigl and Heiko Neumann and Bernd Bischl and Christopher Mutschler
- Abstract summary: Visual-inertial localization is a key problem in computer vision and robotics applications such as virtual reality, self-driving cars, and aerial vehicles.
In this work, we conduct a benchmark to evaluate deep multimodal fusion based on pose graph optimization and attention networks.
We show accuracy improvements on the APR-RPR and RPR-RPR tasks for aerial vehicles and handheld devices.
- Score: 6.557612703872671
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual-inertial localization is a key problem in computer vision and robotics
applications such as virtual reality, self-driving cars, and aerial vehicles.
The goal is to accurately estimate the pose of an object when either the
environment or the dynamics are known. Absolute pose regression (APR)
techniques directly regress the absolute pose from an image input in a known
scene using convolutional and spatio-temporal networks. Odometry methods
perform relative pose regression (RPR), predicting the relative pose from
known object dynamics (visual or inertial inputs). The localization task can
be improved by combining information from both data sources in a cross-modal
setup, which is challenging because the two tasks impose partly contradictory
objectives. In this work,
we conduct a benchmark to evaluate deep multimodal fusion based on pose graph
optimization and attention networks. Auxiliary and Bayesian learning are
utilized for the APR task. We show accuracy improvements for the APR-RPR task
and for the RPR-RPR task for aerial vehicles and handheld devices. We conduct
experiments on the EuRoC MAV and PennCOSYVIO datasets and record and evaluate a
novel industry dataset.
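As a concrete illustration of the attention-based fusion evaluated here, the following is a minimal sketch: a hypothetical PyTorch module in which a visual feature attends to an inertial feature before a pose head. The encoders, feature sizes, and head design are assumptions for illustration, not the benchmarked architectures.

```python
import torch
import torch.nn as nn

class AttentionFusionPoseNet(nn.Module):
    """Hypothetical fusion module: cross-attention between visual and
    inertial feature vectors, followed by translation/rotation heads."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.visual_proj = nn.Linear(512, dim)    # e.g. CNN global features
        self.inertial_proj = nn.Linear(128, dim)  # e.g. IMU LSTM features
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.trans_head = nn.Linear(dim, 3)  # translation (x, y, z)
        self.rot_head = nn.Linear(dim, 4)    # rotation as a quaternion

    def forward(self, visual_feat, inertial_feat):
        # Treat each modality as a one-token sequence; the visual token
        # attends to the inertial token, with a residual connection.
        v = self.visual_proj(visual_feat).unsqueeze(1)    # (B, 1, dim)
        i = self.inertial_proj(inertial_feat).unsqueeze(1)
        fused, _ = self.cross_attn(query=v, key=i, value=i)
        fused = (fused + v).squeeze(1)
        t = self.trans_head(fused)
        q = self.rot_head(fused)
        return t, q / q.norm(dim=-1, keepdim=True)  # unit quaternion

# Usage with random stand-in features:
net = AttentionFusionPoseNet()
t, q = net(torch.randn(2, 512), torch.randn(2, 128))
print(t.shape, q.shape)  # torch.Size([2, 3]) torch.Size([2, 4])
```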
Related papers
- Deep Homography Estimation for Visual Place Recognition [49.235432979736395]
We propose a transformer-based deep homography estimation (DHE) network.
It takes the dense feature map extracted by a backbone network as input and fits homography for fast and learnable geometric verification.
Experiments on benchmark datasets show that our method can outperform several state-of-the-art methods.
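As a point of reference for what the DHE network learns to replace, here is a sketch of the classical counterpart: fitting a homography from point matches with the direct linear transform (DLT) and scoring geometric consistency by inlier ratio. This is the standard textbook method, not the DHE architecture.

```python
import numpy as np

def fit_homography(src, dst):
    """Direct linear transform: src, dst are (N, 2) matched points, N >= 4."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A))   # h = null vector of A
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def inlier_ratio(H, src, dst, thresh=3.0):
    """Fraction of matches consistent with H: the verification score."""
    pts = np.hstack([src, np.ones((len(src), 1))])
    proj = pts @ H.T
    proj = proj[:, :2] / proj[:, 2:3]
    return float(np.mean(np.linalg.norm(proj - dst, axis=1) < thresh))

# Self-check: points related by a known homography score ~1.0.
rng = np.random.default_rng(0)
H_true = np.array([[1.1, 0.02, 5.0], [0.01, 0.9, -3.0], [1e-4, 0.0, 1.0]])
src = rng.random((20, 2)) * 100
p = np.hstack([src, np.ones((20, 1))]) @ H_true.T
dst = p[:, :2] / p[:, 2:3]
print(inlier_ratio(fit_homography(src, dst), src, dst))  # 1.0
```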
arXiv Detail & Related papers (2024-02-25T13:22:17Z)
- KS-APR: Keyframe Selection for Robust Absolute Pose Regression [2.541264438930729]
Markerless Mobile Augmented Reality (AR) aims to anchor digital content in the physical world without using specific 2D or 3D objects.
End-to-end machine learning solutions infer the device's pose from a single monocular image.
APR methods tend to yield significant inaccuracies for input images that are too distant from the training set.
This paper introduces KS-APR, a pipeline that assesses the reliability of an estimated pose with minimal overhead.
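A minimal sketch of the general idea of reliability gating follows; the descriptor matching and threshold are illustrative assumptions, not KS-APR's actual pipeline.

```python
import numpy as np

class KeyframeGate:
    """Trust a regressed pose only if the query image is close to some
    training keyframe in a global-descriptor space (hypothetical gate)."""

    def __init__(self, keyframe_descs, threshold=0.8):
        # keyframe_descs: (K, D) L2-normalized descriptors of training images
        self.kf = keyframe_descs
        self.threshold = threshold

    def is_reliable(self, query_desc):
        sims = self.kf @ query_desc  # cosine similarity to each keyframe
        return float(sims.max()) >= self.threshold

# Usage with random unit descriptors:
rng = np.random.default_rng(0)
kf = rng.normal(size=(100, 64))
kf /= np.linalg.norm(kf, axis=1, keepdims=True)
q = kf[3] + 0.05 * rng.normal(size=64)  # query near keyframe 3
q /= np.linalg.norm(q)
print(KeyframeGate(kf).is_reliable(q))  # True
```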
arXiv Detail & Related papers (2023-08-10T09:32:20Z)
- Fusing Structure from Motion and Simulation-Augmented Pose Regression from Optical Flow for Challenging Indoor Environments [13.654208446015824]
The localization of objects is a crucial task in various applications such as robotics, virtual and augmented reality, and the transportation of goods in warehouses.
Recent advances in deep learning have enabled localization from monocular camera images.
This study addresses the challenges of such difficult indoor environments by incorporating additional information and regularizing the absolute pose using relative pose regression (RPR) methods.
arXiv Detail & Related papers (2023-04-14T16:58:23Z)
- Learning to Localize in Unseen Scenes with Relative Pose Regressors [5.672132510411465]
Relative pose regressors (RPRs) localize a camera by estimating its relative translation and rotation to a pose-labelled reference.
In practice, however, the performance of RPRs is significantly degraded in unseen scenes.
We implement aggregation with concatenation, projection, and attention operations (Transformers) and learn to regress the relative pose parameters from the resulting latent codes.
Compared to state-of-the-art RPRs, our model is shown to localize significantly better in unseen environments, across both indoor and outdoor benchmarks, while maintaining competitive performance in seen scenes.
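The aggregation variants named above can be sketched as follows; dimensions, head counts, and the pooling step are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class RelativePoseAggregator(nn.Module):
    """Aggregate query/reference features into a latent code, then
    regress relative translation and rotation (hypothetical module)."""

    def __init__(self, dim=256, heads=8, mode="attention"):
        super().__init__()
        self.mode = mode
        if mode == "concat":
            self.proj = nn.Linear(2 * dim, dim)  # concatenation + projection
        else:
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 7)  # 3D translation + 4D quaternion

    def forward(self, query_tokens, ref_tokens):
        # query_tokens, ref_tokens: (B, N, dim) flattened feature maps.
        if self.mode == "concat":
            z = self.proj(torch.cat([query_tokens, ref_tokens], dim=-1))
        else:
            # Query features attend to the reference image's features.
            z, _ = self.attn(query_tokens, ref_tokens, ref_tokens)
        latent = z.mean(dim=1)  # pool tokens into one latent code
        pose = self.head(latent)
        t, q = pose[:, :3], pose[:, 3:]
        return t, q / q.norm(dim=-1, keepdim=True)

agg = RelativePoseAggregator(mode="attention")
t, q = agg(torch.randn(2, 49, 256), torch.randn(2, 49, 256))
print(t.shape, q.shape)  # torch.Size([2, 3]) torch.Size([2, 4])
```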
arXiv Detail & Related papers (2023-03-05T17:12:50Z)
- Dyna-DepthFormer: Multi-frame Transformer for Self-Supervised Depth Estimation in Dynamic Scenes [19.810725397641406]
We propose a novel Dyna-Depthformer framework, which predicts scene depth and 3D motion field jointly.
Our contributions are two-fold. First, we leverage multi-view correlation through a series of self- and cross-attention layers in order to obtain enhanced depth feature representation.
Second, we propose a warping-based Motion Network to estimate the motion field of dynamic objects without using semantic prior.
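The self-/cross-attention pattern of the first contribution can be sketched as a single block; layer sizes are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiViewAttentionBlock(nn.Module):
    """Reference-frame features attend to themselves (intra-frame
    context), then to a neighboring frame (multi-view correlation)."""

    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, ref, src):
        # ref, src: (B, N, dim) tokens from the reference/source frames.
        a, _ = self.self_attn(ref, ref, ref)
        ref = self.norm1(ref + a)
        a, _ = self.cross_attn(ref, src, src)
        return self.norm2(ref + a)  # enhanced depth feature representation

block = MultiViewAttentionBlock()
out = block(torch.randn(2, 100, 128), torch.randn(2, 100, 128))
print(out.shape)  # torch.Size([2, 100, 128])
```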
arXiv Detail & Related papers (2023-01-14T09:43:23Z)
- DeepRM: Deep Recurrent Matching for 6D Pose Refinement [77.34726150561087]
DeepRM is a novel recurrent network architecture for 6D pose refinement.
The architecture incorporates LSTM units to propagate information through each refinement step.
DeepRM achieves state-of-the-art performance on two widely accepted challenging datasets.
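A minimal sketch of recurrent refinement, with an LSTM cell carrying state across steps; the feature input and the additive update parameterization are assumptions, not DeepRM's actual design.

```python
import torch
import torch.nn as nn

class RecurrentRefiner(nn.Module):
    def __init__(self, feat_dim=256, hidden=128, steps=3):
        super().__init__()
        self.steps = steps
        self.cell = nn.LSTMCell(feat_dim, hidden)
        self.delta = nn.Linear(hidden, 6)  # pose update: 3 trans + 3 rot

    def forward(self, feats, pose_init):
        # feats: (B, feat_dim) matching features; pose_init: (B, 6).
        # (A real refiner would recompute feats from the re-rendered
        # pose at each step; we keep them fixed for brevity.)
        B = feats.shape[0]
        h = feats.new_zeros(B, self.cell.hidden_size)
        c = feats.new_zeros(B, self.cell.hidden_size)
        pose = pose_init
        for _ in range(self.steps):
            h, c = self.cell(feats, (h, c))  # state propagates across steps
            pose = pose + self.delta(h)      # additive refinement
        return pose

refiner = RecurrentRefiner()
print(refiner(torch.randn(2, 256), torch.zeros(2, 6)).shape)  # (2, 6)
```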
arXiv Detail & Related papers (2022-05-28T16:18:08Z)
- Towards Scale Consistent Monocular Visual Odometry by Learning from the Virtual World [83.36195426897768]
We propose VRVO, a novel framework for retrieving the absolute scale from virtual data.
We first train a scale-aware disparity network using both monocular real images and stereo virtual data.
The resulting scale-consistent disparities are then integrated with a direct VO system.
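The scale-alignment step can be illustrated in isolation: once a disparity network produces metrically scale-consistent depth, an up-to-scale VO depth map can be rescaled by a single factor. This is a simplified stand-in; VRVO couples the network and the direct VO system much more tightly.

```python
import numpy as np

def estimate_scale(net_depth, vo_depth):
    """Median ratio between network depth (metric) and VO depth
    (arbitrary scale) over valid pixels."""
    valid = (net_depth > 0) & (vo_depth > 0)
    return np.median(net_depth[valid] / vo_depth[valid])

# VO depths off by an unknown global factor (here 2.5):
rng = np.random.default_rng(0)
true_depth = rng.random((480, 640)) * 10 + 1
vo_depth = true_depth / 2.5
print(estimate_scale(true_depth, vo_depth))  # ~2.5
```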
arXiv Detail & Related papers (2022-03-11T01:51:54Z)
- Towards Optimal Strategies for Training Self-Driving Perception Models in Simulation [98.51313127382937]
We focus on the use of labels in the synthetic domain alone.
Our approach introduces both a way to learn neural-invariant representations and a theoretically inspired view on how to sample the data from the simulator.
We showcase our approach on the bird's-eye-view vehicle segmentation task with multi-sensor data.
arXiv Detail & Related papers (2021-11-15T18:37:43Z)
- TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks.
To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame.
Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art specifically designed for each of the trajectory and pose forecasting tasks.
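The visibility-indicator idea can be sketched as an auxiliary prediction head next to the forecast head; the shapes and the loss are assumptions, not TRiPOD's exact formulation.

```python
import torch
import torch.nn as nn

class JointForecastWithVisibility(nn.Module):
    """Predict per-joint offsets plus a probability that each joint is
    visible in a frame (hypothetical heads over fused features)."""

    def __init__(self, hidden=128, joints=17):
        super().__init__()
        self.pose_head = nn.Linear(hidden, joints * 3)  # 3D joint offsets
        self.vis_head = nn.Linear(hidden, joints)       # visibility logits

    def forward(self, h):
        # h: (B, hidden) fused trajectory/pose/context features.
        pose = self.pose_head(h)
        vis = torch.sigmoid(self.vis_head(h))  # P(joint visible)
        return pose, vis

model = JointForecastWithVisibility()
pose, vis = model(torch.randn(2, 128))
# The indicator can down-weight losses on invisible joints:
loss = nn.functional.binary_cross_entropy(vis, torch.ones_like(vis))
print(pose.shape, vis.shape, loss.item() >= 0)
```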
arXiv Detail & Related papers (2021-04-08T20:01:00Z)
- Risk-Averse MPC via Visual-Inertial Input and Recurrent Networks for Online Collision Avoidance [95.86944752753564]
We propose an online path planning architecture that extends the model predictive control (MPC) formulation to consider future location uncertainties.
Our algorithm combines an object detection pipeline with a recurrent neural network (RNN) which infers the covariance of state estimates.
The robustness of our method is validated on complex quadruped robot dynamics, and the approach can be applied to most robotic platforms.
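A minimal sketch of the covariance-inference component: an RNN maps a history of states to predicted variances that a risk-aware MPC could use as uncertainty terms. The input features and the diagonal-covariance output are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class CovarianceRNN(nn.Module):
    def __init__(self, state_dim=6, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(state_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 3)  # log-variances for (x, y, z)

    def forward(self, state_history):
        # state_history: (B, T, state_dim) past relative states.
        _, h = self.rnn(state_history)
        log_var = self.head(h[-1])
        return torch.exp(log_var)  # exp keeps variances positive

model = CovarianceRNN()
var = model(torch.randn(4, 20, 6))
print(var.shape)  # torch.Size([4, 3]); feeds the MPC's risk constraint
```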
arXiv Detail & Related papers (2020-07-28T07:34:30Z)
- Denoising IMU Gyroscopes with Deep Learning for Open-Loop Attitude Estimation [0.0]
This paper proposes a learning method for denoising gyroscopes of Inertial Measurement Units (IMUs) using ground truth data.
The obtained algorithm outperforms the state-of-the-art on the (unseen) test sequences.
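A minimal sketch of the setting: a small temporal network predicts a residual correction to raw gyro rates (trained against ground truth), and the corrected rates are integrated open-loop into attitude. The residual conv architecture is an assumption, not the paper's network.

```python
import torch
import torch.nn as nn

class GyroDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(3, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(32, 3, kernel_size=5, padding=2),
        )

    def forward(self, gyro):
        # gyro: (B, 3, T) raw angular rates; residual correction.
        return gyro + self.net(gyro)

model = GyroDenoiser()
raw = torch.randn(1, 3, 200) * 0.01 + 0.002  # noisy rates with a bias
corrected = model(raw)

# Open-loop attitude: integrate corrected rates (small-angle Euler step;
# a proper implementation would integrate on SO(3)).
dt = 0.005
angles = torch.cumsum(corrected * dt, dim=-1)  # (1, 3, T)
print(angles.shape)
```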
arXiv Detail & Related papers (2020-02-25T08:04:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.