Modality-invariant Visual Odometry for Embodied Vision
- URL: http://arxiv.org/abs/2305.00348v1
- Date: Sat, 29 Apr 2023 21:47:12 GMT
- Title: Modality-invariant Visual Odometry for Embodied Vision
- Authors: Marius Memmel, Roman Bachmann, Amir Zamir
- Abstract summary: Visual Odometry (VO) is a practical substitute for unreliable GPS and compass sensors.
Recent deep VO models limit themselves to a fixed set of input modalities, e.g., RGB and depth, while training on millions of samples.
We propose a Transformer-based modality-invariant VO approach that can deal with diverse or changing sensor suites of navigation agents.
- Score: 1.7188280334580197
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effectively localizing an agent in a realistic, noisy setting is crucial for
many embodied vision tasks. Visual Odometry (VO) is a practical substitute for
unreliable GPS and compass sensors, especially in indoor environments. While
SLAM-based methods show solid performance without large data requirements,
they are less flexible and robust w.r.t. noise and changes in the sensor
suite compared to learning-based approaches. Recent deep VO models, however,
limit themselves to a fixed set of input modalities, e.g., RGB and depth, while
training on millions of samples. When sensors fail, sensor suites change, or
modalities are intentionally looped out due to available resources, e.g., power
consumption, the models fail catastrophically. Furthermore, training these
models from scratch is even more expensive without simulator access or suitable
existing models that can be fine-tuned. While such scenarios are mostly ignored
in simulation, they commonly hinder a model's reusability in real-world
applications. We propose a Transformer-based modality-invariant VO approach
that can deal with diverse or changing sensor suites of navigation agents. Our
model outperforms previous methods while training on only a fraction of the
data. We hope this method opens the door to a broader range of real-world
applications that can benefit from flexible and learned VO models.
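The abstract's core idea, a single VO model that tolerates diverse or changing sensor suites, can be illustrated with a toy sketch. The names, dimensions, and simple attention pooling below are illustrative assumptions, not the paper's architecture: each available modality is tokenized into a shared embedding space, and an attention readout pools whatever tokens happen to be present, so dropping a modality changes the input set but not the interface.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32  # shared token width (illustrative)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical per-modality tokenizers: a linear map from a flattened
# observation to 4 tokens, plus a per-modality embedding.
tokenizers = {
    "rgb":   (0.05 * rng.standard_normal((48, 4 * D)), 0.05 * rng.standard_normal((1, D))),
    "depth": (0.05 * rng.standard_normal((16, 4 * D)), 0.05 * rng.standard_normal((1, D))),
}
query = rng.standard_normal(D)             # stand-in for a learned pose query
head = 0.05 * rng.standard_normal((D, 3))  # pooled token -> (dx, dy, dtheta)

def estimate_pose(observations):
    """Pose estimate from whichever modalities are present."""
    tokens = []
    for name, obs in observations.items():
        W, emb = tokenizers[name]
        tokens.append((obs @ W).reshape(4, D) + emb)
    T = np.concatenate(tokens)                 # variable-length token set
    attn = softmax(query @ T.T / np.sqrt(D))   # attention over all tokens
    return (attn @ T) @ head                   # shape (3,)

full = estimate_pose({"rgb": rng.standard_normal(48),
                      "depth": rng.standard_normal(16)})
rgb_only = estimate_pose({"rgb": rng.standard_normal(48)})
```

Training with random modality dropout would push the two calls toward consistent estimates; here the point is only that the interface survives a changed sensor suite.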
Related papers
- Self-Supervised Geometry-Guided Initialization for Robust Monocular Visual Odometry [9.79428015716139]
In this paper, we analyze major failure cases on outdoor benchmarks and expose shortcomings of a learning-based SLAM model (DROID-SLAM).
We propose the use of self-supervised priors leveraging a frozen large-scale pre-trained monocular depth estimation to initialize the dense bundle adjustment process.
Despite its simplicity, our proposed method demonstrates significant improvements on KITTI odometry, as well as the challenging DDAD benchmark.
arXiv Detail & Related papers (2024-06-03T01:59:29Z) - Model-aware reinforcement learning for high-performance Bayesian experimental design in quantum metrology [0.5461938536945721]
Quantum sensors offer control flexibility during estimation by allowing the experimenter to manipulate various parameters.
We introduce a versatile procedure capable of optimizing a wide range of problems in quantum metrology, estimation, and hypothesis testing.
We combine model-aware reinforcement learning (RL) with Bayesian estimation based on particle filtering.
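The combination of Bayesian estimation and particle filtering that this entry mentions can be sketched in a few lines. The toy problem below (estimating a frequency-like parameter from noisy sinusoidal measurements) and all constants are illustrative assumptions; the paper's actual setting is quantum metrology, with model-aware RL choosing the controls.

```python
import numpy as np

rng = np.random.default_rng(1)

true_omega = 1.3
sigma = 0.1
t = np.linspace(0.1, 4.0, 40)  # measurement times
y = np.sin(true_omega * t) + sigma * rng.standard_normal(t.size)

# Particle filter: particles are hypotheses about omega, weights are the
# (unnormalized) posterior probability of each hypothesis under the data.
particles = rng.uniform(0.0, 2.0, 2000)
log_w = np.zeros(particles.size)
for ti, yi in zip(t, y):
    log_w += -0.5 * ((yi - np.sin(particles * ti)) / sigma) ** 2
log_w -= log_w.max()  # numerical stability before exponentiating
w = np.exp(log_w)
w /= w.sum()

omega_est = np.sum(w * particles)  # posterior mean
```

A full implementation would resample particles when the effective sample size collapses and, in the model-aware RL setting, let a policy choose the next measurement based on the current particle cloud.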
arXiv Detail & Related papers (2023-12-28T12:04:15Z) - Value function estimation using conditional diffusion models for control [62.27184818047923]
We propose a simple algorithm called Diffused Value Function (DVF)
It learns a joint multi-step model of the environment-robot interaction dynamics using a diffusion model.
We show how DVF can be used to efficiently capture the state visitation measure for multiple controllers.
arXiv Detail & Related papers (2023-06-09T18:40:55Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose directing effort toward efficient adaptation of existing models, and augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
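The efficiency claim, training only a linear projection and a single prepended token, is easy to make concrete with a back-of-the-envelope sketch. The parameter counts and dimensions below are illustrative stand-ins, not eP-ALM's exact sizes:

```python
import numpy as np

rng = np.random.default_rng(2)

VIS_DIM, LM_DIM = 768, 1024                # illustrative encoder / LM widths
FROZEN_PARAMS = 350_000_000 + 86_000_000   # e.g. an LM plus a ViT, both frozen

# The only trainable pieces: one linear projection and one soft token.
proj_W = 0.02 * rng.standard_normal((VIS_DIM, LM_DIM))
proj_b = np.zeros(LM_DIM)
soft_token = 0.02 * rng.standard_normal(LM_DIM)

trainable = proj_W.size + proj_b.size + soft_token.size
fraction = trainable / (trainable + FROZEN_PARAMS)  # well under 1%

def augment(text_embeds, visual_feature):
    """Prepend the trainable token and the projected visual feature
    to the (frozen) LM's input embedding sequence."""
    vis = visual_feature @ proj_W + proj_b
    return np.vstack([soft_token, vis, text_embeds])

seq = augment(rng.standard_normal((5, LM_DIM)), rng.standard_normal(VIS_DIM))
```

With these stand-in sizes, under 0.2% of all parameters are trainable, consistent with the "freezing more than 99%" claim in the summary.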
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - Inference from Real-World Sparse Measurements [21.194357028394226]
Real-world problems often involve complex and unstructured sets of measurements, which occur when sensors are sparsely placed in either space or time.
Designing deep learning architectures that can process sets of measurements whose positions vary from set to set, and extract readouts at arbitrary locations, is methodologically difficult.
We propose an attention-based model focused on applicability and practical robustness, with two key design contributions.
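The "extract readouts anywhere" idea can be illustrated with the simplest attention-style mechanism: a kernel-weighted readout over a variable-size measurement set (Nadaraya-Watson regression, i.e. attention with a fixed distance kernel). The paper's model learns its attention; the kernel and data below are assumptions for illustration only.

```python
import numpy as np

# Sparse measurements: arbitrary positions, one value each.
positions = np.array([[0.1, 0.1], [0.9, 0.2], [0.5, 0.8], [0.2, 0.9]])
values = np.array([1.0, 2.0, 3.0, 0.5])

def readout(query_pos, tau=0.01):
    """Attention-style readout at an arbitrary query position:
    weights decay with squared distance to each measurement."""
    d2 = np.sum((positions - query_pos) ** 2, axis=1)
    w = np.exp(-d2 / tau)
    w /= w.sum()
    return w @ values

at_sensor = readout(np.array([0.1, 0.1]))         # on top of a measurement
between = readout(np.array([0.5, 0.5]), tau=0.5)  # smooth interpolation
```

Because the readout is a normalized weighted sum, it works for any number of measurements at any positions, which is the property the summary highlights.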
arXiv Detail & Related papers (2022-10-20T13:42:20Z) - EMA-VIO: Deep Visual-Inertial Odometry with External Memory Attention [5.144653418944836]
Visual-inertial odometry (VIO) algorithms exploit the information from camera and inertial sensors to estimate position and orientation.
Recent deep learning based VIO models attract attention as they provide pose information in a data-driven way.
We propose a novel learning based VIO framework with external memory attention that effectively and efficiently combines visual and inertial features for state estimation.
arXiv Detail & Related papers (2022-09-18T07:05:36Z) - Incremental Online Learning Algorithms Comparison for Gesture and Visual Smart Sensors [68.8204255655161]
This paper compares four state-of-the-art algorithms in two real applications: gesture recognition based on accelerometer data and image classification.
Our results confirm these systems' reliability and the feasibility of deploying them in tiny-memory MCUs.
arXiv Detail & Related papers (2022-09-01T17:05:20Z) - Can Deep Learning be Applied to Model-Based Multi-Object Tracking? [25.464269324261636]
Multi-object tracking (MOT) is the problem of tracking the state of an unknown and time-varying number of objects using noisy measurements.
Deep learning (DL) has been increasingly used in MOT for improving tracking performance.
In this paper, we propose a Transformer-based DL tracker and evaluate its performance in the model-based setting.
arXiv Detail & Related papers (2022-02-16T07:43:08Z) - VISTA 2.0: An Open, Data-driven Simulator for Multimodal Sensing and Policy Learning for Autonomous Vehicles [131.2240621036954]
We present VISTA, an open source, data-driven simulator that integrates multiple types of sensors for autonomous vehicles.
Using high fidelity, real-world datasets, VISTA represents and simulates RGB cameras, 3D LiDAR, and event-based cameras.
We demonstrate the ability to train and test perception-to-control policies across each of the sensor types and showcase the power of this approach via deployment on a full scale autonomous vehicle.
arXiv Detail & Related papers (2021-11-23T18:58:10Z) - Towards Optimal Strategies for Training Self-Driving Perception Models in Simulation [98.51313127382937]
We focus on the use of labels in the synthetic domain alone.
Our approach introduces both a way to learn neural-invariant representations and a theoretically inspired view on how to sample the data from the simulator.
We showcase our approach on the bird's-eye-view vehicle segmentation task with multi-sensor data.
arXiv Detail & Related papers (2021-11-15T18:37:43Z) - Deep Soft Procrustes for Markerless Volumetric Sensor Alignment [81.13055566952221]
In this work, we improve markerless data-driven correspondence estimation to achieve more robust multi-sensor spatial alignment.
We incorporate geometric constraints in an end-to-end manner into a typical segmentation based model and bridge the intermediate dense classification task with the targeted pose estimation one.
Our model is experimentally shown to achieve similar results with marker-based methods and outperform the markerless ones, while also being robust to the pose variations of the calibration structure.
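The classical building block behind this line of work, rigid alignment of two point sets by orthogonal Procrustes analysis (the Kabsch algorithm), can be written in a few lines of SVD. This is the textbook closed-form solution, not the paper's learned "soft" variant:

```python
import numpy as np

def kabsch(A, B):
    """Rigid transform (R, t) minimizing sum ||R @ a_i + t - b_i||^2
    over corresponding rows of A and B (both n x d)."""
    ca, cb = A.mean(axis=0), B.mean(axis=0)
    H = (A - ca).T @ (B - cb)          # d x d cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:           # correct an improper (reflecting) solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cb - R @ ca

# Recover a known rotation + translation from noiseless correspondences.
rng = np.random.default_rng(3)
theta = np.pi / 6
R_true = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
t_true = np.array([0.5, -1.0, 2.0])
A = rng.standard_normal((20, 3))
B = A @ R_true.T + t_true              # rows: b_i = R_true @ a_i + t_true
R_est, t_est = kabsch(A, B)
```

Roughly speaking, learned approaches like the one above replace hand-picked correspondences with dense predicted ones; once correspondences are fixed, alignment reduces to this closed-form solve.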
arXiv Detail & Related papers (2020-03-23T10:51:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.