Related papers: Causal Transformer for Fusion and Pose Estimation in Deep Visual Inertial Odometry

URL: http://arxiv.org/abs/2409.08769v1
Date: Fri, 13 Sep 2024 12:21:25 GMT
Title: Causal Transformer for Fusion and Pose Estimation in Deep Visual Inertial Odometry
Authors: Yunus Bilge Kurt, Ahmet Akman, A. Aydın Alatan
Abstract summary: We propose a causal visual-inertial fusion transformer (VIFT) for pose estimation in deep visual-inertial odometry. The proposed method is end-to-end trainable and requires only a monocular camera and IMU during inference.
Score: 1.2289361708127877
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In recent years, transformer-based architectures have become the de facto standard for sequence modeling in deep learning frameworks. Inspired by these successes, we propose a causal visual-inertial fusion transformer (VIFT) for pose estimation in deep visual-inertial odometry. This study aims to improve pose estimation accuracy by leveraging the attention mechanisms in transformers, which make better use of historical data than the recurrent neural network (RNN) based approaches used in recent methods. Transformers typically require large-scale data for training. To address this issue, we utilize inductive biases for deep VIO networks. Since latent visual-inertial feature vectors encompass essential information for pose estimation, we employ transformers to refine pose estimates by updating latent vectors temporally. Our study also examines the impact of data imbalance and rotation learning methods in supervised end-to-end learning of visual-inertial odometry by utilizing specialized gradients in backpropagation for the elements of the $SE(3)$ group. The proposed method is end-to-end trainable and requires only a monocular camera and IMU during inference. Experimental results demonstrate that VIFT increases the accuracy of monocular VIO networks, achieving state-of-the-art results when compared to previous methods on the KITTI dataset. The code will be made available at https://github.com/ybkurt/VIFT.

Related papers

PerFormer: A Permutation Based Vision Transformer for Remaining Useful Life Prediction [0.0]
We introduce the PerFormer, a permutation-based vision transformer approach designed to permute multivariate time series data. Our experiments on NASA's C-MAPSS dataset demonstrate the PerFormer's superior performance in RUL prediction.
arXiv Detail & Related papers (2025-05-30T21:49:10Z)
Open-Set Deepfake Detection: A Parameter-Efficient Adaptation Method with Forgery Style Mixture [58.60915132222421]
We introduce an approach that is both general and parameter-efficient for face forgery detection. We design a forgery-style mixture formulation that augments the diversity of forgery source domains. We show that the designed model achieves state-of-the-art generalizability with significantly reduced trainable parameters.
arXiv Detail & Related papers (2024-08-23T01:53:36Z)
Reverse Knowledge Distillation: Training a Large Model using a Small One for Retinal Image Matching on Limited Data [1.9521342770943706]
We propose a novel approach based on reverse knowledge distillation to train large models with limited data. We train a computationally heavier model based on a vision transformer encoder using the lighter CNN-based model. Our experiments suggest that high-dimensional fitting in representation space may prevent overfitting unlike training directly to match the final output.
arXiv Detail & Related papers (2023-07-20T08:39:20Z)
A Study on the Generality of Neural Network Structures for Monocular Depth Estimation [14.09373215954704]
We investigate in depth how various backbone networks generalize in monocular depth estimation. We evaluate state-of-the-art models on both in-distribution and out-of-distribution datasets. We observe that Transformers exhibit a strong shape bias, whereas CNNs have a strong texture bias.
arXiv Detail & Related papers (2023-01-09T04:58:12Z)
DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation. We propose to leverage the Transformer to model this global context with an effective attention mechanism. Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
Benchmarking Detection Transfer Learning with Vision Transformers [60.97703494764904]
The complexity of object detection methods can make benchmarking non-trivial when new architectures, such as Vision Transformer (ViT) models, arrive. We present training techniques that overcome these challenges, enabling the use of standard ViT models as the backbone of Mask R-CNN. Our results show that recent masking-based unsupervised learning methods may, for the first time, provide convincing transfer learning improvements on COCO.
arXiv Detail & Related papers (2021-11-22T18:59:15Z)
Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from "Vision-friendly Transformer". With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD). It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches. Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
On Robustness and Transferability of Convolutional Neural Networks [147.71743081671508]
Modern deep convolutional networks (CNNs) are often criticized for not generalizing under distributional shifts. We study the interplay between out-of-distribution and transfer performance of modern image classification CNNs for the first time. We find that increasing both the training set and model sizes significantly improves robustness to distributional shift.
arXiv Detail & Related papers (2020-07-16T18:39:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.