EffLoc: Lightweight Vision Transformer for Efficient 6-DOF Camera
Relocalization
- URL: http://arxiv.org/abs/2402.13537v1
- Date: Wed, 21 Feb 2024 05:26:17 GMT
- Title: EffLoc: Lightweight Vision Transformer for Efficient 6-DOF Camera
Relocalization
- Authors: Zhendong Xiao, Changhao Chen, Shan Yang, Wu Wei
- Abstract summary: We propose EffLoc, a novel efficient Vision Transformer for single-image camera relocalization.
EffLoc excels in efficiency and accuracy, outperforming prior methods such as AtLoc and MapNet.
It thrives in large-scale outdoor car-driving scenarios while remaining simple, end-to-end trainable, and free of handcrafted loss functions.
- Score: 12.980447668368274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Camera relocalization is pivotal in computer vision, with applications in AR,
drones, robotics, and autonomous driving. It estimates 3D camera position and
orientation (6-DoF) from images. Unlike traditional methods such as SLAM, recent
approaches use deep learning for direct end-to-end pose estimation. We propose
EffLoc, a novel efficient Vision Transformer for single-image camera
relocalization. EffLoc's hierarchical layout, memory-bound self-attention, and
feed-forward layers boost memory efficiency and inter-channel communication.
Our introduced sequential group attention (SGA) module enhances computational
efficiency by diversifying input features, reducing redundancy, and expanding
model capacity. EffLoc excels in efficiency and accuracy, outperforming prior
methods such as AtLoc and MapNet. It thrives in large-scale outdoor
car-driving scenarios while remaining simple, end-to-end trainable, and free
of handcrafted loss functions.
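The sequential group attention (SGA) module is only sketched in the abstract; one plausible reading is that the channel dimension is split into groups that are attended one after another, each group's output cascading into the next so later groups see refined, less redundant features. Below is a minimal NumPy sketch under that assumption (all names, shapes, and random projections are illustrative, not the authors' implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sequential_group_attention(x, num_groups=4, seed=0):
    """Toy sequential group attention: channels are split into groups; each
    group runs its own small self-attention, and its output is added to the
    next group's input so later groups refine earlier results."""
    n, c = x.shape                          # n tokens, c channels
    assert c % num_groups == 0
    g = c // num_groups                     # channels per group
    rng = np.random.default_rng(seed)
    outputs, carry = [], np.zeros((n, g))
    for i in range(num_groups):
        xi = x[:, i * g:(i + 1) * g] + carry        # cascade previous output in
        wq, wk, wv = (rng.standard_normal((g, g)) / np.sqrt(g) for _ in range(3))
        q, k, v = xi @ wq, xi @ wk, xi @ wv
        attn = softmax(q @ k.T / np.sqrt(g))        # (n, n) attention per group
        out = attn @ v
        outputs.append(out)
        carry = out                                 # feed the next group
    return np.concatenate(outputs, axis=1)          # (n, c), same shape as input
```

Because each attention operates on only `c / num_groups` channels, the per-group QKV projections and attention maps are much cheaper than one full-width attention, which is where the claimed redundancy reduction and efficiency gain would come from.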
Related papers
- Improved Single Camera BEV Perception Using Multi-Camera Training [4.003066044908734]
In large-scale production, cost efficiency is an optimization goal, so using fewer cameras becomes more relevant.
This raises the problem of developing a BEV perception model that delivers sufficient performance on a low-cost sensor setup.
Our approach aims to reduce the resulting performance drop as much as possible by using a modern multi-camera surround-view model reduced for single-camera inference.
arXiv Detail & Related papers (2024-09-04T13:06:40Z)
- VICAN: Very Efficient Calibration Algorithm for Large Camera Networks [49.17165360280794]
We introduce a novel methodology that extends Pose Graph Optimization techniques.
We consider the bipartite graph encompassing cameras, object poses evolving dynamically, and camera-object relative transformations at each time step.
Our framework retains compatibility with traditional PGO solvers, but its efficacy benefits from a custom-tailored optimization scheme.
arXiv Detail & Related papers (2024-03-25T17:47:03Z)
- AZTR: Aerial Video Action Recognition with Auto Zoom and Temporal Reasoning [63.628195002143734]
We propose a novel approach for aerial video action recognition.
Our method is designed for videos captured using UAVs and can run on edge or mobile devices.
We present a learning-based approach that uses customized auto zoom to automatically identify the human target and scale it appropriately.
arXiv Detail & Related papers (2023-03-02T21:24:19Z)
- A Flexible Framework for Virtual Omnidirectional Vision to Improve Operator Situation Awareness [2.817412580574242]
We present a flexible framework for virtual projections to increase situation awareness based on a novel method to fuse multiple cameras mounted anywhere on the robot.
We propose a complementary approach to improve scene understanding by fusing camera images and geometric 3D Lidar data to obtain a colorized point cloud.
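The camera-lidar fusion into a colorized point cloud follows a standard projection recipe: transform lidar points into the camera frame, project them with the intrinsics, and sample pixel colors at the projected locations. A minimal sketch assuming a pinhole model with known intrinsics `K` and extrinsics `T_cam_lidar` (identifiers are illustrative, not the paper's API):

```python
import numpy as np

def colorize_point_cloud(points_lidar, image, T_cam_lidar, K):
    """Toy camera-lidar fusion: project 3D lidar points into the image with
    a pinhole model and attach the pixel color to each point. Returns
    (kept_points, colors) for points landing inside the image bounds."""
    n = points_lidar.shape[0]
    pts_h = np.hstack([points_lidar, np.ones((n, 1))])   # homogeneous coords
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]           # into camera frame
    in_front = pts_cam[:, 2] > 0                         # drop points behind camera
    pts_cam = pts_cam[in_front]
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                          # perspective divide
    h, w = image.shape[:2]
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    colors = image[v[inside], u[inside]]                 # sample pixel colors
    return pts_cam[inside], colors
```

In practice the fused output also depends on time synchronization and motion compensation between the sensors, which this sketch omits.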
arXiv Detail & Related papers (2023-02-01T10:40:05Z)
- Skip-Attention: Improving Vision Transformers by Paying Less Attention [55.47058516775423]
Vision transformers (ViTs) use expensive self-attention operations in every layer.
We propose SkipAt, a method to reuse self-attention from preceding layers to approximate attention at one or more subsequent layers.
We show the effectiveness of our method in image classification and self-supervised learning on ImageNet-1K, semantic segmentation on ADE20K, image denoising on SIDD, and video denoising on DAVIS.
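The core idea of reusing self-attention from earlier layers can be sketched concretely: pay for softmax(QK^T) once, then let later layers apply only fresh value projections to the cached map. A minimal NumPy toy follows; the reuse schedule and names are illustrative assumptions, not SkipAt's actual parametric reuse function:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_reuse(x, layers=4, reuse_from=1, seed=0):
    """Toy attention reuse: layers up to `reuse_from` compute self-attention
    normally; later layers skip the QK product and reuse the last computed
    attention map, applying only a new value projection per layer."""
    n, d = x.shape
    rng = np.random.default_rng(seed)
    attn = None
    for layer in range(layers):
        wv = rng.standard_normal((d, d)) / np.sqrt(d)
        if layer <= reuse_from or attn is None:
            wq = rng.standard_normal((d, d)) / np.sqrt(d)
            wk = rng.standard_normal((d, d)) / np.sqrt(d)
            attn = softmax((x @ wq) @ (x @ wk).T / np.sqrt(d))  # fresh map
        # layers past `reuse_from` fall through and reuse `attn` unchanged
        x = x + attn @ (x @ wv)          # residual update with (re)used map
    return x
```

The saving is that the O(n^2 d) query-key product and softmax are computed only for the early layers, while the reused layers pay just the value projection and the O(n^2 d) weighted sum.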
arXiv Detail & Related papers (2023-01-05T18:59:52Z)
- Multi-Camera Calibration Free BEV Representation for 3D Object Detection [8.085831393926561]
We present a Multi-Camera Calibration Free Transformer (CFT) for robust Bird's Eye View (BEV) representation.
CFT mines potential 3D information in BEV via our designed position-aware enhancement (PA).
CFT achieves 49.7% NDS on the nuScenes detection task leaderboard, making it the first work to remove camera parameters.
arXiv Detail & Related papers (2022-10-31T12:18:08Z)
- Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling [125.95527079960725]
Transformer-based models have achieved top performance on major video recognition benchmarks.
Video Mobile-Former is the first Transformer-based video model which constrains the computational budget within 1G FLOPs.
arXiv Detail & Related papers (2022-08-25T17:59:00Z)
- Learned Queries for Efficient Local Attention [11.123272845092611]
The self-attention mechanism in vision transformers suffers from high latency and inefficient memory utilization.
We propose a new shift-invariant local attention layer, called query and attend (QnA), that aggregates the input locally in an overlapping manner.
We show improvements in speed and memory complexity while achieving comparable accuracy with state-of-the-art models.
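As one hedged reading of the QnA idea, a small set of learned query vectors replaces per-token queries and attends over overlapping local windows, making the layer shift-invariant. A minimal single-query NumPy sketch (the window handling, scaling, and names are illustrative, not the paper's exact layer):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def learned_query_local_attention(x, learned_q, window=3):
    """Toy local attention with a learned query: instead of computing a
    query per token, one shared learned vector attends over each token's
    overlapping local window, acting like a learned pooling operator."""
    n, d = x.shape
    half = window // 2
    out = np.zeros_like(x)
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        k = v = x[lo:hi]                           # keys/values from the window
        w = softmax(learned_q @ k.T / np.sqrt(d))  # shared query, local scores
        out[i] = w @ v                             # weighted window aggregate
    return out
```

Because the query is learned rather than input-dependent, the per-window cost drops from a full QKV attention to a single dot-product score per window element, which is consistent with the speed and memory claims.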
arXiv Detail & Related papers (2021-12-21T18:52:33Z)
- CNN-based Omnidirectional Object Detection for HermesBot Autonomous Delivery Robot with Preliminary Frame Classification [53.56290185900837]
We propose an algorithm for optimizing a neural network for object detection using preliminary binary frame classification.
An autonomous mobile robot with 6 rolling-shutter cameras on the perimeter providing a 360-degree field of view was used as the experimental setup.
arXiv Detail & Related papers (2021-10-22T15:05:37Z)
- TransCamP: Graph Transformer for 6-DoF Camera Pose Estimation [77.09542018140823]
We propose a neural network approach with a graph transformer backbone, namely TransCamP, to address the camera relocalization problem.
TransCamP effectively fuses the image features, camera pose information and inter-frame relative camera motions into encoded graph attributes.
arXiv Detail & Related papers (2021-05-28T19:08:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.