EffLoc: Lightweight Vision Transformer for Efficient 6-DOF Camera Relocalization
- URL: http://arxiv.org/abs/2402.13537v1
- Date: Wed, 21 Feb 2024 05:26:17 GMT
- Authors: Zhendong Xiao, Changhao Chen, Shan Yang, Wu Wei
- Abstract summary: We propose EffLoc, a novel efficient Vision Transformer for single-image camera relocalization.
EffLoc excels in efficiency and accuracy, outperforming prior methods, such as AtLoc and MapNet.
It performs well in large-scale outdoor car-driving scenarios while remaining simple and end-to-end trainable, with no handcrafted loss functions.
- Score: 12.980447668368274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Camera relocalization is pivotal in computer vision, with applications in AR,
drones, robotics, and autonomous driving. It estimates 3D camera position and
orientation (6-DoF) from images. Unlike traditional methods such as SLAM, recent
approaches use deep learning for direct end-to-end pose estimation. We propose
EffLoc, a novel efficient Vision Transformer for single-image camera
relocalization. EffLoc's hierarchical layout, memory-bound self-attention, and
feed-forward layers boost memory efficiency and inter-channel communication.
Our introduced sequential group attention (SGA) module enhances computational
efficiency by diversifying input features, reducing redundancy, and expanding
model capacity. EffLoc excels in efficiency and accuracy, outperforming prior
methods, such as AtLoc and MapNet. It performs well in large-scale outdoor
car-driving scenarios while remaining simple and end-to-end trainable, with no
handcrafted loss functions.
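The abstract describes the task as estimating 3D camera position and orientation (6-DoF) from an image. As a minimal sketch only, and not taken from the EffLoc paper, a pose can be represented as a translation vector plus a unit quaternion, and a prediction can be scored by Euclidean distance and angular difference. The function names and the (w, x, y, z) quaternion layout below are assumptions made for illustration.

```python
import math


def normalize(q):
    """Scale a quaternion (w, x, y, z) to unit length."""
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)


def translation_error(t_pred, t_true):
    """Euclidean distance between predicted and true camera positions."""
    return math.dist(t_pred, t_true)


def rotation_error_deg(q_pred, q_true):
    """Angle in degrees between two orientations given as quaternions.

    The absolute value of the dot product is used because q and -q
    describe the same rotation.
    """
    q_pred, q_true = normalize(q_pred), normalize(q_true)
    dot = abs(sum(a * b for a, b in zip(q_pred, q_true)))
    dot = min(1.0, dot)  # guard against rounding slightly above 1
    return math.degrees(2.0 * math.acos(dot))


# Hypothetical example poses: identity vs. a quarter turn about the z-axis.
identity = (1.0, 0.0, 0.0, 0.0)
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
position_gap = translation_error((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))
angle_gap = rotation_error_deg(identity, quarter_turn_z)
```

Comparing the identity orientation with a quarter turn about the z-axis gives an angular error of roughly 90 degrees, and the two example positions are 5 units apart.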
Related papers
- VideoLifter: Lifting Videos to 3D with Fast Hierarchical Stereo Alignment [62.6737516863285]
VideoLifter is a novel framework that incrementally optimizes a globally sparse-to-dense 3D representation directly from video sequences.
By tracking and propagating sparse point correspondences across frames and fragments, VideoLifter incrementally refines camera poses and 3D structure.
This approach significantly accelerates the reconstruction process, reducing training time by over 82% while surpassing current state-of-the-art methods in visual fidelity and computational efficiency.
(2025-01-03T18:52:36Z)
- Hierarchical Information Flow for Generalized Efficient Image Restoration [108.83750852785582]
We propose a hierarchical information flow mechanism for image restoration, dubbed Hi-IR.
Hi-IR constructs a hierarchical information tree representing the degraded image across three levels.
In seven common image restoration tasks, Hi-IR demonstrates its effectiveness and generalizability.
(2024-11-27T18:30:08Z)
- Improved Single Camera BEV Perception Using Multi-Camera Training [4.003066044908734]
In large-scale production, cost efficiency is an optimization goal, which makes using fewer cameras more relevant.
This raises the problem of developing a BEV perception model that provides sufficient performance on a low-cost sensor setup.
The objective of our approach is to reduce the aforementioned performance drop as much as possible using a modern multi-camera surround view model reduced for single-camera inference.
(2024-09-04T13:06:40Z)
- VICAN: Very Efficient Calibration Algorithm for Large Camera Networks [49.17165360280794]
We introduce a novel methodology that extends Pose Graph Optimization techniques.
We consider the bipartite graph encompassing cameras, object poses evolving dynamically, and camera-object relative transformations at each time step.
Our framework retains compatibility with traditional PGO solvers, but its efficacy benefits from a custom-tailored optimization scheme.
(2024-03-25T17:47:03Z)
- EfficientMorph: Parameter-Efficient Transformer-Based Architecture for 3D Image Registration [1.741980945827445]
We present EfficientMorph, a transformer-based architecture for unsupervised 3D image registration.
EfficientMorph balances local and global attention in 3D volumes through a plane-based attention mechanism and employs a Hi-Res tokenization strategy with merging operations.
(2024-03-16T22:01:55Z)
- AZTR: Aerial Video Action Recognition with Auto Zoom and Temporal Reasoning [63.628195002143734]
We propose a novel approach for aerial video action recognition.
Our method is designed for videos captured using UAVs and can run on edge or mobile devices.
We present a learning-based approach that uses customized auto zoom to automatically identify the human target and scale it appropriately.
(2023-03-02T21:24:19Z)
- A Flexible Framework for Virtual Omnidirectional Vision to Improve Operator Situation Awareness [2.817412580574242]
We present a flexible framework for virtual projections to increase situation awareness based on a novel method to fuse multiple cameras mounted anywhere on the robot.
We propose a complementary approach to improve scene understanding by fusing camera images and geometric 3D Lidar data to obtain a colorized point cloud.
(2023-02-01T10:40:05Z)
- Learned Queries for Efficient Local Attention [11.123272845092611]
The self-attention mechanism in vision transformers suffers from high latency and inefficient memory utilization.
We propose a new shift-invariant local attention layer, called query and attend (QnA), that aggregates the input locally in an overlapping manner.
We show improvements in speed and memory complexity while achieving comparable accuracy with state-of-the-art models.
(2021-12-21T18:52:33Z)
- CNN-based Omnidirectional Object Detection for HermesBot Autonomous Delivery Robot with Preliminary Frame Classification [53.56290185900837]
We propose an algorithm for optimizing a neural network for object detection using preliminary binary frame classification.
An autonomous mobile robot with 6 rolling-shutter cameras on the perimeter providing a 360-degree field of view was used as the experimental setup.
(2021-10-22T15:05:37Z)
- TransCamP: Graph Transformer for 6-DoF Camera Pose Estimation [77.09542018140823]
We propose a neural network approach with a graph transformer backbone, namely TransCamP, to address the camera relocalization problem.
TransCamP effectively fuses the image features, camera pose information and inter-frame relative camera motions into encoded graph attributes.
(2021-05-28T19:08:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.