Related papers: EffLoc: Lightweight Vision Transformer for Efficient 6-DOF Camera Relocalization

URL: http://arxiv.org/abs/2402.13537v1
Date: Wed, 21 Feb 2024 05:26:17 GMT
Title: EffLoc: Lightweight Vision Transformer for Efficient 6-DOF Camera Relocalization
Authors: Zhendong Xiao, Changhao Chen, Shan Yang, Wu Wei
Abstract summary: We propose EffLoc, a novel efficient Vision Transformer for single-image camera relocalization. EffLoc excels in efficiency and accuracy, outperforming prior methods such as AtLoc and MapNet. It thrives in large-scale outdoor car-driving scenarios, ensuring simplicity, end-to-end trainability, and eliminating handcrafted loss functions.
Score: 12.980447668368274
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Camera relocalization is pivotal in computer vision, with applications in AR, drones, robotics, and autonomous driving. It estimates 3D camera position and orientation (6-DoF) from images. Unlike traditional methods like SLAM, recent strides use deep learning for direct end-to-end pose estimation. We propose EffLoc, a novel efficient Vision Transformer for single-image camera relocalization. EffLoc's hierarchical layout, memory-bound self-attention, and feed-forward layers boost memory efficiency and inter-channel communication. Our introduced sequential group attention (SGA) module enhances computational efficiency by diversifying input features, reducing redundancy, and expanding model capacity. EffLoc excels in efficiency and accuracy, outperforming prior methods such as AtLoc and MapNet. It thrives in large-scale outdoor car-driving scenarios, ensuring simplicity, end-to-end trainability, and eliminating handcrafted loss functions.
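
Illustration (not from the paper): the abstract describes 6-DoF relocalization as estimating a camera's 3D position and orientation. A minimal sketch of how a predicted pose might be compared against a ground-truth pose is given below, assuming the pose is stored as a translation vector plus a unit quaternion in (w, x, y, z) order. The function names and the choice of error measures are assumptions made for this example; the abstract does not state how EffLoc is evaluated.

```python
import math


def normalize(q):
    """Scale a quaternion (w, x, y, z) to unit length."""
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)


def position_error(t_pred, t_true):
    """Euclidean distance between predicted and true camera positions."""
    return math.dist(t_pred, t_true)


def orientation_error_deg(q_pred, q_true):
    """Angle in degrees of the rotation separating two orientations.

    The absolute value of the dot product is used because q and -q
    describe the same rotation.
    """
    q1, q2 = normalize(q_pred), normalize(q_true)
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    dot = min(1.0, dot)  # guard against rounding slightly above 1
    return math.degrees(2.0 * math.acos(dot))


# Hypothetical poses: the prediction is offset by (3, 4, 0) and rotated
# 90 degrees about the z-axis relative to the ground truth.
true_t, true_q = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0)
half_angle = math.radians(45.0)
pred_t = (3.0, 4.0, 0.0)
pred_q = (math.cos(half_angle), 0.0, 0.0, math.sin(half_angle))

pos_err = position_error(pred_t, true_t)
rot_err = orientation_error_deg(pred_q, true_q)
print(f"position error: {pos_err:.2f}, orientation error: {rot_err:.2f} deg")
```

Position and orientation errors are kept separate here because they carry different units; how the two are weighted during training is a design decision this sketch does not address.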

Related papers

FRAME: Floor-aligned Representation for Avatar Motion from Egocentric Video [52.33896173943054]
Egocentric motion capture with a head-mounted body-facing stereo camera is crucial for VR and AR applications. Existing methods rely on synthetic pretraining and struggle to generate smooth and accurate predictions in real-world settings. We propose FRAME, a simple yet effective architecture that combines device pose and camera feeds for state-of-the-art body pose prediction.
arXiv Detail & Related papers (2025-03-29T14:26:06Z)
VideoLifter: Lifting Videos to 3D with Fast Hierarchical Stereo Alignment [63.21396416244634]
VideoLifter is a novel video-to-3D pipeline that leverages a local-to-global strategy on a fragment basis. It significantly accelerates the reconstruction process, reducing training time by over 82% while holding better visual quality than current SOTA methods.
arXiv Detail & Related papers (2025-01-03T18:52:36Z)
Hierarchical Information Flow for Generalized Efficient Image Restoration [108.83750852785582]
We propose a hierarchical information flow mechanism for image restoration, dubbed Hi-IR. Hi-IR constructs a hierarchical information tree representing the degraded image across three levels. In seven common image restoration tasks, Hi-IR demonstrates its effectiveness and generalizability.
arXiv Detail & Related papers (2024-11-27T18:30:08Z)
Improved Single Camera BEV Perception Using Multi-Camera Training [4.003066044908734]
In large-scale production, cost efficiency is an optimization goal, so using fewer cameras becomes more relevant. This raises the problem of developing a BEV perception model that provides sufficient performance on a low-cost sensor setup. The objective of our approach is to reduce the aforementioned performance drop as much as possible using a modern multi-camera surround view model reduced for single-camera inference.
arXiv Detail & Related papers (2024-09-04T13:06:40Z)
VICAN: Very Efficient Calibration Algorithm for Large Camera Networks [49.17165360280794]
We introduce a novel methodology that extends Pose Graph Optimization techniques. We consider the bipartite graph encompassing cameras, object poses evolving dynamically, and camera-object relative transformations at each time step. Our framework retains compatibility with traditional PGO solvers, but its efficacy benefits from a custom-tailored optimization scheme.
arXiv Detail & Related papers (2024-03-25T17:47:03Z)
EfficientMorph: Parameter-Efficient Transformer-Based Architecture for 3D Image Registration [1.741980945827445]
We present EfficientMorph, a transformer-based architecture for unsupervised 3D image registration. EfficientMorph balances local and global attention in 3D volumes through a plane-based attention mechanism and employs a Hi-Res tokenization strategy with merging operations.
arXiv Detail & Related papers (2024-03-16T22:01:55Z)
AZTR: Aerial Video Action Recognition with Auto Zoom and Temporal Reasoning [63.628195002143734]
We propose a novel approach for aerial video action recognition. Our method is designed for videos captured using UAVs and can run on edge or mobile devices. We present a learning-based approach that uses customized auto zoom to automatically identify the human target and scale it appropriately.
arXiv Detail & Related papers (2023-03-02T21:24:19Z)
A Flexible Framework for Virtual Omnidirectional Vision to Improve Operator Situation Awareness [2.817412580574242]
We present a flexible framework for virtual projections to increase situation awareness based on a novel method to fuse multiple cameras mounted anywhere on the robot. We propose a complementary approach to improve scene understanding by fusing camera images and geometric 3D Lidar data to obtain a colorized point cloud.
arXiv Detail & Related papers (2023-02-01T10:40:05Z)
Skip-Attention: Improving Vision Transformers by Paying Less Attention [55.47058516775423]
Vision transformers (ViTs) use computationally expensive self-attention operations in every layer. We propose SkipAt, a method to reuse self-attention from preceding layers to approximate attention at one or more subsequent layers. We show the effectiveness of our method in image classification and self-supervised learning on ImageNet-1K, semantic segmentation on ADE20K, image denoising on SIDD, and video denoising on DAVIS.
arXiv Detail & Related papers (2023-01-05T18:59:52Z)
Multi-Camera Calibration Free BEV Representation for 3D Object Detection [8.085831393926561]
We present a completely Multi-Camera Calibration Free Transformer (CFT) for robust Bird's Eye View (BEV) representation. CFT mines potential 3D information in BEV via our designed position-aware enhancement (PA). CFT achieves 49.7% NDS on the nuScenes detection task leaderboard and is the first work to remove camera parameters.
arXiv Detail & Related papers (2022-10-31T12:18:08Z)
Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling [125.95527079960725]
Transformer-based models have achieved top performance on major video recognition benchmarks. Video Mobile-Former is the first Transformer-based video model which constrains the computational budget within 1G FLOPs.
arXiv Detail & Related papers (2022-08-25T17:59:00Z)
Learned Queries for Efficient Local Attention [11.123272845092611]
Self-attention mechanism in vision transformers suffers from high latency and inefficient memory utilization. We propose a new shift-invariant local attention layer, called query and attend (QnA), that aggregates the input locally in an overlapping manner. We show improvements in speed and memory complexity while achieving comparable accuracy with state-of-the-art models.
arXiv Detail & Related papers (2021-12-21T18:52:33Z)
CNN-based Omnidirectional Object Detection for HermesBot Autonomous Delivery Robot with Preliminary Frame Classification [53.56290185900837]
We propose an algorithm for optimizing a neural network for object detection using preliminary binary frame classification. An autonomous mobile robot with 6 rolling-shutter cameras on the perimeter providing a 360-degree field of view was used as the experimental setup.
arXiv Detail & Related papers (2021-10-22T15:05:37Z)
TransCamP: Graph Transformer for 6-DoF Camera Pose Estimation [77.09542018140823]
We propose a neural network approach with a graph transformer backbone, namely TransCamP, to address the camera relocalization problem. TransCamP effectively fuses the image features, camera pose information and inter-frame relative camera motions into encoded graph attributes.
arXiv Detail & Related papers (2021-05-28T19:08:43Z)

This list is automatically generated from the titles and abstracts of the papers on this site.