Activating Self-Attention for Multi-Scene Absolute Pose Regression
- URL: http://arxiv.org/abs/2411.01443v2
- Date: Mon, 18 Nov 2024 02:45:17 GMT
- Title: Activating Self-Attention for Multi-Scene Absolute Pose Regression
- Authors: Miso Lee, Jihwan Kim, Jae-Pil Heo,
- Abstract summary: Multi-scene absolute pose regression addresses the demand for fast and memory-efficient camera pose estimation.
transformer encoders are underutilized due to the collapsed self-attention map.
We present an auxiliary loss that aligns queries and keys, preventing the distortion of query-key space.
- Score: 21.164101507575186
- License:
- Abstract: Multi-scene absolute pose regression addresses the demand for fast and memory-efficient camera pose estimation across various real-world environments. Nowadays, transformer-based model has been devised to regress the camera pose directly in multi-scenes. Despite its potential, transformer encoders are underutilized due to the collapsed self-attention map, having low representation capacity. This work highlights the problem and investigates it from a new perspective: distortion of query-key embedding space. Based on the statistical analysis, we reveal that queries and keys are mapped in completely different spaces while only a few keys are blended into the query region. This leads to the collapse of the self-attention map as all queries are considered similar to those few keys. Therefore, we propose simple but effective solutions to activate self-attention. Concretely, we present an auxiliary loss that aligns queries and keys, preventing the distortion of query-key space and encouraging the model to find global relations by self-attention. In addition, the fixed sinusoidal positional encoding is adopted instead of undertrained learnable one to reflect appropriate positional clues into the inputs of self-attention. As a result, our approach resolves the aforementioned problem effectively, thus outperforming existing methods in both outdoor and indoor scenes.
Related papers
- ADen: Adaptive Density Representations for Sparse-view Camera Pose Estimation [17.097170273209333]
Recovering camera poses from a set of images is a foundational task in 3D computer vision.
Recent data-driven approaches aim to directly output camera poses, either through regressing the 6DoF camera poses or formulating rotation as a probability distribution.
We propose ADen to unify the two frameworks by employing a generator and a discriminator.
arXiv Detail & Related papers (2024-08-16T22:45:46Z) - GLACE: Global Local Accelerated Coordinate Encoding [66.87005863868181]
Scene coordinate regression methods are effective in small-scale scenes but face significant challenges in large-scale scenes.
We propose GLACE, which integrates pre-trained global and local encodings and enables SCR to scale to large scenes with only a single small-sized network.
Our method achieves state-of-the-art results on large-scale scenes with a low-map-size model.
arXiv Detail & Related papers (2024-06-06T17:59:50Z) - DiffUHaul: A Training-Free Method for Object Dragging in Images [78.93531472479202]
We propose a training-free method, dubbed DiffUHaul, for the object dragging task.
We first apply attention masking in each denoising step to make the generation more disentangled across different objects.
In the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance.
arXiv Detail & Related papers (2024-06-03T17:59:53Z) - Learning Robust Multi-Scale Representation for Neural Radiance Fields
from Unposed Images [65.41966114373373]
We present an improved solution to the neural image-based rendering problem in computer vision.
The proposed approach could synthesize a realistic image of the scene from a novel viewpoint at test time.
arXiv Detail & Related papers (2023-11-08T08:18:23Z) - Rethinking Query-Key Pairwise Interactions in Vision Transformers [5.141895475956681]
We propose key-only attention, which excludes query-key pairwise interactions and uses a compute-efficient saliency-gate to obtain attention weights.
We develop a new self-attention model family, LinGlos, which reach state-of-the-art accuracies on the parameter-limited setting of ImageNet classification benchmark.
arXiv Detail & Related papers (2022-07-01T03:36:49Z) - KeypointNeRF: Generalizing Image-based Volumetric Avatars using Relative
Spatial Encoding of Keypoints [28.234772596912165]
We propose a highly effective approach to modeling high-fidelity volumetric avatars from sparse views.
One of the key ideas is to encode relative spatial 3D information via sparse 3D keypoints.
Our experiments show that a majority of errors in prior work stem from an inappropriate choice of spatial encoding.
arXiv Detail & Related papers (2022-05-10T15:57:03Z) - Bottom-Up Human Pose Estimation Via Disentangled Keypoint Regression [81.05772887221333]
We study the dense keypoint regression framework that is previously inferior to the keypoint detection and grouping framework.
We present a simple yet effective approach, named disentangled keypoint regression (DEKR)
We empirically show that the proposed direct regression method outperforms keypoint detection and grouping methods.
arXiv Detail & Related papers (2021-04-06T05:54:46Z) - Paying Attention to Activation Maps in Camera Pose Regression [4.232614032390374]
Camera pose regression methods apply a single forward pass to the query image to estimate the camera pose.
We propose an attention-based approach for pose regression, where the convolutional activation maps are used as sequential inputs.
Our proposed approach is shown to compare favorably to contemporary pose regressors schemes and achieves state-of-the-art accuracy across multiple benchmarks.
arXiv Detail & Related papers (2021-03-21T20:10:15Z) - Pose Guided Person Image Generation with Hidden p-Norm Regression [113.41144529452663]
We propose a novel approach to solve the pose guided person image generation task.
Our method estimates a pose-invariant feature matrix for each identity, and uses it to predict the target appearance conditioned on the target pose.
Our method yields competitive performance in all the aforementioned variant scenarios.
arXiv Detail & Related papers (2021-02-19T17:03:54Z) - Self-supervised Human Detection and Segmentation via Multi-view
Consensus [116.92405645348185]
We propose a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training.
We show that our approach outperforms state-of-the-art self-supervised person detection and segmentation techniques on images that visually depart from those of standard benchmarks.
arXiv Detail & Related papers (2020-12-09T15:47:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.