TransCamP: Graph Transformer for 6-DoF Camera Pose Estimation
- URL: http://arxiv.org/abs/2105.14065v1
- Date: Fri, 28 May 2021 19:08:43 GMT
- Title: TransCamP: Graph Transformer for 6-DoF Camera Pose Estimation
- Authors: Xinyi Li, Haibin Ling
- Abstract summary: We propose a neural network approach with a graph transformer backbone, namely TransCamP, to address the camera relocalization problem.
TransCamP effectively fuses the image features, camera pose information and inter-frame relative camera motions into encoded graph attributes.
- Score: 77.09542018140823
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Camera pose estimation or camera relocalization is the centerpiece in
numerous computer vision tasks such as visual odometry, structure from motion
(SfM) and SLAM. In this paper we propose a neural network approach with a graph
transformer backbone, namely TransCamP, to address the camera relocalization
problem. In contrast with prior work where the pose regression is mainly guided
by photometric consistency, TransCamP effectively fuses the image features,
camera pose information and inter-frame relative camera motions into encoded
graph attributes and is trained towards the graph consistency and accuracy
instead, yielding significantly higher computational efficiency. By leveraging
graph transformer layers with edge features and a tensorized adjacency
matrix, TransCamP dynamically captures global attention and thus endows the
pose graph with evolving structures to achieve improved robustness and
accuracy. In addition, optional temporal transformer layers actively enhance
the spatiotemporal inter-frame relation for sequential inputs. Evaluation of
the proposed network on various public benchmarks demonstrates that TransCamP
outperforms state-of-the-art approaches.
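The abstract's core mechanism, graph transformer layers whose attention is biased by encoded edge attributes (e.g. inter-frame relative camera motions), can be sketched as a single-head attention layer in NumPy. This is a minimal illustration of the general idea, not the authors' implementation; the function name, the scalar edge bias, and the projection matrices are assumptions of the sketch.

```python
import numpy as np

def graph_attention_with_edges(X, E, A, Wq, Wk, Wv, We):
    """One simplified graph-attention layer where edge features bias the
    attention scores (a sketch of the idea in TransCamP, not the paper's
    exact formulation).

    X : (n, d)    node features (e.g. per-frame image features)
    E : (n, n, d) edge features (e.g. relative camera motions)
    A : (n, n)    adjacency matrix (nonzero = edge present)
    """
    d = Wq.shape[1]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Project each edge feature to a scalar bias on the attention score.
    edge_bias = (E @ We).squeeze(-1)             # (n, n)
    scores = (Q @ K.T) / np.sqrt(d) + edge_bias
    scores = np.where(A > 0, scores, -1e9)       # mask non-edges
    # Row-wise softmax over neighbors.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ V                           # (n, d) updated node features
```

Because the adjacency matrix enters only as a mask on the score tensor, its structure can change between layers or time steps, which is one way to read the abstract's "evolving structures" claim.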
Related papers
- VICAN: Very Efficient Calibration Algorithm for Large Camera Networks [49.17165360280794]
We introduce a novel methodology that extends Pose Graph Optimization techniques.
We consider the bipartite graph encompassing cameras, object poses evolving dynamically, and camera-object relative transformations at each time step.
Our framework retains compatibility with traditional PGO solvers, but its efficacy benefits from a custom-tailored optimization scheme.
arXiv Detail & Related papers (2024-03-25T17:47:03Z)
- Transformer-based Fusion of 2D-pose and Spatio-temporal Embeddings for Distracted Driver Action Recognition [8.841708075914353]
Temporal localization of driving actions over time is important for advanced driver-assistance systems and naturalistic driving studies.
We aim to improve temporal localization and classification accuracy by adapting video action recognition and 2D human pose estimation networks into a single model.
The model performs well on the A2 test set of the 2023 NVIDIA AI City Challenge for naturalistic driving action recognition.
arXiv Detail & Related papers (2024-03-11T10:26:38Z)
- Automated Camera Calibration via Homography Estimation with GNNs [8.786192891436686]
Governments and local administrations are increasingly relying on the data collected from cameras to enhance road safety and optimize traffic conditions.
It is imperative to ensure accurate and automated calibration of the involved cameras.
This paper proposes a novel approach to address this challenge by leveraging the topological structure of intersections.
arXiv Detail & Related papers (2023-11-05T08:45:26Z)
- Alignment-free HDR Deghosting with Semantics Consistent Transformer [76.91669741684173]
High dynamic range imaging aims to retrieve information from multiple low-dynamic range inputs to generate realistic output.
Existing methods often focus on the spatial misalignment across input frames caused by the foreground and/or camera motion.
We propose a novel alignment-free network with a Semantics Consistent Transformer (SCTNet) with both spatial and channel attention modules.
arXiv Detail & Related papers (2023-05-29T15:03:23Z)
- Cross-Attention Transformer for Video Interpolation [3.5317804902980527]
TAIN (Transformers and Attention for video INterpolation) aims to interpolate an intermediate frame given two consecutive image frames around it.
We first present a novel visual transformer module, named Cross-Similarity (CS), to globally aggregate input image features with similar appearance as those of the predicted frame.
To account for occlusions in the CS features, we propose an Image Attention (IA) module to allow the network to focus on CS features from one frame over those of the other.
arXiv Detail & Related papers (2022-07-08T21:38:54Z)
- Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in the low illumination indoor scene.
arXiv Detail & Related papers (2022-03-20T02:59:51Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the advantages of leveraging detailed spatial information from CNN and the global context provided by transformer for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- Homography Decomposition Networks for Planar Object Tracking [11.558401177707312]
Planar object tracking plays an important role in AI applications, such as robotics, visual servoing, and visual SLAM.
We propose a novel Homography Decomposition Networks (HDN) approach that drastically reduces and stabilizes the condition number by decomposing the homography transformation into two groups.
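The decomposition idea, splitting a homography into a well-conditioned similarity part and a residual part, can be illustrated with a toy factorization. The paper's exact two-group decomposition may differ; `decompose_homography` and the SVD-based similarity fit below are assumptions of this sketch.

```python
import numpy as np

def decompose_homography(H):
    """Toy split of a 3x3 homography into a similarity part and a
    residual part, H = H_sim @ H_res, in the spirit of HDN's two-group
    decomposition (not the paper's exact factorization).

    H_sim carries isotropic scale, rotation, and translation; H_res
    absorbs the remaining shear/perspective terms.
    """
    H = H / H[2, 2]                      # normalize the homography
    A = H[:2, :2]
    # Nearest rotation to the 2x2 block via SVD, with a proper-rotation fix.
    U, s, Vt = np.linalg.svd(A)
    d_sign = np.sign(np.linalg.det(U) * np.linalg.det(Vt))
    R = U @ np.diag([1.0, d_sign]) @ Vt
    scale = s.mean()                     # isotropic scale estimate
    H_sim = np.eye(3)
    H_sim[:2, :2] = scale * R
    H_sim[:2, 2] = H[:2, 2]
    H_res = np.linalg.inv(H_sim) @ H     # residual so that H_sim @ H_res == H
    return H_sim, H_res
```

Estimating the well-conditioned similarity separately from the residual is what makes the overall problem numerically more stable than regressing all eight homography parameters at once.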
arXiv Detail & Related papers (2021-12-15T06:13:32Z)
- XCiT: Cross-Covariance Image Transformers [73.33400159139708]
We propose a "transposed" version of self-attention that operates across feature channels rather than tokens.
The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images.
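A minimal single-head sketch of cross-covariance attention: the attention map is a d x d matrix over feature channels rather than an n x n matrix over tokens, so the cost grows linearly with the number of tokens n. This simplification omits XCiT's multi-head split and learned temperature.

```python
import numpy as np

def cross_covariance_attention(X, Wq, Wk, Wv):
    """Simplified cross-covariance attention (XCA).

    X : (n, d) token features. The attention map is (d, d), built from
    the channel cross-covariance of L2-normalized queries and keys, so
    no n x n token-attention matrix is ever formed.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # L2-normalize each channel along the token axis, as in XCiT.
    Qn = Q / (np.linalg.norm(Q, axis=0, keepdims=True) + 1e-6)
    Kn = K / (np.linalg.norm(K, axis=0, keepdims=True) + 1e-6)
    attn = Qn.T @ Kn                     # (d, d) channel attention map
    # Row-wise softmax over channels.
    attn = np.exp(attn - attn.max(axis=-1, keepdims=True))
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return V @ attn.T                    # (n, d); mixes channels, not tokens
```

Because the (d, d) map is independent of n, doubling the number of tokens (e.g. a higher-resolution image) only doubles the cost of the matrix products, instead of quadrupling it as in standard token self-attention.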
arXiv Detail & Related papers (2021-06-17T17:33:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.