Coarse-to-Fine Multi-Scene Pose Regression with Transformers
- URL: http://arxiv.org/abs/2308.11783v1
- Date: Tue, 22 Aug 2023 20:43:31 GMT
- Title: Coarse-to-Fine Multi-Scene Pose Regression with Transformers
- Authors: Yoli Shavit, Ron Ferens, Yosi Keller
- Abstract summary: A convolutional backbone with a multi-layer perceptron (MLP) head is trained using images and pose labels to embed a single reference scene at a time.
We propose to learn multi-scene absolute camera pose regression with Transformers, where encoders are used to aggregate activation maps with self-attention.
Our method is evaluated on commonly benchmarked indoor and outdoor datasets and is shown to exceed both multi-scene and state-of-the-art single-scene absolute pose regressors.
- Score: 19.927662512903915
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Absolute camera pose regressors estimate the position and orientation of a
camera given the captured image alone. Typically, a convolutional backbone with
a multi-layer perceptron (MLP) head is trained using images and pose labels to
embed a single reference scene at a time. Recently, this scheme was extended to
learn multiple scenes by replacing the MLP head with a set of fully connected
layers. In this work, we propose to learn multi-scene absolute camera pose
regression with Transformers, where encoders are used to aggregate activation
maps with self-attention and decoders transform latent features and scenes
encoding into pose predictions. This allows our model to focus on general
features that are informative for localization, while embedding multiple scenes
in parallel. We extend our previous MS-Transformer approach
\cite{shavit2021learning} by introducing a mixed classification-regression
architecture that improves the localization accuracy. Our method is evaluated
on commonly benchmarked indoor and outdoor datasets and is shown to exceed
both multi-scene and state-of-the-art single-scene absolute pose regressors.
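The pipeline the abstract describes, a backbone's activation maps aggregated by transformer self-attention, with per-scene queries decoded into a pose, can be illustrated with a minimal NumPy sketch. All dimensions, weights, and the single-head attention are toy assumptions for illustration, not the paper's actual architecture or trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    # Single-head self-attention over flattened activation-map tokens.
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

# Toy setup: a 7x7 activation map with 32 channels -> 49 tokens of dim 32.
n_tokens, d = 49, 32
tokens = rng.standard_normal((n_tokens, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

encoded = self_attention(tokens, Wq, Wk, Wv)   # (49, 32)

# Decoder step (simplified): one learned query per scene cross-attends to
# the encoded tokens, so multiple scenes are embedded in parallel.
n_scenes = 4
scene_queries = rng.standard_normal((n_scenes, d))
attn = softmax(scene_queries @ encoded.T / np.sqrt(d))
latents = attn @ encoded                       # (n_scenes, d)

# The selected scene's latent is mapped to a pose: xyz + unit quaternion.
scene_id = 0                                   # in practice chosen by a scene classifier
W_pose = rng.standard_normal((d, 7)) * 0.1
pose = latents[scene_id] @ W_pose
position, quat = pose[:3], pose[3:] / np.linalg.norm(pose[3:])
```

The coarse-to-fine extension the abstract mentions would replace the direct regression head with a classification stage (coarse position bin) followed by a residual regression, but the scene-query decoding above is the shared multi-scene mechanism.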
Related papers
- NViST: In the Wild New View Synthesis from a Single Image with Transformers [8.361847255300846]
We propose NViST, a transformer-based model for efficient novel-view synthesis from a single image.
NViST is trained on MVImgNet, a large-scale dataset of casually-captured real-world videos.
We show results on unseen objects and categories from MVImgNet and even generalization to casual phone captures.
arXiv Detail & Related papers (2023-12-13T23:41:17Z) - Pose-Free Generalizable Rendering Transformer [72.47072706742065]
PF-GRT is a Pose-Free framework for Generalizable Rendering Transformer.
PF-GRT is parameterized using a local relative coordinate system.
Experiments with zero-shot rendering show that it produces superior quality in generating photo-realistic images.
arXiv Detail & Related papers (2023-10-05T17:24:36Z) - UnLoc: A Unified Framework for Video Localization Tasks [82.59118972890262]
UnLoc is a new approach for temporal localization in untrimmed videos.
It uses pretrained image and text towers, and feeds tokens to a video-text fusion model.
We achieve state-of-the-art results on all three different localization tasks with a unified approach.
arXiv Detail & Related papers (2023-08-21T22:15:20Z) - Learning to Localize in Unseen Scenes with Relative Pose Regressors [5.672132510411465]
Relative pose regressors (RPRs) localize a camera by estimating its relative translation and rotation to a pose-labelled reference.
In practice, however, the performance of RPRs is significantly degraded in unseen scenes.
We implement aggregation with concatenation, projection, and attention operations (Transformers) and learn to regress the relative pose parameters from the resulting latent codes.
Compared to state-of-the-art RPRs, our model is shown to localize significantly better in unseen environments, across both indoor and outdoor benchmarks, while maintaining competitive performance in seen scenes.
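The three aggregation operations this entry names, concatenation, projection, and attention, can be sketched in NumPy. The dimensions, random weights, and mean-pooled regression head are toy assumptions standing in for the learned model, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

d, n = 16, 10                         # toy feature dim and tokens per image
f_query = rng.standard_normal((n, d))  # query-image features
f_ref = rng.standard_normal((n, d))    # pose-labelled reference features

# 1) Concatenation: pair query and reference tokens channel-wise.
concat = np.concatenate([f_query, f_ref], axis=-1)       # (n, 2d)

# 2) Projection: map each concatenated pair back to d dims.
W_proj = rng.standard_normal((2 * d, d)) * 0.1
projected = concat @ W_proj                              # (n, d)

# 3) Attention: query tokens attend to reference tokens.
scores = np.exp(f_query @ f_ref.T / np.sqrt(d))
scores /= scores.sum(axis=-1, keepdims=True)
attended = scores @ f_ref                                # (n, d)

# Pool the latent codes and regress a relative pose
# (3 translation + 4 rotation-quaternion parameters).
latent = attended.mean(axis=0)
W_head = rng.standard_normal((d, 7)) * 0.1
rel_pose = latent @ W_head
```

The attention variant is the transformer-based option referred to above; the regression head would normally be an MLP trained end-to-end rather than a single random matrix.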
arXiv Detail & Related papers (2023-03-05T17:12:50Z) - Direct Multi-view Multi-person 3D Pose Estimation [138.48139701871213]
We present Multi-view Pose transformer (MvP) for estimating multi-person 3D poses from multi-view images.
MvP directly regresses the multi-person 3D poses in a clean and efficient way, without relying on intermediate tasks.
We show experimentally that our MvP model outperforms the state-of-the-art methods on several benchmarks while being much more efficient.
arXiv Detail & Related papers (2021-11-07T13:09:20Z) - End-to-End Trainable Multi-Instance Pose Estimation with Transformers [68.93512627479197]
We propose a new end-to-end trainable approach for multi-instance pose estimation by combining a convolutional neural network with a transformer.
Inspired by recent work on end-to-end trainable object detection with transformers, we use a transformer encoder-decoder architecture together with a bipartite matching scheme to directly regress the pose of all individuals in a given image.
Our model, called POse Estimation Transformer (POET), is trained using a novel set-based global loss that consists of a keypoint loss, a keypoint visibility loss, a center loss and a class loss.
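The bipartite matching behind such a set-based loss can be shown with a minimal sketch: each predicted instance is paired with the ground-truth instance that minimizes the total cost, and the loss is computed over the matched pairs. Brute-force permutation search and a plain L2 keypoint-center cost are toy stand-ins here, not POET's actual matcher or loss terms:

```python
import itertools

import numpy as np

rng = np.random.default_rng(2)
pred = rng.standard_normal((3, 2))   # 3 predicted 2-D keypoint centers
gt = rng.standard_normal((3, 2))     # 3 ground-truth centers

# Pairwise L2 cost between every prediction and every ground truth.
cost = np.linalg.norm(pred[:, None] - gt[None], axis=-1)  # (3, 3)

# Bipartite matching: pick the assignment with minimal total cost
# (brute force over permutations; real systems use the Hungarian algorithm).
best = min(itertools.permutations(range(3)),
           key=lambda p: sum(cost[i, p[i]] for i in range(3)))
loss = sum(cost[i, best[i]] for i in range(3))
```

In the full method this matched cost would combine the keypoint, visibility, center, and class terms listed above, with gradients flowing only through the matched pairs.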
arXiv Detail & Related papers (2021-03-22T18:19:22Z) - Paying Attention to Activation Maps in Camera Pose Regression [4.232614032390374]
Camera pose regression methods apply a single forward pass to the query image to estimate the camera pose.
We propose an attention-based approach for pose regression, where the convolutional activation maps are used as sequential inputs.
Our proposed approach is shown to compare favorably to contemporary pose regressors schemes and achieves state-of-the-art accuracy across multiple benchmarks.
arXiv Detail & Related papers (2021-03-21T20:10:15Z) - Learning Multi-Scene Absolute Pose Regression with Transformers [4.232614032390374]
A convolutional backbone with a multi-layer perceptron head is trained with images and pose labels to embed a single reference scene at a time.
We propose to learn multi-scene absolute camera pose regression with Transformers, where encoders are used to aggregate activation maps with self-attention.
We evaluate our method on commonly benchmarked indoor and outdoor datasets and show that it surpasses both multi-scene and state-of-the-art single-scene absolute pose regressors.
arXiv Detail & Related papers (2021-03-21T19:21:44Z) - 3D Human Pose Estimation with Spatial and Temporal Transformers [59.433208652418976]
We present PoseFormer, a purely transformer-based approach for 3D human pose estimation in videos.
Inspired by recent developments in vision transformers, we design a spatial-temporal transformer structure.
We quantitatively and qualitatively evaluate our method on two popular and standard benchmark datasets.
arXiv Detail & Related papers (2021-03-18T18:14:37Z) - Do We Really Need Scene-specific Pose Encoders? [0.0]
Visual pose regression models estimate the camera pose from a query image with a single forward pass.
Current models learn pose encoding from an image using deep convolutional networks which are trained per scene.
We propose that scene-specific pose encoders are not required for pose regression and that encodings trained for visual similarity can be used instead.
arXiv Detail & Related papers (2020-12-22T13:59:52Z) - Self-supervised Human Detection and Segmentation via Multi-view Consensus [116.92405645348185]
We propose a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training.
We show that our approach outperforms state-of-the-art self-supervised person detection and segmentation techniques on images that visually depart from those of standard benchmarks.
arXiv Detail & Related papers (2020-12-09T15:47:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.