Coarse-to-Fine Multi-Scene Pose Regression with Transformers
- URL: http://arxiv.org/abs/2308.11783v1
- Date: Tue, 22 Aug 2023 20:43:31 GMT
- Title: Coarse-to-Fine Multi-Scene Pose Regression with Transformers
- Authors: Yoli Shavit, Ron Ferens, Yosi Keller
- Abstract summary: A convolutional backbone with a multi-layer perceptron (MLP) head is trained using images and pose labels to embed a single reference scene at a time.
We propose to learn multi-scene absolute camera pose regression with Transformers, where encoders are used to aggregate activation maps with self-attention.
Our method is evaluated on commonly benchmarked indoor and outdoor datasets and is shown to exceed both multi-scene and state-of-the-art single-scene absolute pose regressors.
- Score: 19.927662512903915
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Absolute camera pose regressors estimate the position and orientation of a
camera given the captured image alone. Typically, a convolutional backbone with
a multi-layer perceptron (MLP) head is trained using images and pose labels to
embed a single reference scene at a time. Recently, this scheme was extended to
learn multiple scenes by replacing the MLP head with a set of fully connected
layers. In this work, we propose to learn multi-scene absolute camera pose
regression with Transformers, where encoders are used to aggregate activation
maps with self-attention and decoders transform latent features and scenes
encoding into pose predictions. This allows our model to focus on general
features that are informative for localization, while embedding multiple scenes
in parallel. We extend our previous MS-Transformer approach
\cite{shavit2021learning} by introducing a mixed classification-regression
architecture that improves the localization accuracy. Our method is evaluated
on commonly benchmarked indoor and outdoor datasets and is shown to exceed
both multi-scene and state-of-the-art single-scene absolute pose regressors.
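The pipeline the abstract describes, a backbone's activation maps aggregated by transformer self-attention, with per-scene queries decoded into a pose, can be illustrated with a minimal NumPy sketch. All dimensions, weights, and the single-head attention are toy assumptions for illustration, not the paper's actual architecture or trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    # Single-head self-attention over flattened activation-map tokens.
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

# Toy setup: a 7x7 activation map with 32 channels -> 49 tokens of dim 32.
n_tokens, d = 49, 32
tokens = rng.standard_normal((n_tokens, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

encoded = self_attention(tokens, Wq, Wk, Wv)   # (49, 32)

# Decoder step (simplified): one learned query per scene cross-attends to
# the encoded tokens, so multiple scenes are embedded in parallel.
n_scenes = 4
scene_queries = rng.standard_normal((n_scenes, d))
attn = softmax(scene_queries @ encoded.T / np.sqrt(d))
latents = attn @ encoded                       # (n_scenes, d)

# The selected scene's latent is mapped to a pose: xyz + unit quaternion.
scene_id = 0                                   # in practice chosen by a scene classifier
W_pose = rng.standard_normal((d, 7)) * 0.1
pose = latents[scene_id] @ W_pose
position, quat = pose[:3], pose[3:] / np.linalg.norm(pose[3:])
```

The coarse-to-fine extension the abstract mentions would replace the direct regression head with a classification stage (coarse position bin) followed by a residual regression, but the scene-query decoding above is the shared multi-scene mechanism.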
Related papers
- NViST: In the Wild New View Synthesis from a Single Image with Transformers [8.361847255300846]
We propose NViST, a transformer-based model for efficient novel-view synthesis from a single image.
NViST is trained on MVImgNet, a large-scale dataset of casually-captured real-world videos.
We show results on unseen objects and categories from MVImgNet and even generalization to casual phone captures.
arXiv Detail & Related papers (2023-12-13T23:41:17Z) - Pose-Free Generalizable Rendering Transformer [72.47072706742065]
PF-GRT is a Pose-Free framework for Generalizable Rendering Transformer.
PF-GRT is parameterized using a local relative coordinate system.
Experiments with zero-shot rendering show that it produces superior quality in generating photo-realistic images.
arXiv Detail & Related papers (2023-10-05T17:24:36Z) - UnLoc: A Unified Framework for Video Localization Tasks [82.59118972890262]
UnLoc is a new approach for temporal localization in untrimmed videos.
It uses pretrained image and text towers, and feeds tokens to a video-text fusion model.
We achieve state-of-the-art results on all three different localization tasks with a unified approach.
arXiv Detail & Related papers (2023-08-21T22:15:20Z) - Learning to Localize in Unseen Scenes with Relative Pose Regressors [5.672132510411465]
Relative pose regressors (RPRs) localize a camera by estimating its relative translation and rotation to a pose-labelled reference.
In practice, however, the performance of RPRs is significantly degraded in unseen scenes.
We implement aggregation with concatenation, projection, and attention operations (Transformers) and learn to regress the relative pose parameters from the resulting latent codes.
Compared to state-of-the-art RPRs, our model is shown to localize significantly better in unseen environments, across both indoor and outdoor benchmarks, while maintaining competitive performance in seen scenes.
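The three aggregation operations this entry names, concatenation, projection, and attention, can be sketched in NumPy. The dimensions, random weights, and mean-pooled regression head are toy assumptions standing in for the learned model, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

d, n = 16, 10                         # toy feature dim and tokens per image
f_query = rng.standard_normal((n, d))  # query-image features
f_ref = rng.standard_normal((n, d))    # pose-labelled reference features

# 1) Concatenation: pair query and reference tokens channel-wise.
concat = np.concatenate([f_query, f_ref], axis=-1)       # (n, 2d)

# 2) Projection: map each concatenated pair back to d dims.
W_proj = rng.standard_normal((2 * d, d)) * 0.1
projected = concat @ W_proj                              # (n, d)

# 3) Attention: query tokens attend to reference tokens.
scores = np.exp(f_query @ f_ref.T / np.sqrt(d))
scores /= scores.sum(axis=-1, keepdims=True)
attended = scores @ f_ref                                # (n, d)

# Pool the latent codes and regress a relative pose
# (3 translation + 4 rotation-quaternion parameters).
latent = attended.mean(axis=0)
W_head = rng.standard_normal((d, 7)) * 0.1
rel_pose = latent @ W_head
```

The attention variant is the transformer-based option referred to above; the regression head would normally be an MLP trained end-to-end rather than a single random matrix.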
arXiv Detail & Related papers (2023-03-05T17:12:50Z) - Direct Multi-view Multi-person 3D Pose Estimation [138.48139701871213]
We present Multi-view Pose transformer (MvP) for estimating multi-person 3D poses from multi-view images.
MvP directly regresses the multi-person 3D poses in a clean and efficient way, without relying on intermediate tasks.
We show experimentally that our MvP model outperforms the state-of-the-art methods on several benchmarks while being much more efficient.
arXiv Detail & Related papers (2021-11-07T13:09:20Z) - End-to-End Trainable Multi-Instance Pose Estimation with Transformers [68.93512627479197]
We propose a new end-to-end trainable approach for multi-instance pose estimation by combining a convolutional neural network with a transformer.
Inspired by recent work on end-to-end trainable object detection with transformers, we use a transformer encoder-decoder architecture together with a bipartite matching scheme to directly regress the pose of all individuals in a given image.
Our model, called POse Estimation Transformer (POET), is trained using a novel set-based global loss that consists of a keypoint loss, a keypoint visibility loss, a center loss and a class loss.
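The bipartite matching behind such a set-based loss can be shown with a minimal sketch: each predicted instance is paired with the ground-truth instance that minimizes the total cost, and the loss is computed over the matched pairs. Brute-force permutation search and a plain L2 keypoint-center cost are toy stand-ins here, not POET's actual matcher or loss terms:

```python
import itertools

import numpy as np

rng = np.random.default_rng(2)
pred = rng.standard_normal((3, 2))   # 3 predicted 2-D keypoint centers
gt = rng.standard_normal((3, 2))     # 3 ground-truth centers

# Pairwise L2 cost between every prediction and every ground truth.
cost = np.linalg.norm(pred[:, None] - gt[None], axis=-1)  # (3, 3)

# Bipartite matching: pick the assignment with minimal total cost
# (brute force over permutations; real systems use the Hungarian algorithm).
best = min(itertools.permutations(range(3)),
           key=lambda p: sum(cost[i, p[i]] for i in range(3)))
loss = sum(cost[i, best[i]] for i in range(3))
```

In the full method this matched cost would combine the keypoint, visibility, center, and class terms listed above, with gradients flowing only through the matched pairs.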
arXiv Detail & Related papers (2021-03-22T18:19:22Z) - Paying Attention to Activation Maps in Camera Pose Regression [4.232614032390374]
Camera pose regression methods apply a single forward pass to the query image to estimate the camera pose.
We propose an attention-based approach for pose regression, where the convolutional activation maps are used as sequential inputs.
Our proposed approach is shown to compare favorably to contemporary pose regressors schemes and achieves state-of-the-art accuracy across multiple benchmarks.
arXiv Detail & Related papers (2021-03-21T20:10:15Z) - Learning Multi-Scene Absolute Pose Regression with Transformers [4.232614032390374]
A convolutional backbone with a multi-layer perceptron head is trained with images and pose labels to embed a single reference scene at a time.
We propose to learn multi-scene absolute camera pose regression with Transformers, where encoders are used to aggregate activation maps with self-attention.
We evaluate our method on commonly benchmarked indoor and outdoor datasets and show that it surpasses both multi-scene and state-of-the-art single-scene absolute pose regressors.
arXiv Detail & Related papers (2021-03-21T19:21:44Z) - 3D Human Pose Estimation with Spatial and Temporal Transformers [59.433208652418976]
We present PoseFormer, a purely transformer-based approach for 3D human pose estimation in videos.
Inspired by recent developments in vision transformers, we design a spatial-temporal transformer structure.
We quantitatively and qualitatively evaluate our method on two popular and standard benchmark datasets.
arXiv Detail & Related papers (2021-03-18T18:14:37Z) - Do We Really Need Scene-specific Pose Encoders? [0.0]
Visual pose regression models estimate the camera pose from a query image with a single forward pass.
Current models learn pose encoding from an image using deep convolutional networks which are trained per scene.
We propose that scene-specific pose encoders are not required for pose regression and that encodings trained for visual similarity can be used instead.
arXiv Detail & Related papers (2020-12-22T13:59:52Z) - Self-supervised Human Detection and Segmentation via Multi-view Consensus [116.92405645348185]
We propose a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training.
We show that our approach outperforms state-of-the-art self-supervised person detection and segmentation techniques on images that visually depart from those of standard benchmarks.
arXiv Detail & Related papers (2020-12-09T15:47:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.