Learning Multi-Scene Absolute Pose Regression with Transformers
- URL: http://arxiv.org/abs/2103.11468v1
- Date: Sun, 21 Mar 2021 19:21:44 GMT
- Title: Learning Multi-Scene Absolute Pose Regression with Transformers
- Authors: Yoli Shavit, Ron Ferens, Yosi Keller
- Abstract summary: A convolutional backbone with a multi-layer perceptron head is trained with images and pose labels to embed a single reference scene at a time.
We propose to learn multi-scene absolute camera pose regression with Transformers, where encoders are used to aggregate activation maps with self-attention.
We evaluate our method on commonly benchmarked indoor and outdoor datasets and show that it surpasses both multi-scene and state-of-the-art single-scene absolute pose regressors.
- Score: 4.232614032390374
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Absolute camera pose regressors estimate the position and orientation of a
camera from the captured image alone. Typically, a convolutional backbone with
a multi-layer perceptron head is trained with images and pose labels to embed a
single reference scene at a time. Recently, this scheme was extended for
learning multiple scenes by replacing the MLP head with a set of fully
connected layers. In this work, we propose to learn multi-scene absolute camera
pose regression with Transformers, where encoders are used to aggregate
activation maps with self-attention and decoders transform latent features and
scenes encoding into candidate pose predictions. This mechanism allows our
model to focus on general features that are informative for localization while
embedding multiple scenes in parallel. We evaluate our method on commonly
benchmarked indoor and outdoor datasets and show that it surpasses both
multi-scene and state-of-the-art single-scene absolute pose regressors. We make
our code publicly available.
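A minimal PyTorch sketch may help make the described scheme concrete: a convolutional backbone produces activation maps, a Transformer encoder aggregates them with self-attention, and a decoder turns learned per-scene queries into candidate pose predictions. This is an illustrative reconstruction, not the released implementation: the ResNet backbone, the single shared encoder-decoder (the paper uses separate position and orientation branches), the omitted positional encodings, and all layer sizes are assumptions.
```python
import torch
import torch.nn as nn
import torchvision

class MultiScenePoseTransformer(nn.Module):
    """Hypothetical sketch: CNN backbone -> Transformer encoder over
    flattened activation maps -> decoder over learned scene queries."""
    def __init__(self, num_scenes: int, d_model: int = 256):
        super().__init__()
        backbone = torchvision.models.resnet34(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Conv2d(512, d_model, kernel_size=1)
        # positional encodings omitted for brevity
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=4, num_decoder_layers=4, batch_first=True)
        # one learned query per scene; each decoder output is that
        # scene's candidate pose embedding
        self.scene_queries = nn.Embedding(num_scenes, d_model)
        self.head_t = nn.Linear(d_model, 3)   # position (x, y, z)
        self.head_q = nn.Linear(d_model, 4)   # orientation quaternion

    def forward(self, img: torch.Tensor):
        fmap = self.proj(self.backbone(img))         # (B, D, H, W)
        tokens = fmap.flatten(2).transpose(1, 2)     # (B, H*W, D)
        queries = self.scene_queries.weight.unsqueeze(0).expand(
            img.size(0), -1, -1)                     # (B, S, D)
        latents = self.transformer(tokens, queries)  # (B, S, D)
        return self.head_t(latents), self.head_q(latents)

# usage: at inference, keep the candidate for the selected scene
model = MultiScenePoseTransformer(num_scenes=7)
t, q = model(torch.randn(1, 3, 224, 224))   # t: (1, 7, 3), q: (1, 7, 4)
```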
Related papers
- Coarse-to-Fine Multi-Scene Pose Regression with Transformers [19.927662512903915]
A convolutional backbone with a multi-layer perceptron (MLP) head is trained using images and pose labels to embed a single reference scene at a time.
We propose to learn multi-scene absolute camera pose regression with Transformers, where encoders are used to aggregate activation maps with self-attention.
Our method is evaluated on commonly benchmarked indoor and outdoor datasets and is shown to surpass both multi-scene and state-of-the-art single-scene absolute pose regressors.
arXiv Detail & Related papers (2023-08-22T20:43:31Z)
- UnLoc: A Unified Framework for Video Localization Tasks [82.59118972890262]
UnLoc is a new approach for temporal localization in untrimmed videos.
It uses pretrained image and text towers, and feeds tokens to a video-text fusion model.
We achieve state-of-the-art results on all three localization tasks with a unified approach.
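As a rough, non-authoritative illustration of the fusion step summarized here, the sketch below concatenates tokens from pretrained image and text towers and passes them through a small Transformer fusion encoder with a per-frame relevance head; the module shapes, the concatenation scheme, and the scoring head are assumptions rather than UnLoc's actual design.
```python
import torch
import torch.nn as nn

fusion = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=2)
frame_score = nn.Linear(512, 1)           # per-frame relevance logit

vid_tokens = torch.randn(1, 64, 512)      # 64 frames from an image tower
txt_tokens = torch.randn(1, 12, 512)      # 12 tokens from a text tower
fused = fusion(torch.cat([vid_tokens, txt_tokens], dim=1))
logits = frame_score(fused[:, :64]).squeeze(-1)   # (1, 64) frame scores
```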
arXiv Detail & Related papers (2023-08-21T22:15:20Z)
- FlowCam: Training Generalizable 3D Radiance Fields without Camera Poses via Pixel-Aligned Scene Flow [26.528667940013598]
Reconstruction of 3D neural fields from posed images has emerged as a promising method for self-supervised representation learning.
A key challenge preventing the deployment of these 3D scene learners on large-scale video data is their dependence on precise camera poses from structure-from-motion.
We propose a method that jointly reconstructs camera poses and 3D neural scene representations online and in a single forward pass.
arXiv Detail & Related papers (2023-05-31T20:58:46Z)
- BlobGAN: Spatially Disentangled Scene Representations [67.60387150586375]
We propose an unsupervised, mid-level representation for a generative model of scenes.
The representation is mid-level in that it is neither per-pixel nor per-image; rather, scenes are modeled as a collection of spatial, depth-ordered "blobs" of features.
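The blob representation lends itself to a short sketch: below, a scene is a set of blobs, each with a 2-D center, scale, depth order, and a feature vector, splatted back-to-front into a feature grid. The Gaussian mask and the compositing rule are simplifying assumptions, not BlobGAN's exact formulation.
```python
import torch

def splat_blobs(centers, scales, depths, feats, size=64):
    """centers: (K, 2) in [0, 1]; scales: (K,); depths: (K,), larger =
    farther; feats: (K, C). Returns a (C, size, size) feature grid."""
    ys, xs = torch.meshgrid(torch.linspace(0, 1, size),
                            torch.linspace(0, 1, size), indexing="ij")
    grid = torch.zeros(feats.shape[1], size, size)
    for k in torch.argsort(depths, descending=True):   # back-to-front
        d2 = (xs - centers[k, 0]) ** 2 + (ys - centers[k, 1]) ** 2
        alpha = torch.exp(-d2 / (2 * scales[k] ** 2))  # soft blob mask
        # nearer blobs paint over what is already in the grid
        grid = alpha * feats[k][:, None, None] + (1 - alpha) * grid
    return grid

grid = splat_blobs(torch.rand(5, 2), torch.full((5,), 0.1),
                   torch.rand(5), torch.randn(5, 16))  # (16, 64, 64)
```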
arXiv Detail & Related papers (2022-05-05T17:59:55Z)
- Real-Time Neural Character Rendering with Pose-Guided Multiplane Images [75.62730144924566]
We propose pose-guided multiplane image (MPI) synthesis which can render an animatable character in real scenes with photorealistic quality.
We use a portable camera rig to capture the multi-view images along with the driving signal for the moving subject.
arXiv Detail & Related papers (2022-04-25T17:51:38Z)
- Neural Rendering of Humans in Novel View and Pose from Monocular Video [68.37767099240236]
We introduce a new method that generates photo-realistic humans under novel views and poses given a monocular video as input.
Our method significantly outperforms existing approaches under unseen poses and novel views given monocular videos as input.
arXiv Detail & Related papers (2022-04-04T03:09:20Z)
- Paying Attention to Activation Maps in Camera Pose Regression [4.232614032390374]
Camera pose regression methods apply a single forward pass to the query image to estimate the camera pose.
We propose an attention-based approach for pose regression, where the convolutional activation maps are used as sequential inputs.
Our proposed approach is shown to compare favorably to contemporary pose regression schemes and achieves state-of-the-art accuracy across multiple benchmarks.
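A minimal sketch of this idea, with flattened convolutional activation maps used as the token sequence for a Transformer encoder; the layer sizes, the mean pooling, and the single joint pose head are assumptions, and positional encodings are omitted for brevity.
```python
import torch
import torch.nn as nn

enc_layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
pose_head = nn.Linear(256, 7)             # 3-d position + 4-d quaternion

fmap = torch.randn(1, 256, 14, 14)        # CNN activation map (B, C, H, W)
tokens = fmap.flatten(2).transpose(1, 2)  # -> (B, H*W, C) token sequence
pose = pose_head(encoder(tokens).mean(dim=1))   # (B, 7)
```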
arXiv Detail & Related papers (2021-03-21T20:10:15Z)
- Do We Really Need Scene-specific Pose Encoders? [0.0]
Visual pose regression models estimate the camera pose from a query image with a single forward pass.
Current models learn pose encoding from an image using deep convolutional networks which are trained per scene.
We propose that scene-specific pose encoders are not required for pose regression and that encodings trained for visual similarity can be used instead.
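One way to read this claim in code: freeze a generic, similarity-style image encoder (an ImageNet-pretrained ResNet stands in here) and train only a small pose head on top. The frozen backbone and the head shape are illustrative assumptions, not the paper's exact setup.
```python
import torch
import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet50(weights="IMAGENET1K_V2")
backbone.fc = nn.Identity()           # 2048-d visual-similarity features
for p in backbone.parameters():       # encoder is not scene-specific,
    p.requires_grad = False           # so it stays frozen

pose_head = nn.Sequential(            # only this small head is trained
    nn.Linear(2048, 256), nn.ReLU(),
    nn.Linear(256, 7))                # 3-d position + 4-d quaternion

with torch.no_grad():
    feat = backbone(torch.randn(1, 3, 224, 224))
pose = pose_head(feat)                # (1, 7)
```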
arXiv Detail & Related papers (2020-12-22T13:59:52Z)
- Self-supervised Human Detection and Segmentation via Multi-view Consensus [116.92405645348185]
We propose a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training.
We show that our approach outperforms state-of-the-art self-supervised person detection and segmentation techniques on images that visually depart from those of standard benchmarks.
arXiv Detail & Related papers (2020-12-09T15:47:21Z)
- Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D [100.93808824091258]
We propose a new end-to-end architecture that directly extracts a bird's-eye-view representation of a scene given image data from an arbitrary number of cameras.
Our approach is to "lift" each image individually into a frustum of features for each camera, then "splat" all frustums into a bird's-eye-view grid.
We show that the representations inferred by our model enable interpretable end-to-end motion planning by "shooting" template trajectories into a bird's-eye-view cost map output by our network.
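The "lift" and "splat" steps can be illustrated compactly. The sketch below assumes the pixel-to-world geometry (xyz) is precomputed per camera and depth bin, and uses a naive scatter-add in place of the paper's cumulative-sum pooling trick; every name and size is hypothetical.
```python
import torch

def lift_splat(feats, depth_logits, xyz, grid=(100, 100), cell=0.5):
    """feats: (N, C, H, W) per-camera features;
    depth_logits: (N, D, H, W) categorical depth scores;
    xyz: (N, D, H, W, 3) world points per camera/depth/pixel."""
    N, C, H, W = feats.shape
    # lift: outer product of depth distribution and features -> frustum
    frustum = depth_logits.softmax(1).unsqueeze(2) * feats.unsqueeze(1)
    frustum = frustum.permute(0, 1, 3, 4, 2).reshape(-1, C)  # (N*D*H*W, C)
    # splat: bin every frustum point into a bird's-eye-view grid
    ix = ((xyz[..., 0] / cell) + grid[0] // 2).long().clamp(0, grid[0] - 1)
    iy = ((xyz[..., 1] / cell) + grid[1] // 2).long().clamp(0, grid[1] - 1)
    idx = (ix * grid[1] + iy).reshape(-1)
    bev = torch.zeros(grid[0] * grid[1], C)
    bev.index_add_(0, idx, frustum)                  # sum-pool per cell
    return bev.view(grid[0], grid[1], C)

bev = lift_splat(torch.randn(6, 64, 8, 22),          # 6-camera rig
                 torch.randn(6, 41, 8, 22),          # 41 depth bins
                 torch.randn(6, 41, 8, 22, 3))       # world coordinates
```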
arXiv Detail & Related papers (2020-08-13T06:29:01Z)