A Light Touch Approach to Teaching Transformers Multi-view Geometry
- URL: http://arxiv.org/abs/2211.15107v2
- Date: Sun, 2 Apr 2023 12:15:52 GMT
- Title: A Light Touch Approach to Teaching Transformers Multi-view Geometry
- Authors: Yash Bhalgat, Joao F. Henriques, Andrew Zisserman
- Abstract summary: We propose a "light touch" approach to guiding visual Transformers to learn multiple-view geometry.
We achieve this by using epipolar lines to guide the Transformer's cross-attention maps.
Unlike previous methods, our proposal does not require any camera pose information at test-time.
- Score: 80.35521056416242
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers are powerful visual learners, in large part due to their
conspicuous lack of manually-specified priors. This flexibility can be
problematic in tasks that involve multiple-view geometry, due to the
near-infinite possible variations in 3D shapes and viewpoints (requiring
flexibility), and the precise nature of projective geometry (obeying rigid
laws). To resolve this conundrum, we propose a "light touch" approach, guiding
visual Transformers to learn multiple-view geometry but allowing them to break
free when needed. We achieve this by using epipolar lines to guide the
Transformer's cross-attention maps, penalizing attention values outside the
epipolar lines and encouraging higher attention along these lines since they
contain geometrically plausible matches. Unlike previous methods, our proposal
does not require any camera pose information at test-time. We focus on
pose-invariant object instance retrieval, where standard Transformer networks
struggle, due to the large differences in viewpoint between query and retrieved
images. Experimentally, our method outperforms state-of-the-art approaches at
object retrieval, without needing pose information at test-time.
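The guidance mechanism described in the abstract is concrete enough to sketch. Below is a minimal, illustrative PyTorch example of an epipolar cross-attention penalty, assuming a cross-attention map `attn` over query/key tokens with known pixel coordinates and a fundamental matrix `F` available at training time only (matching the paper's claim that no pose is needed at test time). The function name `epipolar_attention_loss`, the tensor shapes, and the `margin` parameter are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): penalize cross-attention mass that
# falls far from the epipolar lines induced by a known fundamental matrix.
import torch

def epipolar_attention_loss(attn, q_xy, k_xy, F, margin=2.0):
    """Epipolar guidance penalty for cross-attention (illustrative).

    attn   : (Q, K) cross-attention weights, rows sum to 1
    q_xy   : (Q, 2) pixel coordinates of query tokens in image 1
    k_xy   : (K, 2) pixel coordinates of key tokens in image 2
    F      : (3, 3) fundamental matrix mapping image-1 points to image-2 lines
    margin : distance in pixels within which a key counts as "on" the line
    """
    # Homogeneous coordinates and epipolar lines l = F x for each query point.
    q_h = torch.cat([q_xy, torch.ones(q_xy.shape[0], 1, device=q_xy.device)], dim=1)
    k_h = torch.cat([k_xy, torch.ones(k_xy.shape[0], 1, device=k_xy.device)], dim=1)
    lines = q_h @ F.T                                   # (Q, 3)

    # Point-to-line distance |a*u + b*v + c| / sqrt(a^2 + b^2) for every pair.
    num = (lines @ k_h.T).abs()                         # (Q, K)
    den = lines[:, :2].norm(dim=1, keepdim=True).clamp(min=1e-8)
    dist = num / den                                    # (Q, K) in pixels

    on_line = (dist <= margin).float()
    # Attention mass landing outside the epipolar band is penalized.
    off_mass = (attn * (1.0 - on_line)).sum(dim=1)      # (Q,)
    return off_mass.mean()
```

Because each row of `attn` sums to one, penalizing off-line attention mass implicitly encourages higher attention along the epipolar line, where geometrically plausible matches lie. At test time this term is simply dropped, so no camera pose information is required.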
Related papers
- Cross-domain and Cross-dimension Learning for Image-to-Graph Transformers [50.576354045312115]
Direct image-to-graph transformation is a challenging task that requires solving object detection and relationship prediction in a single model.
We introduce a set of methods enabling cross-domain and cross-dimension transfer learning for image-to-graph transformers.
We demonstrate our method's utility in cross-domain and cross-dimension experiments, where we pretrain our models on 2D satellite images before applying them to vastly different target domains in 2D and 3D.
arXiv Detail & Related papers (2024-03-11T10:48:56Z)
- SuperPrimitive: Scene Reconstruction at a Primitive Level [23.934492494774116]
Joint camera pose and dense geometry estimation from a set of images or a monocular video remains a challenging problem.
Most dense incremental reconstruction systems operate directly on image pixels and solve for their 3D positions using multi-view geometry cues.
We address this issue with a new image representation which we call a SuperPrimitive.
arXiv Detail & Related papers (2023-12-10T13:44:03Z)
- Geometric-aware Pretraining for Vision-centric 3D Object Detection [77.7979088689944]
We propose a novel geometric-aware pretraining framework called GAPretrain.
GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors.
We achieve 46.2 mAP and 55.5 NDS on the nuScenes val set using the BEVFormer method, with a gain of 2.7 and 2.1 points, respectively.
arXiv Detail & Related papers (2023-04-06T14:33:05Z)
- Learning Transformations To Reduce the Geometric Shift in Object Detection [60.20931827772482]
We tackle geometric shifts emerging from variations in the image capture process.
We introduce a self-training approach that learns a set of geometric transformations to minimize these shifts.
We evaluate our method on two different shifts, i.e., a camera's field of view (FoV) change and a viewpoint change.
arXiv Detail & Related papers (2023-01-13T11:55:30Z)
- Geometry-biased Transformers for Novel View Synthesis [36.11342728319563]
We tackle the task of synthesizing novel views of an object given a few input images and associated camera viewpoints.
Our work is inspired by recent 'geometry-free' approaches where multi-view images are encoded as a (global) set-latent representation.
We propose 'Geometry-biased Transformers' (GBTs) that incorporate geometric inductive biases in the set-latent representation-based inference.
arXiv Detail & Related papers (2023-01-11T18:59:56Z)
- Learning Explicit Object-Centric Representations with Vision Transformers [81.38804205212425]
We build on the self-supervision task of masked autoencoding and explore its effectiveness for learning object-centric representations with transformers.
We show that the model efficiently learns to decompose simple scenes as measured by segmentation metrics on several multi-object benchmarks.
arXiv Detail & Related papers (2022-10-25T16:39:49Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long-range dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.