A Light Touch Approach to Teaching Transformers Multi-view Geometry
- URL: http://arxiv.org/abs/2211.15107v2
- Date: Sun, 2 Apr 2023 12:15:52 GMT
- Title: A Light Touch Approach to Teaching Transformers Multi-view Geometry
- Authors: Yash Bhalgat, Joao F. Henriques, Andrew Zisserman
- Abstract summary: We propose a "light touch" approach to guiding visual Transformers to learn multiple-view geometry.
We achieve this by using epipolar lines to guide the Transformer's cross-attention maps.
Unlike previous methods, our proposal does not require any camera pose information at test-time.
- Score: 80.35521056416242
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers are powerful visual learners, in large part due to their conspicuous lack of manually-specified priors. This flexibility can be problematic in tasks that involve multiple-view geometry, due to the near-infinite possible variations in 3D shapes and viewpoints (requiring flexibility) and the precise nature of projective geometry (obeying rigid laws). To resolve this conundrum, we propose a "light touch" approach, guiding visual Transformers to learn multiple-view geometry but allowing them to break free when needed. We achieve this by using epipolar lines to guide the Transformer's cross-attention maps, penalizing attention values outside the epipolar lines and encouraging higher attention along these lines, since they contain geometrically plausible matches. Unlike previous methods, our proposal does not require any camera pose information at test-time. We focus on pose-invariant object instance retrieval, where standard Transformer networks struggle due to the large differences in viewpoint between query and retrieved images. Experimentally, our method outperforms state-of-the-art approaches at object retrieval, without needing pose information at test-time.
Related papers
- Geometry-aware RL for Manipulation of Varying Shapes and Deformable Objects [14.481805160449282]
Manipulating objects with varying geometries and deformable objects is a major challenge in robotics.
In this work, we frame this problem through the lens of a heterogeneous graph that comprises smaller sub-graphs.
We present a novel and challenging reinforcement learning benchmark, including rigid insertion of diverse objects.
arXiv Detail & Related papers (2025-02-10T20:10:25Z) - SuperPrimitive: Scene Reconstruction at a Primitive Level [23.934492494774116]
Joint camera pose and dense geometry estimation from a set of images or a monocular video remains a challenging problem.
Most dense incremental reconstruction systems operate directly on image pixels and solve for their 3D positions using multi-view geometry cues.
We address this issue with a new image representation which we call a SuperPrimitive.
arXiv Detail & Related papers (2023-12-10T13:44:03Z) - Geometric-aware Pretraining for Vision-centric 3D Object Detection [77.7979088689944]
We propose a novel geometric-aware pretraining framework called GAPretrain.
GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors.
We achieve 46.2 mAP and 55.5 NDS on the nuScenes val set using the BEVFormer method, with a gain of 2.7 and 2.1 points, respectively.
arXiv Detail & Related papers (2023-04-06T14:33:05Z) - Learning Transformations To Reduce the Geometric Shift in Object Detection [60.20931827772482]
We tackle geometric shifts emerging from variations in the image capture process.
We introduce a self-training approach that learns a set of geometric transformations to minimize these shifts.
We evaluate our method on two different shifts, i.e., a camera's field of view (FoV) change and a viewpoint change.
arXiv Detail & Related papers (2023-01-13T11:55:30Z) - Geometry-biased Transformers for Novel View Synthesis [36.11342728319563]
We tackle the task of synthesizing novel views of an object given a few input images and associated camera viewpoints.
Our work is inspired by recent 'geometry-free' approaches where multi-view images are encoded as a (global) set-latent representation.
We propose 'Geometry-biased Transformers' (GBTs) that incorporate geometric inductive biases in the set-latent representation-based inference.
arXiv Detail & Related papers (2023-01-11T18:59:56Z) - Learning Explicit Object-Centric Representations with Vision Transformers [81.38804205212425]
We build on the self-supervision task of masked autoencoding and explore its effectiveness for learning object-centric representations with transformers.
We show that the model efficiently learns to decompose simple scenes as measured by segmentation metrics on several multi-object benchmarks.
arXiv Detail & Related papers (2022-10-25T16:39:49Z) - IRISformer: Dense Vision Transformers for Single-Image Inverse Rendering in Indoor Scenes [99.76677232870192]
We show how a dense vision transformer, IRISformer, excels at both single-task and multi-task reasoning required for inverse rendering.
Specifically, we propose a transformer architecture to simultaneously estimate depths, normals, spatially-varying albedo, roughness and lighting from a single image of an indoor scene.
Our evaluations on benchmark datasets demonstrate state-of-the-art results on each of the above tasks, enabling applications like object insertion and material editing in a single unconstrained real image.
arXiv Detail & Related papers (2022-06-16T19:50:55Z) - Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.