A Light Touch Approach to Teaching Transformers Multi-view Geometry
- URL: http://arxiv.org/abs/2211.15107v2
- Date: Sun, 2 Apr 2023 12:15:52 GMT
- Title: A Light Touch Approach to Teaching Transformers Multi-view Geometry
- Authors: Yash Bhalgat, Joao F. Henriques, Andrew Zisserman
- Abstract summary: We propose a "light touch" approach to guiding visual Transformers to learn multiple-view geometry.
We achieve this by using epipolar lines to guide the Transformer's cross-attention maps.
Unlike previous methods, our proposal does not require any camera pose information at test-time.
- Score: 80.35521056416242
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers are powerful visual learners, in large part due to their conspicuous lack of manually-specified priors. This flexibility can be problematic in tasks that involve multiple-view geometry, due to the near-infinite possible variations in 3D shapes and viewpoints (requiring flexibility) and the precise nature of projective geometry (obeying rigid laws). To resolve this conundrum, we propose a "light touch" approach, guiding visual Transformers to learn multiple-view geometry but allowing them to break free when needed. We achieve this by using epipolar lines to guide the Transformer's cross-attention maps, penalizing attention values outside the epipolar lines and encouraging higher attention along these lines, since they contain geometrically plausible matches. Unlike previous methods, our proposal does not require any camera pose information at test-time. We focus on pose-invariant object instance retrieval, where standard Transformer networks struggle due to the large differences in viewpoint between query and retrieved images. Experimentally, our method outperforms state-of-the-art approaches at object retrieval, without needing pose information at test-time.
Related papers
- Geometry-aware RL for Manipulation of Varying Shapes and Deformable Objects [14.481805160449282]
Manipulating objects with varying geometries and deformable objects is a major challenge in robotics.
In this work, we frame this problem through the lens of a heterogeneous graph that comprises smaller sub-graphs.
We present a novel and challenging reinforcement learning benchmark, including rigid insertion of diverse objects.
arXiv Detail & Related papers (2025-02-10T20:10:25Z) - SuperPrimitive: Scene Reconstruction at a Primitive Level [23.934492494774116]
Joint camera pose and dense geometry estimation from a set of images or a monocular video remains a challenging problem.
Most dense incremental reconstruction systems operate directly on image pixels and solve for their 3D positions using multi-view geometry cues.
We address this issue with a new image representation which we call a SuperPrimitive.
arXiv Detail & Related papers (2023-12-10T13:44:03Z) - Geometric-aware Pretraining for Vision-centric 3D Object Detection [77.7979088689944]
We propose a novel geometric-aware pretraining framework called GAPretrain.
GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors.
We achieve 46.2 mAP and 55.5 NDS on the nuScenes val set using the BEVFormer method, with a gain of 2.7 and 2.1 points, respectively.
arXiv Detail & Related papers (2023-04-06T14:33:05Z) - Learning Transformations To Reduce the Geometric Shift in Object Detection [60.20931827772482]
We tackle geometric shifts emerging from variations in the image capture process.
We introduce a self-training approach that learns a set of geometric transformations to minimize these shifts.
We evaluate our method on two different shifts, i.e., a camera's field of view (FoV) change and a viewpoint change.
arXiv Detail & Related papers (2023-01-13T11:55:30Z) - Geometry-biased Transformers for Novel View Synthesis [36.11342728319563]
We tackle the task of synthesizing novel views of an object given a few input images and associated camera viewpoints.
Our work is inspired by recent 'geometry-free' approaches where multi-view images are encoded as a (global) set-latent representation.
We propose 'Geometry-biased Transformers' (GBTs) that incorporate geometric inductive biases in the set-latent representation-based inference.
arXiv Detail & Related papers (2023-01-11T18:59:56Z) - Learning Explicit Object-Centric Representations with Vision Transformers [81.38804205212425]
We build on the self-supervision task of masked autoencoding and explore its effectiveness for learning object-centric representations with transformers.
We show that the model efficiently learns to decompose simple scenes as measured by segmentation metrics on several multi-object benchmarks.
arXiv Detail & Related papers (2022-10-25T16:39:49Z) - IRISformer: Dense Vision Transformers for Single-Image Inverse Rendering in Indoor Scenes [99.76677232870192]
We show how a dense vision transformer, IRISformer, excels at both single-task and multi-task reasoning required for inverse rendering.
Specifically, we propose a transformer architecture to simultaneously estimate depths, normals, spatially-varying albedo, roughness and lighting from a single image of an indoor scene.
Our evaluations on benchmark datasets demonstrate state-of-the-art results on each of the above tasks, enabling applications like object insertion and material editing in a single unconstrained real image.
arXiv Detail & Related papers (2022-06-16T19:50:55Z) - Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.