Cameras as Relative Positional Encoding
- URL: http://arxiv.org/abs/2507.10496v1
- Date: Mon, 14 Jul 2025 17:22:45 GMT
- Title: Cameras as Relative Positional Encoding
- Authors: Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, Angjoo Kanazawa
- Abstract summary: Multi-view transformers must use camera geometry to ground visual tokens in 3D space. We show how relative camera conditioning improves performance in feedforward novel view synthesis. We then verify that these benefits persist for different tasks, stereo depth estimation and discriminative spatial cognition, as well as larger model sizes.
- Score: 37.675563572777136
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers are increasingly prevalent for multi-view computer vision tasks, where geometric relationships between viewpoints are critical for 3D perception. To leverage these relationships, multi-view transformers must use camera geometry to ground visual tokens in 3D space. In this work, we compare techniques for conditioning transformers on cameras: token-level raymap encodings, attention-level relative pose encodings, and a new relative encoding we propose -- Projective Positional Encoding (PRoPE) -- that captures complete camera frustums, both intrinsics and extrinsics, as a relative positional encoding. Our experiments begin by showing how relative camera conditioning improves performance in feedforward novel view synthesis, with further gains from PRoPE. This holds across settings: scenes with both shared and varying intrinsics, when combining token- and attention-level conditioning, and for generalization to inputs with out-of-distribution sequence lengths and camera intrinsics. We then verify that these benefits persist for different tasks, stereo depth estimation and discriminative spatial cognition, as well as larger model sizes.
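The abstract contrasts token-level raymap conditioning with attention-level relative encodings (GTA-style relative pose, and the proposed PRoPE, which also folds in intrinsics). Below is a minimal numpy sketch of the two ingredients this comparison rests on; the function names, matrix conventions, and the 4x4 embedding of the intrinsics are illustrative assumptions, not the paper's implementation of PRoPE.

```python
import numpy as np

def raymap(K, cam_to_world, height, width):
    """Token-level conditioning: per-pixel ray origins and directions.

    K: (3, 3) intrinsics; cam_to_world: (4, 4) camera-to-world extrinsics.
    Returns an (H, W, 6) map that can be concatenated to each view's tokens.
    """
    v, u = np.mgrid[0:height, 0:width]
    pix = np.stack([u + 0.5, v + 0.5, np.ones((height, width))], axis=-1)
    dirs_cam = pix @ np.linalg.inv(K).T                # back-project pixel centers
    dirs_world = dirs_cam @ cam_to_world[:3, :3].T     # rotate rays into world frame
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    origins = np.broadcast_to(cam_to_world[:3, 3], dirs_world.shape)
    return np.concatenate([origins, dirs_world], axis=-1)

def relative_frustum_transform(K_i, cam_to_world_i, K_j, cam_to_world_j):
    """Attention-level conditioning: a transform relating two camera frustums.

    Composes intrinsics with world-to-camera extrinsics so the result depends
    only on the relative geometry of cameras i and j, not on the world frame.
    """
    P_i, P_j = np.eye(4), np.eye(4)
    P_i[:3, :3], P_j[:3, :3] = K_i, K_j
    world_to_cam_i = np.linalg.inv(cam_to_world_i)
    world_to_cam_j = np.linalg.inv(cam_to_world_j)
    return P_j @ world_to_cam_j @ np.linalg.inv(P_i @ world_to_cam_i)
```

A token-level scheme concatenates the raymap channels to each view's image tokens, whereas a relative scheme injects a quantity like `relative_frustum_transform(...)` for every (query camera, key camera) pair inside attention; how PRoPE actually consumes such a transform is specified in the paper, not here.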
Related papers
- A 2D Semantic-Aware Position Encoding for Vision Transformers [32.86183384267028]
Vision transformers have demonstrated significant advantages in computer vision tasks due to their ability to capture long-range dependencies and contextual relationships through self-attention. Existing position encoding techniques, which are largely borrowed from natural language processing, fail to effectively capture semantic-aware positional relationships between image patches. Traditional approaches like absolute position encoding and relative position encoding primarily focus on 1D linear position relationships, often overlooking the semantic similarity between distant yet contextually related patches.
arXiv Detail & Related papers (2025-05-14T15:17:34Z)
- VicaSplat: A Single Run is All You Need for 3D Gaussian Splatting and Camera Estimation from Unposed Video Frames [8.746291192336056]
We present VicaSplat, a novel framework for joint 3D Gaussians reconstruction and camera pose estimation. The core of our method lies in a novel transformer-based network architecture.
arXiv Detail & Related papers (2025-03-13T11:56:05Z)
- GTA: A Geometry-Aware Attention Mechanism for Multi-View Transformers [63.41460219156508]
We argue that existing positional encoding schemes are suboptimal for 3D vision tasks.
We propose a geometry-aware attention mechanism that encodes the geometric structure of tokens as relative transformations.
We show that our attention, called Geometric Transform Attention (GTA), improves learning efficiency and performance of state-of-the-art transformer-based NVS models.
arXiv Detail & Related papers (2023-10-16T13:16:09Z)
- Viewpoint Equivariance for Multi-View 3D Object Detection [35.4090127133834]
State-of-the-art methods focus on reasoning and decoding object bounding boxes from multi-view camera input.
We introduce VEDet, a novel 3D object detection framework that exploits 3D multi-view geometry.
arXiv Detail & Related papers (2023-03-25T19:56:41Z)
- EpipolarNVS: leveraging on Epipolar geometry for single-image Novel View Synthesis [6.103988053817792]
Novel-view synthesis (NVS) can be tackled through different approaches, depending on the general setting.
The most challenging scenario, and the one we address in this work, uses only a single source image to generate a novel view from another viewpoint.
We introduce an innovative method that encodes the viewpoint transformation as a 2D feature image (a sketch of this idea appears after this list).
arXiv Detail & Related papers (2022-10-24T09:54:20Z)
- Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding [96.9752763607738]
We propose a novel positional encoding method based on learnable Fourier features.
Our experiments show that our learnable feature representation for multi-dimensional positional encoding outperforms existing methods (a sketch of this encoding appears after this list).
arXiv Detail & Related papers (2021-06-05T04:40:18Z)
- TransCamP: Graph Transformer for 6-DoF Camera Pose Estimation [77.09542018140823]
We propose a neural network approach with a graph transformer backbone, namely TransCamP, to address the camera relocalization problem.
TransCamP effectively fuses the image features, camera pose information and inter-frame relative camera motions into encoded graph attributes.
arXiv Detail & Related papers (2021-05-28T19:08:43Z)
- A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification [77.08204941207985]
Video-based person re-identification (Re-ID) aims to retrieve video sequences of the same person under non-overlapping cameras.
We propose a novel framework named Trigeminal Transformers (TMT) for video-based person Re-ID.
arXiv Detail & Related papers (2021-04-05T02:50:16Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
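The EpipolarNVS entry above describes encoding a viewpoint transformation as a 2D feature image. One plausible construction, sketched below purely as an illustration (the paper's actual encoding may differ), stores at every source pixel the coefficients of its epipolar line in the target view, computed from the fundamental matrix.

```python
import numpy as np

def skew(t):
    """Cross-product matrix so that skew(t) @ x == np.cross(t, x)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def epipolar_line_image(K_src, K_tgt, R_rel, t_rel, height, width):
    """Encode the relative viewpoint change as an (H, W, 3) feature image.

    Each source pixel stores the normalized coefficients (a, b, c) of its
    epipolar line a*u + b*v + c = 0 in the target view, using the fundamental
    matrix F = K_tgt^{-T} [t]_x R K_src^{-1}.
    """
    F = np.linalg.inv(K_tgt).T @ skew(t_rel) @ R_rel @ np.linalg.inv(K_src)
    v, u = np.mgrid[0:height, 0:width]
    pix = np.stack([u + 0.5, v + 0.5, np.ones((height, width))], axis=-1)
    lines = pix @ F.T                                  # epipolar line per pixel
    return lines / (np.linalg.norm(lines[..., :2], axis=-1, keepdims=True) + 1e-9)
```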
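The Learnable Fourier Features entry above proposes a positional encoding built from a learned linear projection of multi-dimensional coordinates, sinusoids, and a small MLP. The sketch below is a self-contained approximation of that recipe with random stand-in weights; the layer sizes, scaling, and variable names are assumptions rather than the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def fourier_position_encoding(positions, num_features=64, out_dim=128):
    """Map (N, D) coordinates to (N, out_dim) positional encodings.

    In a trained model, W_r and the MLP weights would be learned parameters;
    random values keep this sketch runnable on its own.
    """
    _, d = positions.shape
    W_r = rng.normal(size=(d, num_features // 2))               # learnable projection
    proj = positions @ W_r
    feats = np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)
    feats /= np.sqrt(num_features)                              # keep magnitudes stable
    W1 = rng.normal(size=(num_features, out_dim))               # tiny MLP head
    return np.maximum(feats @ W1, 0.0)                          # ReLU output

# Example: encode every cell of an 8x8 grid of 2D patch positions.
coords = np.stack(np.meshgrid(np.arange(8.0), np.arange(8.0), indexing="ij"), -1)
pe = fourier_position_encoding(coords.reshape(-1, 2))           # shape (64, 128)
```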