Learning the RoPEs: Better 2D and 3D Position Encodings with STRING
- URL: http://arxiv.org/abs/2502.02562v1
- Date: Tue, 04 Feb 2025 18:37:17 GMT
- Title: Learning the RoPEs: Better 2D and 3D Position Encodings with STRING
- Authors: Connor Schenck, Isaac Reid, Mithun George Jacob, Alex Bewley, Joshua Ainslie, David Rendleman, Deepali Jain, Mohit Sharma, Avinava Dubey, Ayzaan Wahid, Sumeet Singh, Rene Wagner, Tianli Ding, Chuyuan Fu, Arunkumar Byravan, Jake Varley, Alexey Gritsenko, Matthias Minderer, Dmitry Kalashnikov, Jonathan Tompson, Vikas Sindhwani, Krzysztof Choromanski
- Abstract summary: We introduce STRING: Separable Translationally Invariant Position Encodings.
- Score: 34.997879460336826
- License:
- Abstract: We introduce STRING: Separable Translationally Invariant Position Encodings. STRING extends Rotary Position Encodings, a recently proposed and widely used algorithm in large language models, via a unifying theoretical framework. Importantly, STRING still provides exact translation invariance, including for token coordinates of arbitrary dimensionality, whilst maintaining a low computational footprint. These properties are especially important in robotics, where efficient 3D token representation is key. We integrate STRING into Vision Transformers with RGB(-D) inputs (color plus optional depth), showing substantial gains, e.g. in open-vocabulary object detection and in robotics controllers. We complement our experiments with a rigorous mathematical analysis, proving the universality of our methods.
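The abstract ships no code; the NumPy sketch below (all names ours) illustrates the property STRING generalizes: when queries and keys are rotated by angles that are linear in their multi-dimensional coordinates, attention logits depend only on relative positions, so translating every token by the same offset leaves them unchanged.

```python
import numpy as np

def rotary_encode_2d(x, coords, thetas):
    """Rotate consecutive feature pairs by angles linear in 2D position.

    x:      (n, d) token features, d even.
    coords: (n, 2) token positions.
    thetas: (2, d // 2) rotation frequencies, one row per coordinate axis.
    """
    angles = coords @ thetas                    # (n, d // 2) total angle per pair
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]             # split features into pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
n, d = 4, 8
freqs = 1.0 / 100.0 ** (np.arange(d // 2) / (d // 2))
thetas = np.stack([freqs, 0.5 * freqs])         # separable: one set per axis
q, k = rng.normal(size=(n, d)), rng.normal(size=(n, d))
coords = rng.uniform(0.0, 10.0, size=(n, 2))

logits = rotary_encode_2d(q, coords, thetas) @ rotary_encode_2d(k, coords, thetas).T
shifted = coords + np.array([3.0, -7.0])        # translate all tokens equally
logits_shifted = rotary_encode_2d(q, shifted, thetas) @ rotary_encode_2d(k, shifted, thetas).T
assert np.allclose(logits, logits_shifted)      # exact translation invariance
```

The invariance follows because rotations are orthogonal: the rotated dot product reduces to a rotation by the angle difference, which depends only on the coordinate difference.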
Related papers
- LieRE: Generalizing Rotary Position Encodings [4.07373334379699]
Rotary Position Embeddings (RoPE) have emerged as a popular choice in language models.
RoPE is constrained to one-dimensional sequence data.
LieRE replaces RoPE's block-2D rotation matrix with a learned, dense, high-dimensional rotation matrix of variable sparsity.
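The summary describes LieRE's construction only at a high level; one way to realize "a learned, dense, high-dimensional rotation matrix" is the matrix exponential of a coordinate-weighted sum of learned skew-symmetric generators. A hedged sketch under that reading (parameter names are ours):

```python
import numpy as np
from scipy.linalg import expm

def liere_rotation(coord, generators):
    """Dense rotation for one position: expm of a skew-symmetric combination.

    coord:      (p,) token position (p = 2 for images, 3 for 3D inputs).
    generators: (p, d, d) learned matrices, skew-symmetrized below so the
                matrix exponential is orthogonal, i.e. a rotation.
    """
    skew = generators - np.swapaxes(generators, -1, -2)  # A - A^T is skew-symmetric
    return expm(np.einsum("p,pij->ij", coord, skew))

rng = np.random.default_rng(1)
p, d = 2, 6
G = rng.normal(scale=0.1, size=(p, d, d))                # stand-in for learned params
R = liere_rotation(np.array([2.0, 5.0]), G)
assert np.allclose(R @ R.T, np.eye(d), atol=1e-8)        # R is orthogonal
q_encoded = R @ rng.normal(size=d)                       # position-encoded query
```

Note that unless the generators commute, such rotations compose only approximately into a function of relative position, which is one gap STRING's framework addresses.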
arXiv Detail & Related papers (2024-06-14T17:41:55Z)
- Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models [0.0]
This paper introduces an efficient and robust method for discovering interpretable circuits in large language models.
We propose training sparse autoencoders on carefully designed positive and negative examples.
Our findings highlight the promise of discrete sparse autoencoders for scalable and efficient mechanistic interpretability.
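The summary stays high-level; for concreteness, a sparse autoencoder in this sense is typically a one-hidden-layer reconstruction model with an L1 penalty on its activations. A minimal NumPy sketch of one manually backpropagated training step (hyperparameters and synthetic data are placeholders, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_hidden, lam, lr = 16, 64, 1e-3, 1e-3
W_enc = rng.normal(scale=0.1, size=(d_model, d_hidden))
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_model))

def sae_step(x):
    """One step on reconstruction loss + L1 sparsity penalty, by hand."""
    global W_enc, W_dec
    h = np.maximum(x @ W_enc, 0.0)             # ReLU features, pushed to be sparse
    err = h @ W_dec - x                        # reconstruction error
    grad_dec = h.T @ err                       # unnormalized grads; lr absorbs scale
    grad_h = err @ W_dec.T + lam * np.sign(h)
    grad_h[h <= 0.0] = 0.0                     # gradient gated by ReLU
    W_enc -= lr * (x.T @ grad_h)
    W_dec -= lr * grad_dec
    return (err ** 2).mean() + lam * np.abs(h).mean()

acts = rng.normal(size=(32, d_model))          # stand-in for model activations
for _ in range(100):
    loss = sae_step(acts)
```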
arXiv Detail & Related papers (2024-05-21T06:26:10Z)
- GTA: A Geometry-Aware Attention Mechanism for Multi-View Transformers [63.41460219156508]
We argue that existing positional encoding schemes are suboptimal for 3D vision tasks.
We propose a geometry-aware attention mechanism that encodes the geometric structure of tokens as relative transformation.
We show that our attention, called Geometric Transform Attention (GTA), improves learning efficiency and performance of state-of-the-art transformer-based NVS models.
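As we read it, encoding "the geometric structure of tokens as relative transformation" means acting on each query and key with that token's pose, so attention logits depend only on relative poses. The sketch below checks this with planar rotations standing in for GTA's general transformations (a simplification, not the paper's implementation):

```python
import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def gta_scores(q, k, poses):
    """Attention logits after applying each token's pose to its query/key.

    Orthogonality gives (T_i q_i) . (T_j k_j) = q_i . (T_i^T T_j k_j), so only
    the relative transformation T_i^T T_j between tokens matters.
    """
    n = q.shape[0]
    scores = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            scores[i, j] = (poses[i] @ q[i]) @ (poses[j] @ k[j])
    return scores

rng = np.random.default_rng(3)
n = 5
poses = np.stack([rot(a) for a in rng.uniform(0, 2 * np.pi, size=n)])
q, k = rng.normal(size=(n, 2)), rng.normal(size=(n, 2))
base = gta_scores(q, k, poses)
# Composing every pose with the same global rotation leaves logits unchanged.
g = rot(0.7)
assert np.allclose(base, gta_scores(q, k, np.stack([g @ P for P in poses])))
```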
arXiv Detail & Related papers (2023-10-16T13:16:09Z)
- Spatial-information Guided Adaptive Context-aware Network for Efficient RGB-D Semantic Segmentation [9.198120596225968]
We propose an efficient lightweight encoder-decoder network that reduces computational cost and parameter count while preserving the robustness of the algorithm.
Experimental results on NYUv2, SUN RGB-D, and Cityscapes datasets show that our method achieves a better trade-off among segmentation accuracy, inference time, and parameters than the state-of-the-art methods.
arXiv Detail & Related papers (2023-08-11T09:02:03Z)
- SPE-Net: Boosting Point Cloud Analysis via Rotation Robustness Enhancement [118.20816888815658]
We propose a novel deep architecture tailored for 3D point cloud applications, named SPE-Net.
The embedded 'Selective Position Encoding (SPE)' procedure relies on an attention mechanism that can effectively attend to the underlying rotation condition of the input.
We demonstrate the merits of SPE-Net and the associated hypothesis on four benchmarks, showing evident improvements on both rotated and unrotated test data over SOTA methods. A hedged sketch of the 'selective' idea follows.
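The summary leaves the selection mechanism abstract; one plausible reading is a gate that blends several candidate encoding branches with input-dependent weights, so the block can adapt to the input's (unknown) rotation. The sketch below is that reading only, with toy branches:

```python
import numpy as np

def selective_encoding(x, branches, w_gate):
    """Blend candidate encoding branches with input-dependent weights.

    x:        (n, d) point features.
    branches: list of callables, each one candidate encoding of x.
    w_gate:   (d, len(branches)) selection head; a softmax over its output
              weights the branches per input.
    """
    logits = x.mean(axis=0) @ w_gate
    gate = np.exp(logits - logits.max())
    gate /= gate.sum()                               # softmax over branches
    return sum(g * f(x) for g, f in zip(gate, branches))

rng = np.random.default_rng(6)
n, d = 12, 4
branches = [lambda x: x, lambda x: x[:, ::-1], lambda x: -x]   # toy variants only
out = selective_encoding(rng.normal(size=(n, d)), branches,
                         rng.normal(size=(d, len(branches))))  # (n, d)
```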
arXiv Detail & Related papers (2022-11-15T15:59:09Z)
- Object Scene Representation Transformer [56.40544849442227]
We introduce Object Scene Representation Transformer (OSRT), a 3D-centric model in which individual object representations naturally emerge through novel view synthesis.
OSRT scales to significantly more complex scenes with larger diversity of objects and backgrounds than existing methods.
It is multiple orders of magnitude faster at compositional rendering thanks to its light field parametrization and the novel Slot Mixer decoder.
arXiv Detail & Related papers (2022-06-14T15:40:47Z)
- CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance [22.39628991021092]
We propose CodedVTR (Codebook-based Voxel TRansformer) for 3D sparse voxel transformers.
On the one hand, we propose the codebook-based attention that projects an attention space into its subspace represented by the combination of "prototypes" in a learnable codebook.
On the other hand, we propose geometry-aware self-attention that utilizes geometric information (geometric pattern, density) to guide attention learning.
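A hedged sketch of the codebook-based attention as summarized above: rather than using the raw query-key distribution directly, each query's attention pattern is replaced by a convex combination of learnable prototype patterns (shapes and names are our assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def codebook_attention(q, k, v, prototypes):
    """Replace raw attention rows with convex mixtures of codebook patterns.

    q, k, v:    (n, d) token features.
    prototypes: (m, n) learnable attention patterns over the n positions;
                attention is constrained to the subspace they span.
    """
    raw = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (n, n) raw attention
    mix = softmax(raw @ prototypes.T)               # (n, m) weight per prototype
    attn = mix @ softmax(prototypes)                # (n, n), rows sum to one
    return attn @ v

rng = np.random.default_rng(4)
n, d, m = 6, 8, 3
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
prototypes = rng.normal(size=(m, n))                # stand-in for learned codebook
out = codebook_attention(q, k, v, prototypes)       # (n, d) attended features
```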
arXiv Detail & Related papers (2022-03-18T11:50:25Z)
- Progressive Coordinate Transforms for Monocular 3D Object Detection [52.00071336733109]
We propose a novel and lightweight approach, dubbed Progressive Coordinate Transforms (PCT), to facilitate learning coordinate representations.
arXiv Detail & Related papers (2021-08-12T15:22:33Z)
- Contextual Transformer Networks for Visual Recognition [103.79062359677452]
We design a novel Transformer-style module, i.e., Contextual Transformer (CoT) block, for visual recognition.
Such design fully capitalizes on the contextual information among input keys to guide the learning of dynamic attention matrix.
Our CoT block is appealing in that it can readily replace each $3\times3$ convolution in ResNet architectures.
arXiv Detail & Related papers (2021-07-26T16:00:21Z)
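Since CoT is described as a drop-in replacement for $3\times3$ convolutions, a simplified 1-D NumPy sketch of the pattern may help: a local convolution produces static context, which is then concatenated with the input to gate a dynamic value path. The real block is 2-D and differs in detail; this only illustrates the static-context-guides-dynamic-attention idea.

```python
import numpy as np

def cot_block_1d(x, w_key, w_attn, w_val):
    """Simplified 1-D Contextual Transformer pattern.

    x:      (n, d) sequence of features (1-D stand-in for a feature map).
    w_key:  (3, d, d) weights of a kernel-size-3 'static context' convolution.
    w_attn: (2 * d, d) maps [static context; input] to dynamic attention logits.
    w_val:  (d, d) value projection.
    """
    n, d = x.shape
    pad = np.pad(x, ((1, 1), (0, 0)))
    # Static context: 3-tap convolution over neighbors (the 3x3 conv in 2D).
    static = sum(pad[i:i + n] @ w_key[i] for i in range(3))
    values = x @ w_val
    # Dynamic path: attention logits guided by the static context, per position.
    logits = np.concatenate([static, x], axis=1) @ w_attn        # (n, d)
    gate = 1.0 / (1.0 + np.exp(-logits))                         # sigmoid gate
    return static + gate * values            # fuse static and dynamic paths

rng = np.random.default_rng(5)
n, d = 10, 4
out = cot_block_1d(rng.normal(size=(n, d)),
                   rng.normal(scale=0.1, size=(3, d, d)),
                   rng.normal(scale=0.1, size=(2 * d, d)),
                   rng.normal(scale=0.1, size=(d, d)))           # (n, d)
```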