LieRE: Generalizing Rotary Position Encodings
- URL: http://arxiv.org/abs/2406.10322v3
- Date: Tue, 18 Feb 2025 16:52:29 GMT
- Title: LieRE: Generalizing Rotary Position Encodings
- Authors: Sophie Ostmeier, Brian Axelrod, Michael E. Moseley, Akshay Chaudhari, Curtis Langlotz,
- Abstract summary: Rotary Position (RoPE) has emerged as a popular choice in language models.
RoPE is constrained to one-dimensional sequence data.
LieRE replaces RoPE's block-2D rotation matrix with a learned, dense, high-dimensional rotation matrix of variable sparsity.
- Score: 4.07373334379699
- License:
- Abstract: Transformer architectures rely on position encodings to capture token dependencies. Rotary Position Encoding (RoPE) has emerged as a popular choice in language models due to its efficient encoding of relative position information through key-query rotations. However, RoPE faces significant limitations beyond language processing: it is constrained to one-dimensional sequence data and, even with learnable phases, offers limited representational capacity. We address these challenges with Lie Relative Encodings (LieRE), which replaces RoPE's block-2D rotation matrix with a learned, dense, high-dimensional rotation matrix of variable sparsity. Through extensive evaluation on three image datasets across 2D and 3D classification tasks, LieRE achieves 2\% relative improvement over state-of-the-art baselines on 2D tasks and 1.5\% on 3D tasks, while demonstrating superior generalization to higher resolutions. Our implementation is computationally efficient, with results reproducible on 4 A100 GPUs in 30 minutes on CIFAR100, and we release our code to facilitate further research.
Related papers
- Learning the RoPEs: Better 2D and 3D Position Encodings with STRING [34.997879460336826]
STRING: Separable Translationally Invariant Position s.
We introduce STRING: Separable Translationally Invariant Position s.
arXiv Detail & Related papers (2025-02-04T18:37:17Z) - Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning [55.339257446600634]
We introduce Robin3D, a powerful 3DLLM trained on large-scale instruction-following data.
We construct 1 million instruction-following data, consisting of 344K Adversarial samples, 508K Diverse samples, and 165K benchmark training set samples.
Robin3D consistently outperforms previous methods across five widely-used 3D multimodal learning benchmarks.
arXiv Detail & Related papers (2024-09-30T21:55:38Z) - Implicit-Zoo: A Large-Scale Dataset of Neural Implicit Functions for 2D Images and 3D Scenes [65.22070581594426]
"Implicit-Zoo" is a large-scale dataset requiring thousands of GPU training days to facilitate research and development in this field.
We showcase two immediate benefits as it enables to: (1) learn token locations for transformer models; (2) directly regress 3D cameras poses of 2D images with respect to NeRF models.
This in turn leads to an improved performance in all three task of image classification, semantic segmentation, and 3D pose regression, thereby unlocking new avenues for research.
arXiv Detail & Related papers (2024-06-25T10:20:44Z) - NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized
Device Coordinates Space [77.6067460464962]
Monocular 3D Semantic Scene Completion (SSC) has garnered significant attention in recent years due to its potential to predict complex semantics and geometry shapes from a single image, requiring no 3D inputs.
We identify several critical issues in current state-of-the-art methods, including the Feature Ambiguity of projected 2D features in the ray to the 3D space, the Pose Ambiguity of the 3D convolution, and the Imbalance in the 3D convolution across different depth levels.
We devise a novel Normalized Device Coordinates scene completion network (NDC-Scene) that directly extends the 2
arXiv Detail & Related papers (2023-09-26T02:09:52Z) - V-DETR: DETR with Vertex Relative Position Encoding for 3D Object
Detection [73.37781484123536]
We introduce a highly performant 3D object detector for point clouds using the DETR framework.
To address the limitation, we introduce a novel 3D Relative Position (3DV-RPE) method.
We show exceptional results on the challenging ScanNetV2 benchmark.
arXiv Detail & Related papers (2023-08-08T17:14:14Z) - CAGroup3D: Class-Aware Grouping for 3D Object Detection on Point Clouds [55.44204039410225]
We present a novel two-stage fully sparse convolutional 3D object detection framework, named CAGroup3D.
Our proposed method first generates some high-quality 3D proposals by leveraging the class-aware local group strategy on the object surface voxels.
To recover the features of missed voxels due to incorrect voxel-wise segmentation, we build a fully sparse convolutional RoI pooling module.
arXiv Detail & Related papers (2022-10-09T13:38:48Z) - Inferring 3D change detection from bitemporal optical images [6.050310428775564]
We propose two novel networks, able to solve simultaneously the 2D and 3D CD tasks.
The aim of this work is to lay the foundations for the development of DL algorithms able to automatically infer an elevation (3D) CD map.
The code and the 3DCD dataset are available at urlhttps://sites.google.com/uniroma1.it/3dchangedetection/home-page.
arXiv Detail & Related papers (2022-05-31T15:53:33Z) - The Devil is in the Pose: Ambiguity-free 3D Rotation-invariant Learning
via Pose-aware Convolution [18.595285633151715]
We develop a Pose-aware Rotation Invariant Convolution (i.e., PaRI-Conv)
We propose an Augmented Point Pair Feature (APPF) to fully encode the RI relative pose information, and a factorized dynamic kernel for pose-aware kernel generation.
Our PaRI-Conv surpasses the state-of-the-art RI methods while being more compact and efficient.
arXiv Detail & Related papers (2022-05-30T16:11:55Z) - Making a Case for 3D Convolutions for Object Segmentation in Videos [16.167397418720483]
We show that 3D convolutional networks can be effectively applied to dense video prediction tasks such as salient object segmentation.
We propose a 3D decoder architecture, that comprises novel 3D Global Convolution layers and 3D Refinement modules.
Our approach outperforms existing state-of-the-arts by a large margin on the DAVIS'16 Unsupervised, FBMS and ViSal benchmarks.
arXiv Detail & Related papers (2020-08-26T12:24:23Z) - Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled
Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2d detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.