LieRE: Generalizing Rotary Position Encodings
- URL: http://arxiv.org/abs/2406.10322v3
- Date: Tue, 18 Feb 2025 16:52:29 GMT
- Title: LieRE: Generalizing Rotary Position Encodings
- Authors: Sophie Ostmeier, Brian Axelrod, Michael E. Moseley, Akshay Chaudhari, Curtis Langlotz
- Abstract summary: Rotary Position Encoding (RoPE) has emerged as a popular choice in language models. RoPE is constrained to one-dimensional sequence data. LieRE replaces RoPE's block-2D rotation matrix with a learned, dense, high-dimensional rotation matrix of variable sparsity.
- Score: 4.07373334379699
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer architectures rely on position encodings to capture token dependencies. Rotary Position Encoding (RoPE) has emerged as a popular choice in language models due to its efficient encoding of relative position information through key-query rotations. However, RoPE faces significant limitations beyond language processing: it is constrained to one-dimensional sequence data and, even with learnable phases, offers limited representational capacity. We address these challenges with Lie Relative Encodings (LieRE), which replaces RoPE's block-2D rotation matrix with a learned, dense, high-dimensional rotation matrix of variable sparsity. Through extensive evaluation on three image datasets across 2D and 3D classification tasks, LieRE achieves 2% relative improvement over state-of-the-art baselines on 2D tasks and 1.5% on 3D tasks, while demonstrating superior generalization to higher resolutions. Our implementation is computationally efficient, with results reproducible on 4 A100 GPUs in 30 minutes on CIFAR100, and we release our code to facilitate further research.
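To make the mechanism in the abstract concrete, here is a minimal, hypothetical PyTorch sketch of the LieRE idea: each spatial axis gets a learned skew-symmetric generator, a token's position selects a rotation via the matrix exponential, and that rotation is applied to keys and queries before attention. The names (`LieRESketch`, `skew_symmetric`, `generators`) and shapes are my own assumptions, not the authors' released code, and the sparsity control the abstract mentions is omitted here.

```python
import torch

def skew_symmetric(params, dim):
    # Assemble a dense skew-symmetric matrix (A^T = -A) from its free
    # upper-triangular parameters; exp(A) is then a rotation matrix.
    A = torch.zeros(dim, dim, dtype=params.dtype, device=params.device)
    rows, cols = torch.triu_indices(dim, dim, offset=1)
    A[rows, cols] = params
    return A - A.T

class LieRESketch(torch.nn.Module):
    """Hypothetical sketch: one learned generator A_i per spatial axis.
    A token at position p gets the rotation R(p) = expm(sum_i p_i * A_i),
    applied to its key/query vectors before the attention dot product."""

    def __init__(self, head_dim: int, n_axes: int):
        super().__init__()
        n_free = head_dim * (head_dim - 1) // 2  # free parameters per generator
        self.generators = torch.nn.Parameter(0.01 * torch.randn(n_axes, n_free))
        self.head_dim = head_dim

    def forward(self, x, positions):
        # x: (tokens, head_dim) keys or queries
        # positions: (tokens, n_axes), e.g. normalized (row, col) for 2D patches
        A = torch.stack([skew_symmetric(g, self.head_dim) for g in self.generators])
        R = torch.matrix_exp(torch.einsum("ta,aij->tij", positions, A))
        return torch.einsum("tij,tj->ti", R, x)
```

Because each R(p) is orthogonal, rotated keys and queries interact through R(p)ᵀR(q); when the generators commute this reduces to a function of q - p, recovering RoPE's relative behavior as a special case.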
Related papers
- A-SCoRe: Attention-based Scene Coordinate Regression for wide-ranging scenarios [1.2093553114715083]
A-SCoRe is an attention-based model that leverages attention at the descriptor-map level to produce meaningful, high-semantic 2D descriptors.
Results show the method achieves performance comparable to state-of-the-art methods on multiple benchmarks while being lightweight and much more flexible.
arXiv Detail & Related papers (2025-03-18T07:39:50Z) - Learning the RoPEs: Better 2D and 3D Position Encodings with STRING [34.997879460336826]
We introduce STRING: Separable Translationally Invariant Position Encodings.
arXiv Detail & Related papers (2025-02-04T18:37:17Z) - Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning [55.339257446600634]
We introduce Robin3D, a powerful 3DLLM trained on large-scale instruction-following data.
We construct 1 million instruction-following data, consisting of 344K Adversarial samples, 508K Diverse samples, and 165K benchmark training set samples.
Robin3D consistently outperforms previous methods across five widely-used 3D multimodal learning benchmarks.
arXiv Detail & Related papers (2024-09-30T21:55:38Z) - RTMW: Real-Time Multi-Person 2D and 3D Whole-body Pose Estimation [9.121372333621538]
Whole-body pose estimation aims to predict fine-grained pose information for the human body.
We present RTMW (Real-Time Multi-person Whole-body pose estimation models), a series of high-performance models for 2D/3D whole-body pose estimation.
arXiv Detail & Related papers (2024-07-11T16:15:47Z) - Implicit-Zoo: A Large-Scale Dataset of Neural Implicit Functions for 2D Images and 3D Scenes [65.22070581594426]
"Implicit-Zoo" is a large-scale dataset requiring thousands of GPU training days to facilitate research and development in this field.
We showcase two immediate benefits: (1) learning token locations for transformer models; (2) directly regressing the 3D camera poses of 2D images with respect to NeRF models.
This in turn leads to improved performance on all three tasks of image classification, semantic segmentation, and 3D pose regression, thereby unlocking new avenues for research.
arXiv Detail & Related papers (2024-06-25T10:20:44Z) - 3D-RPE: Enhancing Long-Context Modeling Through 3D Rotary Position Encoding [12.335958945925437]
We propose a novel rotary position encoding on a three-dimensional sphere, named 3D Rotary Position Encoding (3D-RPE).
3D-RPE is an advanced version of the widely used 2D Rotary Position Encoding (RoPE).
For controllable long-term decay, 3D-RPE allows for the regulation of long-term decay within the chunk size.
For enhanced position resolution, 3D-RPE can mitigate the degradation of position resolution caused by position interpolation on RoPE.
arXiv Detail & Related papers (2024-06-14T10:13:37Z) - RoScenes: A Large-scale Multi-view 3D Dataset for Roadside Perception [98.76525636842177]
RoScenes is the largest multi-view roadside perception dataset.
Our dataset contains 21.13M 3D annotations within 64,000 m².
arXiv Detail & Related papers (2024-05-16T08:06:52Z) - Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization [51.33923845954759]
3D Visual Grounding (3DVG) and 3D Captioning (3DDC) are two crucial tasks in various 3D applications.
We propose a unified framework, 3DGCTR, to jointly solve these two distinct but closely related tasks.
In terms of implementation, we integrate a Lightweight Caption Head into the existing 3DVG network with a Caption Text Prompt as a connection.
arXiv Detail & Related papers (2024-04-17T04:46:27Z) - Rotary Position Embedding for Vision Transformer [44.27871591624888]
This study provides a comprehensive analysis of Rotary Position Embedding (RoPE) when applied to Vision Transformers (ViT).
RoPE demonstrates impressive extrapolation performance, i.e., maintaining precision while increasing image resolution at inference.
This ultimately yields performance improvements on ImageNet-1k classification, COCO detection, and ADE-20k segmentation.
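For reference, the following is a minimal, hypothetical PyTorch sketch of one common way RoPE is extended to 2D for ViT patches: the axial construction, in which half of the channels rotate with the row coordinate and half with the column coordinate. Function names and the exact channel split are assumptions for illustration, not this paper's code.

```python
import torch

def rope_1d(x, pos, base=10000.0):
    # Standard 1D RoPE: rotate each channel pair of x by the angle pos * theta_k.
    d = x.shape[-1]
    theta = base ** (-torch.arange(0, d, 2, dtype=x.dtype) / d)  # (d/2,)
    ang = pos[..., None] * theta                                 # (..., d/2)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = ang.cos(), ang.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, rows, cols):
    # Axial 2D RoPE: first half of the channels encodes the row coordinate,
    # second half the column coordinate. head_dim should be divisible by 4.
    h = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :h], rows), rope_1d(x[..., h:], cols)], dim=-1)
```

For a 14x14 patch grid, `rows` and `cols` would simply be the integer grid coordinates of each of the 196 tokens; extrapolation to a larger grid at inference only changes these coordinates, not the weights.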
arXiv Detail & Related papers (2024-03-20T04:47:13Z) - NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized Device Coordinates Space [77.6067460464962]
Monocular 3D Semantic Scene Completion (SSC) has garnered significant attention in recent years due to its potential to predict complex semantics and geometry shapes from a single image, requiring no 3D inputs.
We identify several critical issues in current state-of-the-art methods, including the Feature Ambiguity of projected 2D features in the ray to the 3D space, the Pose Ambiguity of the 3D convolution, and the Imbalance in the 3D convolution across different depth levels.
We devise a novel Normalized Device Coordinates scene completion network (NDC-Scene) that directly extends the 2D feature map to the Normalized Device Coordinates (NDC) space rather than the world space.
arXiv Detail & Related papers (2023-09-26T02:09:52Z) - V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection [73.37781484123536]
We introduce a highly performant 3D object detector for point clouds using the DETR framework.
To address the limitation, we introduce a novel 3D Vertex Relative Position Encoding (3DV-RPE) method.
We show exceptional results on the challenging ScanNetV2 benchmark.
arXiv Detail & Related papers (2023-08-08T17:14:14Z) - For SALE: State-Action Representation Learning for Deep Reinforcement Learning [60.42044715596703]
SALE is a novel approach for learning embeddings that model the nuanced interaction between state and action.
We integrate SALE and an adaptation of checkpoints for RL into TD3 to form the TD7 algorithm.
On OpenAI gym benchmark tasks, TD7 has an average performance gain of 276.7% and 50.7% over TD3 at 300k and 5M time steps, respectively.
arXiv Detail & Related papers (2023-06-04T19:47:46Z) - CAGroup3D: Class-Aware Grouping for 3D Object Detection on Point Clouds [55.44204039410225]
We present a novel two-stage fully sparse convolutional 3D object detection framework, named CAGroup3D.
Our proposed method first generates high-quality 3D proposals by leveraging a class-aware local grouping strategy on object surface voxels.
To recover the features of missed voxels due to incorrect voxel-wise segmentation, we build a fully sparse convolutional RoI pooling module.
arXiv Detail & Related papers (2022-10-09T13:38:48Z) - The Devil is in the Pose: Ambiguity-free 3D Rotation-invariant Learning via Pose-aware Convolution [18.595285633151715]
We develop a Pose-aware Rotation Invariant Convolution (i.e., PaRI-Conv).
We propose an Augmented Point Pair Feature (APPF) to fully encode the RI relative pose information, and a factorized dynamic kernel for pose-aware kernel generation.
Our PaRI-Conv surpasses the state-of-the-art RI methods while being more compact and efficient.
arXiv Detail & Related papers (2022-05-30T16:11:55Z) - Simple and Effective Synthesis of Indoor 3D Scenes [78.95697556834536]
We study the problem of synthesizing immersive 3D indoor scenes from one or more images.
Our aim is to generate high-resolution images and videos from novel viewpoints.
We propose an image-to-image GAN that maps directly from reprojections of incomplete point clouds to full high-resolution RGB-D images.
arXiv Detail & Related papers (2022-04-06T17:54:46Z) - Making a Case for 3D Convolutions for Object Segmentation in Videos [16.167397418720483]
We show that 3D convolutional networks can be effectively applied to dense video prediction tasks such as salient object segmentation.
We propose a 3D decoder architecture that comprises novel 3D Global Convolution layers and 3D Refinement modules.
Our approach outperforms existing state-of-the-art methods by a large margin on the DAVIS'16 Unsupervised, FBMS and ViSal benchmarks.
arXiv Detail & Related papers (2020-08-26T12:24:23Z) - Searching Collaborative Agents for Multi-plane Localization in 3D Ultrasound [59.97366727654676]
3D ultrasound (US) is widely used due to its rich diagnostic information, portability and low cost.
Standard plane (SP) localization in US volume not only improves efficiency and reduces user-dependence, but also boosts 3D US interpretation.
We propose a novel Multi-Agent Reinforcement Learning framework to localize multiple uterine SPs in 3D US simultaneously.
arXiv Detail & Related papers (2020-07-30T07:23:55Z) - Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.