Related papers: LieRE: Lie Rotational Positional Encodings

Related papers

Selective Rotary Position Embedding [84.22998043041198]
We introduce Selective RoPE, an input-dependent rotary embedding mechanism. We show that softmax attention already performs a hidden form of these rotations on query-key pairs. We validate our method by equipping gated transformers with Selective RoPE, demonstrating that its input-dependent rotations improve performance in language modeling.
arXiv Detail & Related papers (2025-11-21T16:50:00Z)
3D-MoRe: Unified Modal-Contextual Reasoning for Embodied Question Answering [52.01655676571933]
3D-MoRe is designed to generate large-scale 3D-language datasets by leveraging the strengths of foundational models. The framework integrates key components, including multi-modal embedding, cross-modal interaction, and a language model decoder. Using the ScanNet 3D scene dataset, along with text annotations from ScanQA and ScanRefer, 3D-MoRe generates 62,000 question-answer pairs and 73,000 object descriptions.
arXiv Detail & Related papers (2025-07-16T08:38:26Z)
SeqPE: Transformer with Sequential Position Encoding [76.22159277300891]
SeqPE represents each $n$-dimensional position index as a symbolic sequence and employs a lightweight sequential position encoder to learn their embeddings. Experiments across language modeling, long-context question answering, and 2D image classification demonstrate that SeqPE not only surpasses strong baselines in perplexity, exact match (EM), and accuracy, but also enables seamless generalization to multi-dimensional inputs without requiring manual architectural redesign.
arXiv Detail & Related papers (2025-06-16T09:16:40Z)
ComRoPE: Scalable and Robust Rotary Position Embedding Parameterized by Trainable Commuting Angle Matrices [25.99231204405503]
We propose ComRoPE, which generalizes Rotary Position Embedding (RoPE) by defining it in terms of trainable commuting angle matrices. We present two types of trainable commuting angle matrices as sufficient solutions to the RoPE equation. Our framework shows versatility in generalizing to existing RoPE formulations and offering new insights for future positional encoding research.
arXiv Detail & Related papers (2025-06-04T09:10:02Z)
Revisiting LRP: Positional Attribution as the Missing Ingredient for Transformer Explainability [53.21677928601684]
Layer-wise relevance propagation is one of the most promising approaches to explainability in deep learning. We propose specialized theoretically-grounded LRP rules designed to propagate attributions across various positional encoding methods. Our method significantly outperforms the state-of-the-art in both vision and NLP explainability tasks.
arXiv Detail & Related papers (2025-06-02T18:07:55Z)
PaTH Attention: Position Encoding via Accumulating Householder Transformations [56.32365080761523]
PaTH is a flexible data-dependent position encoding scheme based on accumulated products of Householder transformations. We derive an efficient parallel algorithm for training through exploiting a compact representation of products of Householder matrices.
arXiv Detail & Related papers (2025-05-22T08:36:09Z)
SLAG: Scalable Language-Augmented Gaussian Splatting [19.643023058839603]
Language-augmented scene representations hold great promise for large-scale robotics applications such as search-and-rescue, smart cities, and mining. Many of these scenarios are time-sensitive, requiring rapid scene encoding while also being data-intensive, necessitating scalable solutions. We introduce SLAG, a multi-GPU framework for language-augmented Gaussian splatting that enhances the speed and scalability of embedding large scenes.
arXiv Detail & Related papers (2025-05-12T23:32:24Z)
A-SCoRe: Attention-based Scene Coordinate Regression for wide-ranging scenarios [1.2093553114715083]
A-SCoRe is an attention-based model that leverages attention at the descriptor-map level to produce meaningful, high-semantic 2D descriptors. Results show that our method achieves performance comparable to state-of-the-art methods on multiple benchmarks while being lightweight and much more flexible.
arXiv Detail & Related papers (2025-03-18T07:39:50Z)
Learning the RoPEs: Better 2D and 3D Position Encodings with STRING [34.997879460336826]
We introduce STRING: Separable Translationally Invariant Position Encodings.
arXiv Detail & Related papers (2025-02-04T18:37:17Z)
Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning [55.339257446600634]
We introduce Robin3D, a powerful 3DLLM trained on large-scale instruction-following data. We construct 1 million instruction-following data, consisting of 344K Adversarial samples, 508K Diverse samples, and 165K benchmark training set samples. Robin3D consistently outperforms previous methods across five widely-used 3D multimodal learning benchmarks.
arXiv Detail & Related papers (2024-09-30T21:55:38Z)
RTMW: Real-Time Multi-Person 2D and 3D Whole-body Pose Estimation [9.121372333621538]
Whole-body pose estimation aims to predict fine-grained pose information for the human body. We present RTMW (Real-Time Multi-person Whole-body pose estimation models), a series of high-performance models for 2D/3D whole-body pose estimation.
arXiv Detail & Related papers (2024-07-11T16:15:47Z)
Implicit-Zoo: A Large-Scale Dataset of Neural Implicit Functions for 2D Images and 3D Scenes [65.22070581594426]
"Implicit-Zoo" is a large-scale dataset requiring thousands of GPU training days to facilitate research and development in this field. We showcase two immediate benefits, as it enables us to: (1) learn token locations for transformer models; (2) directly regress 3D camera poses of 2D images with respect to NeRF models. This in turn leads to improved performance in all three tasks of image classification, semantic segmentation, and 3D pose regression, thereby unlocking new avenues for research.
arXiv Detail & Related papers (2024-06-25T10:20:44Z)
3D-RPE: Enhancing Long-Context Modeling Through 3D Rotary Position Encoding [12.335958945925437]
We propose a novel rotary position encoding on a three-dimensional sphere, named 3D Rotary Position Encoding (3D-RPE). 3D-RPE is an advanced version of the widely used 2D Rotary Position Encoding (RoPE). For controllable long-term decay, 3D-RPE allows for the regulation of long-term decay within the chunk size. For enhanced position resolution, 3D-RPE can mitigate the degradation of position resolution caused by position interpolation on RoPE.
arXiv Detail & Related papers (2024-06-14T10:13:37Z)
MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds the first largest ever multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations, MMScan. The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z)
RoScenes: A Large-scale Multi-view 3D Dataset for Roadside Perception [98.76525636842177]
RoScenes is the largest multi-view roadside perception dataset. Our dataset achieves a surprising 21.13M 3D annotations within 64,000 $m^2$.
arXiv Detail & Related papers (2024-05-16T08:06:52Z)
Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization [51.33923845954759]
3D Visual Grounding (3DVG) and 3D Dense Captioning (3DDC) are two crucial tasks in various 3D applications. We propose a unified framework, 3DGCTR, to jointly solve these two distinct but closely related tasks. In terms of implementation, we integrate a Lightweight Caption Head into the existing 3DVG network with a Caption Text Prompt as a connection.
arXiv Detail & Related papers (2024-04-17T04:46:27Z)
Rotary Position Embedding for Vision Transformer [44.27871591624888]
This study provides a comprehensive analysis of Rotary Position Embedding (RoPE) when applied to Vision Transformer (ViT). RoPE demonstrates impressive extrapolation performance, i.e., maintaining precision while increasing image resolution at inference. It eventually leads to performance improvement for ImageNet-1k, COCO detection, and ADE-20k segmentation.
arXiv Detail & Related papers (2024-03-20T04:47:13Z)
NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized Device Coordinates Space [77.6067460464962]
Monocular 3D Semantic Scene Completion (SSC) has garnered significant attention in recent years due to its potential to predict complex semantics and geometry shapes from a single image, requiring no 3D inputs. We identify several critical issues in current state-of-the-art methods, including the Feature Ambiguity of projected 2D features in the ray to the 3D space, the Pose Ambiguity of the 3D convolution, and the Imbalance in the 3D convolution across different depth levels. We devise a novel Normalized Device Coordinates scene completion network (NDC-Scene) that directly extends the 2
arXiv Detail & Related papers (2023-09-26T02:09:52Z)
V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection [73.37781484123536]
We introduce a highly performant 3D object detector for point clouds using the DETR framework. To address the limitation, we introduce a novel 3D Vertex Relative Position Encoding (3DV-RPE) method. We show exceptional results on the challenging ScanNetV2 benchmark.
arXiv Detail & Related papers (2023-08-08T17:14:14Z)
For SALE: State-Action Representation Learning for Deep Reinforcement Learning [60.42044715596703]
SALE is a novel approach for learning embeddings that model the nuanced interaction between state and action. We integrate SALE and an adaptation of checkpoints for RL into TD3 to form the TD7 algorithm. On OpenAI gym benchmark tasks, TD7 has an average performance gain of 276.7% and 50.7% over TD3 at 300k and 5M time steps, respectively.
arXiv Detail & Related papers (2023-06-04T19:47:46Z)
Learning a Fourier Transform for Linear Relative Positional Encodings in Transformers [71.32827362323205]
We propose a new class of linear Transformers called Learner-Transformers (Learners). They incorporate a wide range of relative positional encoding mechanisms (RPEs). These include regular RPE techniques applied for sequential data, as well as novel RPEs operating on geometric data embedded in higher-dimensional Euclidean spaces.
arXiv Detail & Related papers (2023-02-03T18:57:17Z)
CAGroup3D: Class-Aware Grouping for 3D Object Detection on Point Clouds [55.44204039410225]
We present a novel two-stage fully sparse convolutional 3D object detection framework, named CAGroup3D. Our proposed method first generates some high-quality 3D proposals by leveraging the class-aware local group strategy on the object surface voxels. To recover the features of missed voxels due to incorrect voxel-wise segmentation, we build a fully sparse convolutional RoI pooling module.
arXiv Detail & Related papers (2022-10-09T13:38:48Z)
The Devil is in the Pose: Ambiguity-free 3D Rotation-invariant Learning via Pose-aware Convolution [18.595285633151715]
We develop a Pose-aware Rotation Invariant Convolution (i.e., PaRI-Conv). We propose an Augmented Point Pair Feature (APPF) to fully encode the RI relative pose information, and a factorized dynamic kernel for pose-aware kernel generation. Our PaRI-Conv surpasses the state-of-the-art RI methods while being more compact and efficient.
arXiv Detail & Related papers (2022-05-30T16:11:55Z)
Simple and Effective Synthesis of Indoor 3D Scenes [78.95697556834536]
We study the problem of synthesizing immersive 3D indoor scenes from one or more images. Our aim is to generate high-resolution images and videos from novel viewpoints. We propose an image-to-image GAN that maps directly from reprojections of incomplete point clouds to full high-resolution RGB-D images.
arXiv Detail & Related papers (2022-04-06T17:54:46Z)
Rethinking and Improving Relative Position Encoding for Vision Transformer [61.559777439200744]
Relative position encoding (RPE) is important for transformers to capture the sequence ordering of input tokens. We propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE).
arXiv Detail & Related papers (2021-07-29T17:55:10Z)
Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE). Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using Fast Fourier Transform (FFT).
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
RoFormer: Enhanced Transformer with Rotary Position Embedding [9.01819510933327]
We propose a novel method named Rotary Position Embedding (RoPE) to effectively leverage the positional information. RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in self-attention formulation. We evaluate the enhanced transformer with rotary position embedding, also called RoFormer, on various long text classification benchmark datasets.
arXiv Detail & Related papers (2021-04-20T09:54:06Z)
Making a Case for 3D Convolutions for Object Segmentation in Videos [16.167397418720483]
We show that 3D convolutional networks can be effectively applied to dense video prediction tasks such as salient object segmentation. We propose a 3D decoder architecture that comprises novel 3D Global Convolution layers and 3D Refinement modules. Our approach outperforms existing state-of-the-art methods by a large margin on the DAVIS'16 Unsupervised, FBMS and ViSal benchmarks.
arXiv Detail & Related papers (2020-08-26T12:24:23Z)
Searching Collaborative Agents for Multi-plane Localization in 3D Ultrasound [59.97366727654676]
3D ultrasound (US) is widely used due to its rich diagnostic information, portability and low cost. Standard plane (SP) localization in US volume not only improves efficiency and reduces user-dependence, but also boosts 3D US interpretation. We propose a novel Multi-Agent Reinforcement Learning framework to localize multiple uterine SPs in 3D US simultaneously.
arXiv Detail & Related papers (2020-07-30T07:23:55Z)
Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras. We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera viewpoints. Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.