3D-RPE: Enhancing Long-Context Modeling Through 3D Rotary Position Encoding
- URL: http://arxiv.org/abs/2406.09897v1
- Date: Fri, 14 Jun 2024 10:13:37 GMT
- Title: 3D-RPE: Enhancing Long-Context Modeling Through 3D Rotary Position Encoding
- Authors: Xindian Ma, Wenyuan Liu, Peng Zhang, Nan Xu
- Abstract summary: We propose a novel rotary position encoding on a three-dimensional sphere, named 3D Rotary Position Encoding (3D-RPE).
3D-RPE is an advanced version of the widely used 2D Rotary Position Encoding (RoPE).
For controllable long-term decay, 3D-RPE allows for the regulation of long-term decay within the chunk size.
For enhanced position resolution, 3D-RPE can mitigate the degradation of position resolution caused by position interpolation on RoPE.
- Score: 12.335958945925437
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inspired by the Bloch Sphere representation, we propose a novel rotary position encoding on a three-dimensional sphere, named 3D Rotary Position Encoding (3D-RPE). 3D-RPE is an advanced version of the widely used 2D Rotary Position Encoding (RoPE), with two major advantages for modeling long contexts: controllable long-term decay and improved position resolution. For controllable long-term decay, 3D-RPE allows for the regulation of long-term decay within the chunk size, ensuring the modeling of relative positional information between tokens at a distant relative position. For enhanced position resolution, 3D-RPE can mitigate the degradation of position resolution caused by position interpolation on RoPE. We have conducted experiments on long-context Natural Language Understanding (NLU) and long-sequence Language Modeling (LM) tasks. From the experimental results, 3D-RPE achieved performance improvements over RoPE, especially in long-context NLU tasks.
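As background for two terms used in the abstract, here is a minimal sketch of standard 2D RoPE and of linear position interpolation. It is not an implementation of 3D-RPE, whose construction is not described on this page. The function names, the base of 10000, and the lengths 2048 and 8192 are illustrative assumptions.

```python
import math

def rope_angles(position: float, dim: int, base: float = 10000.0) -> list[float]:
    """Standard RoPE angles for one position: position * base**(-2i/dim)."""
    return [position * base ** (-2.0 * i / dim) for i in range(dim // 2)]

def apply_rope(vec: list[float], position: float, base: float = 10000.0) -> list[float]:
    """Rotate each consecutive pair (vec[2i], vec[2i+1]) by its RoPE angle."""
    out: list[float] = []
    for i, theta in enumerate(rope_angles(position, len(vec), base)):
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def interpolated_position(position: float, train_len: int, target_len: int) -> float:
    """Linear position interpolation: squeeze [0, target_len) into [0, train_len)."""
    return position * train_len / target_len

# Hypothetical query/key vectors, chosen only for illustration.
q = [1.0, 0.0, 0.5, -0.5]
k = [0.3, 0.7, -0.2, 0.1]

# The attention score depends only on the relative offset (3 in both cases).
score_a = dot(apply_rope(q, 5), apply_rope(k, 2))
score_b = dot(apply_rope(q, 13), apply_rope(k, 10))

# Extending a 2048-token model to 8192 tokens shrinks the angular step
# between neighbouring tokens by a factor of 4.
step_plain = rope_angles(1, 4)[0]
step_interp = rope_angles(interpolated_position(1, 2048, 8192), 4)[0]
```

With interpolation, adjacent positions are separated by a smaller rotation, which is the loss of position resolution that the abstract says 3D-RPE can mitigate.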
Related papers
- HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation [19.42279057349193]
Many positional encodings (PEs) are designed to exhibit long-term decay, based on an entrenched and long-standing inductive opinion.
We argue that long-term decay is outdated in the era of LLMs, as LLMs are now applied to tasks demanding precise retrieval of in-context information.
arXiv Detail & Related papers (2024-10-28T17:01:52Z)
- SparseFusion: Efficient Sparse Multi-Modal Fusion Framework for Long-Range 3D Perception [47.000734648271006]
We introduce SparseFusion, a novel multi-modal fusion framework built upon sparse 3D features to facilitate efficient long-range perception.
The proposed module introduces sparsity from both semantic and geometric aspects, filling only the grids in which foreground objects potentially reside.
On the long-range Argoverse2 dataset, SparseFusion reduces memory footprint and accelerates the inference by about two times compared to dense detectors.
arXiv Detail & Related papers (2024-03-15T05:59:10Z)
- NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized Device Coordinates Space [77.6067460464962]
Monocular 3D Semantic Scene Completion (SSC) has garnered significant attention in recent years due to its potential to predict complex semantics and geometry shapes from a single image, requiring no 3D inputs.
We identify several critical issues in current state-of-the-art methods, including the Feature Ambiguity of projected 2D features in the ray to the 3D space, the Pose Ambiguity of the 3D convolution, and the Imbalance in the 3D convolution across different depth levels.
We devise a novel Normalized Device Coordinates scene completion network (NDC-Scene) that directly extends the 2
arXiv Detail & Related papers (2023-09-26T02:09:52Z)
- MonoPGC: Monocular 3D Object Detection with Pixel Geometry Contexts [6.639648061168067]
We propose MonoPGC, a novel end-to-end Monocular 3D object detection framework with rich Pixel Geometry Contexts.
We introduce pixel depth estimation as our auxiliary task and design a depth cross-attention pyramid module (DCPM) to inject local and global depth geometry knowledge into visual features.
In addition, we present the depth-space-aware transformer (DSAT) to integrate 3D space position and depth-aware features efficiently.
arXiv Detail & Related papers (2023-02-21T09:21:58Z)
- Focal-PETR: Embracing Foreground for Efficient Multi-Camera 3D Object Detection [11.13693561702228]
The dominant multi-camera 3D detection paradigm is based on explicit 3D feature construction.
Other methods implicitly introduce geometric positional encoding to build the relationship between image tokens and 3D objects.
We propose Focal-PETR with instance-guided supervision and a spatial alignment module.
arXiv Detail & Related papers (2022-12-11T13:38:54Z)
- PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images [105.29493158036105]
PETRv2 is a unified framework for 3D perception from multi-view images.
We extend the 3D position embedding in PETR for temporal modeling.
PETRv2 achieves state-of-the-art performance on 3D object detection and BEV segmentation.
arXiv Detail & Related papers (2022-06-02T19:13:03Z)
- Pyramid R-CNN: Towards Better Performance and Adaptability for 3D Object Detection [89.66162518035144]
We present a flexible and high-performance framework, named Pyramid R-CNN, for two-stage 3D object detection from point clouds.
We propose a novel second-stage module, named pyramid RoI head, to adaptively learn the features from the sparse points of interest.
Our pyramid RoI head is robust to the sparse and imbalanced circumstances, and can be applied upon various 3D backbones to consistently boost the detection performance.
arXiv Detail & Related papers (2021-09-06T14:17:51Z)
- Learning Anchored Unsigned Distance Functions with Gradient Direction Alignment for Single-view Garment Reconstruction [92.23666036481399]
We propose a novel learnable Anchored Unsigned Distance Function (AnchorUDF) representation for 3D garment reconstruction from a single image.
AnchorUDF represents 3D shapes by predicting unsigned distance fields (UDFs) to enable open garment surface modeling at arbitrary resolution.
arXiv Detail & Related papers (2021-08-19T03:45:38Z)
- Multi-View Multi-Person 3D Pose Estimation with Plane Sweep Stereo [71.59494156155309]
Existing approaches for multi-view 3D pose estimation explicitly establish cross-view correspondences to group 2D pose detections from multiple camera views.
We present our multi-view 3D pose estimation approach based on plane sweep stereo to jointly address the cross-view fusion and 3D pose reconstruction in a single shot.
arXiv Detail & Related papers (2021-04-06T03:49:35Z)
- End-to-End Pseudo-LiDAR for Image-Based 3D Object Detection [62.34374949726333]
Pseudo-LiDAR (PL) has led to a drastic reduction in the accuracy gap between methods based on LiDAR sensors and those based on cheap stereo cameras.
PL combines state-of-the-art deep neural networks for 3D depth estimation with those for 3D object detection by converting 2D depth map outputs to 3D point cloud inputs.
We introduce a new framework based on differentiable Change of Representation (CoR) modules that allow the entire PL pipeline to be trained end-to-end.
arXiv Detail & Related papers (2020-04-07T02:18:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.