Rotary Position Embedding for Vision Transformer
- URL: http://arxiv.org/abs/2403.13298v2
- Date: Tue, 16 Jul 2024 04:54:56 GMT
- Title: Rotary Position Embedding for Vision Transformer
- Authors: Byeongho Heo, Song Park, Dongyoon Han, Sangdoo Yun
- Abstract summary: This study provides a comprehensive analysis of Rotary Position Embedding (RoPE) when applied to Vision Transformer (ViT).
RoPE demonstrates impressive extrapolation performance, i.e., maintaining precision while increasing image resolution at inference.
It eventually leads to performance improvement for ImageNet-1k, COCO detection, and ADE-20k segmentation.
- Score: 44.27871591624888
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Rotary Position Embedding (RoPE) performs remarkably on language models, especially for length extrapolation of Transformers. However, the impacts of RoPE on computer vision domains have been underexplored, even though RoPE appears capable of enhancing Vision Transformer (ViT) performance in a way similar to the language domain. This study provides a comprehensive analysis of RoPE when applied to ViTs, utilizing practical implementations of RoPE for 2D vision data. The analysis reveals that RoPE demonstrates impressive extrapolation performance, i.e., maintaining precision while increasing image resolution at inference. It eventually leads to performance improvement for ImageNet-1k, COCO detection, and ADE-20k segmentation. We believe this study provides thorough guidelines to apply RoPE into ViT, promising improved backbone performance with minimal extra computational overhead. Our code and pre-trained models are available at https://github.com/naver-ai/rope-vit
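The practical 2D implementations referenced in the abstract are available in the released code at https://github.com/naver-ai/rope-vit. As a minimal sketch only, the snippet below illustrates one common axial formulation of 2D RoPE for ViT patch tokens, assuming half of each head dimension is rotated by the patch x-coordinate and half by the y-coordinate; the function name, base value, and frequency schedule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def axial_2d_rope(q, k, grid_h, grid_w, base=100.0):
    """Apply an axial 2D RoPE sketch to per-head query/key tensors.

    q, k: shape (num_tokens, head_dim), num_tokens == grid_h * grid_w,
    head_dim divisible by 4 (half the channels for x, half for y,
    each split into rotated cos/sin pairs). The base and frequency
    schedule here are illustrative, not the released configuration.
    """
    n, d = q.shape
    assert n == grid_h * grid_w and d % 4 == 0

    # Patch coordinates for every token, row-major over the grid.
    ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    xs, ys = xs.reshape(-1), ys.reshape(-1)          # (n,)

    # One frequency per rotated channel pair; d//4 pairs per axis.
    freqs = base ** (-np.arange(d // 4) / (d // 4))  # (d//4,)
    ang_x = xs[:, None] * freqs[None, :]             # (n, d//4)
    ang_y = ys[:, None] * freqs[None, :]             # (n, d//4)
    angles = np.concatenate([ang_x, ang_y], axis=1)  # (n, d//2)

    def rotate(t):
        # Treat even/odd channels as 2D pairs and rotate each pair.
        t1, t2 = t[:, 0::2], t[:, 1::2]
        cos, sin = np.cos(angles), np.sin(angles)
        out = np.empty_like(t)
        out[:, 0::2] = t1 * cos - t2 * sin
        out[:, 1::2] = t1 * sin + t2 * cos
        return out

    return rotate(q), rotate(k)

# Usage on a toy 4x4 patch grid with head_dim = 16:
q = np.random.randn(16, 16)
k = np.random.randn(16, 16)
q_rot, k_rot = axial_2d_rope(q, k, grid_h=4, grid_w=4)
attn_logits = q_rot @ k_rot.T  # positions now enter attention as relative offsets
```

Because the rotation depends only on token coordinates and not on a fixed grid size, the same function can be evaluated on a larger grid at inference, which is the resolution-extrapolation setting studied in the paper.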
Related papers
- VideoRoPE: What Makes for Good Video Rotary Position Embedding? [109.88966080843608]
VideoRoPE consistently surpasses previous RoPE variants across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination.
VideoRoPE features low-frequency temporal allocation to mitigate periodic oscillations, a diagonal layout to maintain spatial symmetry, and adjustable temporal spacing to decouple temporal and spatial indexing.
arXiv Detail & Related papers (2025-02-07T18:56:04Z) - Benchmarking Rotary Position Embeddings for Automatic Speech Recognition [17.360059094663182]
Rotary Position Embedding (RoPE) encodes relative and absolute positional information in Transformer-based models.
RoPE consistently achieves lower error rates compared to the currently widely used relative positional embedding.
To facilitate further research, we release the implementation and all experimental recipes through the SpeechBrain toolkit.
arXiv Detail & Related papers (2025-01-10T15:30:46Z) - Round and Round We Go! What makes Rotary Positional Encodings useful? [15.543752938828831]
We study the internals of a trained Gemma 7B model to understand how RoPE is being used at a mechanical level.
We find that Gemma learns to use RoPE to construct robust "positional" attention patterns by exploiting the highest frequencies.
We propose a modification of RoPE that fixes some highlighted issues and improves performance.
arXiv Detail & Related papers (2024-10-08T17:07:01Z) - Scaling Laws of RoPE-based Extrapolation [103.33995311915864]
We propose Scaling Laws of RoPE-based Extrapolation to describe the relationship between the extrapolation performance and the base value.
We achieve extrapolation up to a context length of 1 million tokens with only a 16K training length on LLaMA2 7B and 13B.
arXiv Detail & Related papers (2023-10-08T15:50:36Z) - Lightweight Vision Transformer with Bidirectional Interaction [63.65115590184169]
We propose a Fully Adaptive Self-Attention (FASA) mechanism for vision transformers to model local and global information.
Based on FASA, we develop a family of lightweight vision backbones, the Fully Adaptive Transformer (FAT) family.
arXiv Detail & Related papers (2023-06-01T06:56:41Z) - A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Vision Transformers have recently shown great promise for many vision tasks due to their insightful architecture design and attention mechanism.
We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate the global context into convolutions.
With less than 14M parameters, our FCViT-S12 outperforms the related ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-12-23T19:13:43Z) - Rethinking and Improving Relative Position Encoding for Vision Transformer [61.559777439200744]
Relative position encoding (RPE) is important for Transformers to capture the sequence ordering of input tokens.
We propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE).
arXiv Detail & Related papers (2021-07-29T17:55:10Z) - Conformer-based End-to-end Speech Recognition With Rotary Position Embedding [11.428057887454008]
We introduce rotary position embedding (RoPE) into the convolution-augmented Transformer (Conformer).
RoPE encodes absolute positional information into the input sequence by a rotation matrix, and then naturally incorporates explicit relative position information into a self-attention module.
Our model achieves a relative word error rate reduction of 8.70% and 7.27% over the conformer on test-clean and test-other sets of the LibriSpeech corpus respectively.
arXiv Detail & Related papers (2021-07-13T08:07:22Z) - RoFormer: Enhanced Transformer with Rotary Position Embedding [9.01819510933327]
We propose a novel method named Rotary Position Embedding (RoPE) to effectively leverage positional information.
RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in the self-attention formulation (see the sketch after this list).
We evaluate the enhanced transformer with rotary position embedding, also called RoFormer, on various long text classification benchmark datasets.
arXiv Detail & Related papers (2021-04-20T09:54:06Z)
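The rotation-matrix view described in the Conformer and RoFormer summaries above can be checked numerically. Below is a minimal sketch of standard 1D RoPE, assuming the usual pairwise rotation with a base-10000 frequency schedule; `rope_rotate` is a hypothetical helper, not code from either paper. It demonstrates that the dot product of a rotated query and key depends only on their relative offset.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate vector x (even length) pairwise by angles pos * freqs (standard 1D RoPE sketch)."""
    d = x.shape[0]
    freqs = base ** (-np.arange(d // 2) / (d // 2))  # one frequency per channel pair
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# Two (query, key) position pairs with the same relative offset m - n = 3:
s_a = rope_rotate(q, 10) @ rope_rotate(k, 7)
s_b = rope_rotate(q, 110) @ rope_rotate(k, 107)
print(np.isclose(s_a, s_b))  # True: the attention score depends only on m - n
```

This relative-dependence property is what makes RoPE attractive for extrapolation, since unseen absolute positions still produce familiar relative offsets.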