VideoRoPE: What Makes for Good Video Rotary Position Embedding?
- URL: http://arxiv.org/abs/2502.05173v1
- Date: Fri, 07 Feb 2025 18:56:04 GMT
- Title: VideoRoPE: What Makes for Good Video Rotary Position Embedding?
- Authors: Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, Dahua Lin
- Abstract summary: VideoRoPE consistently surpasses previous RoPE variants across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination.
VideoRoPE features low-frequency temporal allocation to mitigate periodic oscillations, a diagonal layout to maintain spatial symmetry, and adjustable temporal spacing to decouple temporal and spatial indexing.
- Score: 109.88966080843608
- License:
- Abstract: While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of the 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully considered in prior work. As part of our analysis, we introduce a challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task, which adds periodic distractors into V-NIAH. The V-NIAH-D task demonstrates that previous RoPE variants, lacking appropriate temporal dimension allocation, are easily misled by distractors. Based on our analysis, we introduce VideoRoPE, with a 3D structure designed to preserve spatio-temporal relationships. VideoRoPE features low-frequency temporal allocation to mitigate periodic oscillations, a diagonal layout to maintain spatial symmetry, and adjustable temporal spacing to decouple temporal and spatial indexing. VideoRoPE consistently surpasses previous RoPE variants across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination. Our code will be available at https://github.com/Wiselnn570/VideoRoPE.
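The abstract names three concrete ingredients on top of a 3D (t, x, y) index structure: low-frequency temporal allocation, a diagonal layout, and adjustable temporal spacing. The sketch below is a minimal, hypothetical illustration of how such a scheme could be wired up; the function names, the `delta` default, and the centering of x/y around the frame index are assumptions made for illustration, not the authors' released implementation.

```python
# Illustrative sketch only (assumed names and defaults), not the VideoRoPE code.
import torch

def rope_frequencies(head_dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE inverse frequencies; index 0 rotates fastest (highest frequency)."""
    return base ** (-torch.arange(0, head_dim, 2).float() / head_dim)

def split_low_freq_temporal(inv_freq: torch.Tensor, n_temporal: int):
    """Low-frequency temporal allocation: hand the slowest-rotating channels
    (the tail of inv_freq) to the temporal axis, the rest to the x/y axes."""
    spatial = inv_freq[: inv_freq.numel() - n_temporal]
    temporal = inv_freq[inv_freq.numel() - n_temporal:]
    return temporal, spatial

def build_video_positions(n_frames: int, h: int, w: int, delta: float = 2.0) -> torch.Tensor:
    """Diagonal layout with adjustable temporal spacing (illustrative):
    the temporal index advances by `delta` per frame, while x/y offsets are
    centered around that index so the video stays on the text 'diagonal'."""
    t = torch.arange(n_frames).float() * delta                      # (T,)
    ys, xs = torch.meshgrid(torch.arange(h).float(),
                            torch.arange(w).float(), indexing="ij")  # (H, W) each
    t_idx = t.view(-1, 1, 1).expand(-1, h, w)                        # (T, H, W)
    x_idx = t_idx + (xs - (w - 1) / 2)
    y_idx = t_idx + (ys - (h - 1) / 2)
    return torch.stack([t_idx, x_idx, y_idx], dim=-1).reshape(-1, 3)  # (T*H*W, 3)

inv_freq = rope_frequencies(head_dim=128)
t_freq, s_freq = split_low_freq_temporal(inv_freq, n_temporal=16)
pos = build_video_positions(n_frames=4, h=3, w=3, delta=2.0)
print(pos.shape, t_freq.numel(), s_freq.numel())  # torch.Size([36, 3]) 16 48
```

Under these assumptions, increasing `delta` stretches only the temporal index, which is one way to decouple temporal spacing from the fixed spatial grid.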
Related papers
- VRoPE: Rotary Position Embedding for Video Large Language Models [14.292586301871196]
Rotary Position Embedding (RoPE) has shown strong performance in text-based Large Language Models (LLMs).
Video adaptations, such as RoPE-3D, attempt to encode spatial and temporal dimensions separately but suffer from two major limitations.
We propose Video Rotary Position Embedding (VRoPE), a novel positional encoding method tailored for Video-LLMs.
arXiv Detail & Related papers (2025-02-17T10:53:57Z) - HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation [19.42279057349193]
Many positional encodings (PEs) are designed to exhibit long-term decay, based on an entrenched and long-standing inductive opinion.
We argue that long-term decay is outdated in the era of LLMs, as LLMs are now applied to tasks demanding precise retrieval of in-context information.
arXiv Detail & Related papers (2024-10-28T17:01:52Z) - Round and Round We Go! What makes Rotary Positional Encodings useful? [15.543752938828831]
We study the internals of a trained Gemma 7B model to understand how RoPE is being used at a mechanical level.
We find that Gemma learns to use RoPE to construct robust "positional" attention patterns by exploiting the highest frequencies (a generic 1D RoPE sketch illustrating these frequency bands appears after this list).
We propose a modification of RoPE that fixes some highlighted issues and improves performance.
arXiv Detail & Related papers (2024-10-08T17:07:01Z) - 3D-RPE: Enhancing Long-Context Modeling Through 3D Rotary Position Encoding [12.335958945925437]
We propose a novel rotary position encoding on a three-dimensional sphere, named 3D Rotary Position Encoding (3D-RPE).
3D-RPE is an advanced version of the widely used 2D Rotary Position Encoding (RoPE).
For controllable long-term decay, 3D-RPE allows for the regulation of long-term decay within the chunk size.
For enhanced position resolution, 3D-RPE can mitigate the degradation of position resolution caused by position interpolation on RoPE.
arXiv Detail & Related papers (2024-06-14T10:13:37Z) - Rotary Position Embedding for Vision Transformer [44.27871591624888]
This study provides a comprehensive analysis of Rotary Position Embedding (RoPE) when applied to Vision Transformer (ViT)
RoPE demonstrates impressive extrapolation performance, i.e., maintaining precision while increasing image resolution at inference.
It eventually leads to performance improvement for ImageNet-1k, COCO detection, and ADE-20k segmentation.
arXiv Detail & Related papers (2024-03-20T04:47:13Z) - Towards Video Anomaly Retrieval from Video Anomaly Detection: New
Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has been paid increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos by cross-modalities.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z) - Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding
in Long Videos [60.86880787242561]
Video temporal grounding aims to pinpoint a video segment that matches the query description.
We propose an end-to-end framework for fast temporal grounding, which is able to model an hours-long video with one-time network execution.
Our method significantly outperforms state-of-the-art methods, and achieves 14.6× / 102.8× higher efficiency, respectively.
arXiv Detail & Related papers (2023-03-15T03:54:43Z) - Temporal RoI Align for Video Object Recognition [107.07049115214924]
The proposed Temporal RoI Align operator can extract temporal information from the entire video for proposals.
We integrate it into single-frame video detectors and other state-of-the-art video detectors, and conduct quantitative experiments to demonstrate that the proposed Temporal RoI Align operator can consistently and significantly boost the performance.
arXiv Detail & Related papers (2021-09-08T08:35:21Z) - QVHighlights: Detecting Moments and Highlights in Videos via Natural
Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
arXiv Detail & Related papers (2021-07-20T16:42:58Z) - Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form
Sentences [107.0776836117313]
Given an untrimmed video and a declarative/interrogative sentence, STVG aims to localize the spatio-temporal tube of the queried object.
Existing methods cannot tackle the STVG task due to the ineffective tube pre-generation and the lack of novel object relationship modeling.
We present a Spatio-Temporal Graph Reasoning Network (STGRN) for this task.
arXiv Detail & Related papers (2020-01-19T19:53:22Z)
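Several entries above (the Gemma RoPE analysis and HoPE in particular) reason about which rotary frequencies carry positional signal. The sketch below is a generic, textbook-style 1D RoPE application, not any of the listed papers' code; it only makes explicit that the first channel pairs rotate fastest (the "highest frequencies") and the last pairs slowest.

```python
# Generic 1D RoPE sketch for reference; not taken from any listed paper.
import torch

def apply_rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of x (..., seq, head_dim) by position-dependent angles.
    Pair 0 uses the highest frequency; the last pair uses the lowest."""
    head_dim = x.shape[-1]
    inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)  # (head_dim/2,)
    angles = positions.float()[..., None] * inv_freq                       # (seq, head_dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                                    # even/odd channels
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(1, 8, 64)            # (batch, seq, head_dim)
q_rot = apply_rope(q, torch.arange(8))
print(q_rot.shape)                   # torch.Size([1, 8, 64])
```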
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.