VRoPE: Rotary Position Embedding for Video Large Language Models
- URL: http://arxiv.org/abs/2502.11664v1
- Date: Mon, 17 Feb 2025 10:53:57 GMT
- Title: VRoPE: Rotary Position Embedding for Video Large Language Models
- Authors: Zikang Liu, Longteng Guo, Yepeng Tang, Junxian Cai, Kai Ma, Xi Chen, Jing Liu
- Abstract summary: Rotary Position Embedding (RoPE) has shown strong performance in text-based Large Language Models (LLMs). Video adaptations, such as RoPE-3D, attempt to encode spatial and temporal dimensions separately but suffer from two major limitations. We propose Video Rotary Position Embedding (VRoPE), a novel positional encoding method tailored for Video-LLMs.
- Score: 14.292586301871196
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Rotary Position Embedding (RoPE) has shown strong performance in text-based Large Language Models (LLMs), but extending it to video remains a challenge due to the intricate spatiotemporal structure of video frames. Existing adaptations, such as RoPE-3D, attempt to encode spatial and temporal dimensions separately but suffer from two major limitations: positional bias in attention distribution and disruptions in video-text transitions. To overcome these issues, we propose Video Rotary Position Embedding (VRoPE), a novel positional encoding method tailored for Video-LLMs. Our approach restructures positional indices to preserve spatial coherence and ensure a smooth transition between video and text tokens. Additionally, we introduce a more balanced encoding strategy that mitigates attention biases, ensuring a more uniform distribution of spatial focus. Extensive experiments on Vicuna and Qwen2 across different model scales demonstrate that VRoPE consistently outperforms previous RoPE variants, achieving significant improvements in video understanding, temporal reasoning, and retrieval tasks. Code will be available at https://github.com/johncaged/VRoPE
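The abstract describes the key idea, restructured positional indices that keep video tokens spatially coherent and let text tokens continue smoothly after the video block, without giving the exact construction. The following is a minimal sketch of that general idea only; the (t, h, w) grid layout, the choice to continue text positions from the maximum video index, and all names are illustrative assumptions rather than the paper's formulation.

```python
import torch

def video_text_position_ids(t: int, h: int, w: int, n_text: int) -> torch.Tensor:
    """Illustrative 3-D position indices for video tokens followed by text tokens.

    Video tokens get (temporal, height, width) indices; text tokens reuse a single
    shared index per token, continued from the end of the video block so the
    video->text transition stays monotone. A sketch of the general idea, not the
    exact VRoPE construction.
    """
    # (t*h*w, 3) grid of integer coordinates for the video tokens
    tt, hh, ww = torch.meshgrid(
        torch.arange(t), torch.arange(h), torch.arange(w), indexing="ij"
    )
    video_pos = torch.stack([tt, hh, ww], dim=-1).reshape(-1, 3)

    # Text tokens continue from just past the largest video index so that relative
    # distances grow smoothly instead of jumping at the video-text boundary.
    start = int(video_pos.max().item()) + 1
    text_scalar = torch.arange(start, start + n_text)
    text_pos = text_scalar.unsqueeze(-1).repeat(1, 3)  # same index on all 3 axes

    return torch.cat([video_pos, text_pos], dim=0)  # (t*h*w + n_text, 3)

# Example: a 4-frame, 2x3-patch video followed by 5 text tokens
pos = video_text_position_ids(t=4, h=2, w=3, n_text=5)
print(pos.shape)  # torch.Size([29, 3])
```

In a RoPE-3D-style scheme the rotary rotations would then be driven by these per-axis indices; the point of the sketch is only that relative distances do not jump when attention crosses the video-text boundary.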
Related papers
- VGDFR: Diffusion-based Video Generation with Dynamic Latent Frame Rate [16.826081397057774]
VGDFR is a training-free approach for Diffusion-based Video Generation with Dynamic Latent Frame Rate.
We show that VGDFR can achieve a speedup of up to 3x for video generation with minimal quality degradation.
arXiv Detail & Related papers (2025-04-16T17:09:13Z)
- Exploiting Temporal State Space Sharing for Video Semantic Segmentation [53.8810901249897]
Video semantic segmentation (VSS) plays a vital role in understanding the temporal evolution of scenes.
Traditional methods often segment videos frame-by-frame or in a short temporal window, leading to limited temporal context, redundant computations, and heavy memory requirements.
We introduce a Temporal Video State Space Sharing architecture to leverage Mamba state space models for temporal feature sharing.
Our model features a selective gating mechanism that efficiently propagates relevant information across video frames, eliminating the need for a memory-heavy feature pool (a rough gating sketch follows this entry).
arXiv Detail & Related papers (2025-03-26T01:47:42Z)
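The selective gating in the entry above is described only at a high level. As a hedged illustration of that kind of mechanism (not the paper's Mamba-based architecture), a single running state can be blended with each frame's features through a learned gate:

```python
import torch
import torch.nn as nn

class GatedTemporalPropagation(nn.Module):
    """Toy selective gate that carries one running state across frames.

    A hypothetical stand-in for temporal state sharing: instead of storing a pool
    of past features, each frame updates a single state via a per-channel gate.
    Not the paper's actual design.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, N, dim) = T frames of N token features each
        state = torch.zeros_like(frames[0])
        outputs = []
        for x in frames:                        # iterate over time
            g = torch.sigmoid(self.gate(torch.cat([x, state], dim=-1)))
            state = g * x + (1 - g) * state     # selectively propagate information
            outputs.append(state)
        return torch.stack(outputs)             # (T, N, dim)

# Example: 8 frames, 196 tokens per frame, 256-dim features
out = GatedTemporalPropagation(256)(torch.randn(8, 196, 256))
print(out.shape)  # torch.Size([8, 196, 256])
```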
- HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models [63.65066762436074]
HiTVideo aims to address the potential limitations of existing video tokenizers in text-to-video generation tasks.
It utilizes a 3D causal VAE with a multi-layer discrete token framework, encoding video content into hierarchically structured codebooks.
arXiv Detail & Related papers (2025-03-14T15:36:39Z)
- VideoRoPE: What Makes for Good Video Rotary Position Embedding? [109.88966080843608]
VideoRoPE consistently surpasses previous RoPE variants across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination. VideoRoPE features low-frequency temporal allocation to mitigate periodic oscillations, a diagonal layout to maintain spatial symmetry, and adjustable temporal spacing to decouple temporal and spatial indexing.
arXiv Detail & Related papers (2025-02-07T18:56:04Z)
- LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos.
We leverage DINOv2 features to remove redundant frames that exhibit high similarity (a rough pruning sketch follows this entry).
We perform spatial token reduction across frames based on their temporal dependencies.
arXiv Detail & Related papers (2024-10-22T21:21:37Z)
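LongVU's frame-removal step is described above only at a high level. A minimal sketch of similarity-based frame pruning is given below, assuming redundancy is judged by cosine similarity between consecutive frames' pooled features; the threshold and the use of pooled features are illustrative, and DINOv2 here stands in for any frame-level encoder.

```python
import torch

def prune_redundant_frames(frame_feats: torch.Tensor, threshold: float = 0.9) -> list:
    """Keep a frame only if it differs enough from the last kept frame.

    frame_feats: (T, D) pooled per-frame features (e.g. from a DINOv2-style encoder).
    The cosine-similarity rule and threshold are illustrative assumptions, not
    LongVU's exact criterion.
    """
    keep = [0]                                   # always keep the first frame
    last = frame_feats[0]
    for i in range(1, frame_feats.shape[0]):
        sim = torch.cosine_similarity(frame_feats[i], last, dim=0)
        if sim < threshold:                      # sufficiently different -> keep it
            keep.append(i)
            last = frame_feats[i]
    return keep

# Example: 32 frames with 768-dim pooled features
kept = prune_redundant_frames(torch.randn(32, 768))
print(len(kept), "frames kept out of 32")
```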
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of its dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding [20.16000249533665]
TESTA condenses video semantics by adaptively aggregating similar frames, as well as similar patches within each frame (an aggregation sketch follows this entry).
Building upon TESTA, we introduce a pre-trained video-language model equipped with a divided space-time token aggregation module in each video block.
We evaluate our model on five datasets for paragraph-to-video retrieval and long-form VideoQA tasks.
arXiv Detail & Related papers (2023-10-29T16:25:32Z)
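TESTA is summarized above as adaptively aggregating similar frames and similar patches within frames. A hedged sketch of that idea (not TESTA's actual algorithm) is to repeatedly average the most similar adjacent pair of tokens until a target count is reached:

```python
import torch

def merge_most_similar_neighbors(tokens: torch.Tensor, target_len: int) -> torch.Tensor:
    """Greedy aggregation sketch: repeatedly average the most similar adjacent pair.

    tokens: (N, D) token features along one axis (frames or patches).
    This greedy rule is an illustrative stand-in for adaptive token aggregation,
    not TESTA's exact method.
    """
    tokens = tokens.clone()
    while tokens.shape[0] > target_len:
        a, b = tokens[:-1], tokens[1:]
        sims = torch.cosine_similarity(a, b, dim=-1)   # (N-1,) neighbor similarities
        i = int(torch.argmax(sims))                    # most redundant adjacent pair
        merged = (tokens[i] + tokens[i + 1]) / 2       # average the pair into one token
        tokens = torch.cat([tokens[:i], merged.unsqueeze(0), tokens[i + 2:]], dim=0)
    return tokens

# Example: compress 16 frame tokens down to 4
out = merge_most_similar_neighbors(torch.randn(16, 64), target_len=4)
print(out.shape)  # torch.Size([4, 64])
```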
- DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing [48.086102360155856]
We introduce dynamic Neural Radiance Fields (NeRF) as an innovative video representation.
We propose an image-based video-NeRF editing pipeline with a set of innovative designs to provide consistent and controllable editing.
Our method, dubbed DynVideo-E, significantly outperforms SOTA approaches on two challenging datasets by a large margin of 50%~95% in terms of human preference.
arXiv Detail & Related papers (2023-10-16T17:48:10Z)
- Unsupervised Video Domain Adaptation for Action Recognition: A Disentanglement Perspective [37.45565756522847]
We consider the generation of cross-domain videos from two sets of latent factors.
The TranSVAE framework is then developed to model such generation.
Experiments on the UCF-HMDB, Jester, and Epic-Kitchens datasets verify the effectiveness and superiority of TranSVAE.
arXiv Detail & Related papers (2022-08-15T17:59:31Z)
- Stand-Alone Inter-Frame Attention in Video Models [164.06137994796487]
We present a new recipe for an inter-frame attention block, namely Stand-alone Inter-Frame Attention (SIFA).
SIFA remoulds the deformable design via re-scaling the offset predictions by the difference between two frames (a toy sketch of this re-scaling follows the entry).
We further plug SIFA block into ConvNets and Vision Transformer, respectively, to devise SIFA-Net and SIFA-Transformer.
arXiv Detail & Related papers (2022-06-14T15:51:28Z)
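The SIFA entry mentions re-scaling deformable offset predictions by the difference between two frames. The toy sketch below illustrates only that scaling step; the offset head, the tanh difference measure, and the tensor shapes are assumptions, and the full deformable attention sampling is omitted.

```python
import torch
import torch.nn as nn

class OffsetRescale(nn.Module):
    """Toy version of the offset re-scaling idea described for SIFA.

    Offsets predicted from the current frame are modulated by the magnitude of the
    difference between the two frames, so sampling reaches further where motion is
    large. Shapes and the tanh difference measure are illustrative assumptions.
    """

    def __init__(self, channels: int, num_offsets: int = 9):
        super().__init__()
        # predicts (dx, dy) for each of the num_offsets sampling points
        self.offset_head = nn.Conv2d(channels, 2 * num_offsets, kernel_size=3, padding=1)

    def forward(self, frame_t: torch.Tensor, frame_t1: torch.Tensor) -> torch.Tensor:
        # frame_t, frame_t1: (B, C, H, W) feature maps of two consecutive frames
        offsets = self.offset_head(frame_t)                          # (B, 2K, H, W)
        diff = (frame_t1 - frame_t).abs().mean(dim=1, keepdim=True)  # (B, 1, H, W)
        scale = torch.tanh(diff)                                     # bounded per-location scale
        return offsets * scale                                       # motion-aware offsets

# Example: 64-channel feature maps at 14x14 resolution
m = OffsetRescale(64)
out = m(torch.randn(2, 64, 14, 14), torch.randn(2, 64, 14, 14))
print(out.shape)  # torch.Size([2, 18, 14, 14])
```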