PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs
- URL: http://arxiv.org/abs/2511.10979v1
- Date: Fri, 14 Nov 2025 05:56:47 GMT
- Title: PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs
- Authors: Bowen Sun, Yujun Cai, Ming-Hsuan Yang, Hang Wu, Yiwei Wang,
- Abstract summary: Video LLMs suffer from temporal inconsistency: small shifts in frame timing can flip attention and suppress relevant frames.<n>We present Phase Aggregated Smoothing (PAS), a training-free mechanism that applies small opposed phase offsets across heads and then aggregates their outputs.<n>Our analysis shows that the RoPE rotated logit can be approximated as a content dot product scaled by a time kernel; smoothing this kernel yields Lipschitz stability of attention to small temporal shifts; multi phase averaging attenuates high frequency ripples while preserving per-head spectra under Nyquist-valid sampling.
- Score: 57.790910044227935
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video LLMs suffer from temporal inconsistency: small shifts in frame timing can flip attention and suppress relevant frames. We trace this instability to the common extension of Rotary Position Embeddings to video through multimodal RoPE. The induced inverse Fourier time kernel exhibits frame-scale ripples that multiply adjacent frames by different factors, which perturbs attention that should otherwise be governed by the raw query key inner product. We present Phase Aggregated Smoothing (PAS), a simple, training-free mechanism that applies small opposed phase offsets across heads and then aggregates their outputs. PAS preserves the per-head spectrum magnitude, while the aggregation effectively smooths the temporal kernel and reduces phase sensitivity without changing the positional encoding structure. Our analysis shows that the RoPE rotated logit can be approximated as a content dot product scaled by a time kernel; smoothing this kernel yields Lipschitz stability of attention to small temporal shifts; multi phase averaging attenuates high frequency ripples while preserving per-head spectra under Nyquist-valid sampling. Experiments on multiple video understanding benchmarks under matched token budgets show consistent improvements with negligible computational overhead. PAS provides a plug and play upgrade for robust temporal encoding in Video LLMs.
Related papers
- Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers [95.68243351895107]
We propose a holistic, video-centric paradigm named textbfLocal textbfDiffusion textbfForcing for textbfVideo textbfFrame textbfInterpolation (LDF-VFI)<n>Our framework is built upon an auto-regressive diffusion transformer that models the entire video sequence to ensure long-range temporal coherence.<n>LDF-VFI achieves state-of-the-art performance on challenging long-sequence benchmarks, demonstrating superior per
arXiv Detail & Related papers (2026-01-21T12:58:52Z) - DiTVR: Zero-Shot Diffusion Transformer for Video Restoration [48.97196894658511]
DiTVR is a zero shot video restoration framework that couples a diffusion transformer with trajectory aware attention and a flow consistent sampler.<n>Our attention mechanism aligns tokens along optical flow trajectories, with particular emphasis on vital layers that exhibit the highest sensitivity to temporal dynamics.<n>The flow guided sampler injects data consistency only into low-frequency bands, preserving high frequency priors while accelerating cache.
arXiv Detail & Related papers (2025-08-11T09:54:45Z) - VFRTok: Variable Frame Rates Video Tokenizer with Duration-Proportional Information Assumption [19.819820303839613]
Existing tokenizers provide fixed temporal compression rates, causing the computational cost of the diffusion model to scale linearly with the frame rate.<n>The paper introduces VFRTok, a Transformer-based video tokenizer, that enables variable frame rate encoding and decoding.<n> VFRTok achieves competitive reconstruction quality and state-of-the-art video fidelity while using only 1/8 tokens compared to existing tokenizers.
arXiv Detail & Related papers (2025-05-17T15:32:54Z) - STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs.<n>We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - SNeRV: Spectra-preserving Neural Representation for Video [8.978061470104532]
We propose spectra-preserving NeRV (SNeRV) as a novel approach to enhance implicit video representations.<n>In this paper, we use 2D discrete wavelet transform (DWT) to decompose video into low-frequency (LF) and high-frequency (HF) features.<n>We demonstrate that SNeRV outperforms existing NeRV models in capturing fine details and achieves enhanced reconstruction.
arXiv Detail & Related papers (2025-01-03T07:57:38Z) - Enhancing Long Video Generation Consistency without Tuning [92.1714656167712]
We address issues to enhance the consistency and coherence of videos generated with either single or multiple prompts.<n>We propose the Time-frequency based temporal Attention Reweighting Algorithm (TiARA), which judiciously edits the attention score matrix.<n>For videos generated by multiple prompts, we further uncover key factors such as the alignment of the prompts affecting prompt quality.<n>Inspired by our analyses, we propose PromptBlend, an advanced prompt pipeline that systematically aligns the prompts.
arXiv Detail & Related papers (2024-12-23T03:56:27Z) - Alignment-free Raw Video Demoireing [18.06907326360215]
Video demoireing aims to remove undesirable interference patterns that arise during the capture of screen content.<n>This paper introduces a novel alignment-free raw video demoireing network with frequency-assisted temporal Mamba (DemMamba)<n>It surpasses state-of-the-art methods by 1.6 dB in PSNR, and also delivers a satisfactory visual experience.
arXiv Detail & Related papers (2024-08-20T09:31:03Z) - GPU-accelerated SIFT-aided source identification of stabilized videos [63.084540168532065]
We exploit the parallelization capabilities of Graphics Processing Units (GPUs) in the framework of stabilised frames inversion.
We propose to exploit SIFT features.
to estimate the camera momentum and %to identify less stabilized temporal segments.
Experiments confirm the effectiveness of the proposed approach in reducing the required computational time and improving the source identification accuracy.
arXiv Detail & Related papers (2022-07-29T07:01:31Z) - Flow-Guided Sparse Transformer for Video Deblurring [124.11022871999423]
FlowGuided Sparse Transformer (F GST) is a framework for video deblurring.
FGSW-MSA enjoys the guidance of the estimated optical flow to globally sample spatially sparse elements corresponding to the same scene patch in neighboring frames.
Our proposed F GST outperforms state-of-the-art patches on both DVD and GOPRO datasets and even yields more visually pleasing results in real video deblurring.
arXiv Detail & Related papers (2022-01-06T02:05:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.