Less Is More, but Where? Dynamic Token Compression via LLM-Guided Keyframe Prior
- URL: http://arxiv.org/abs/2512.06866v1
- Date: Sun, 07 Dec 2025 14:42:10 GMT
- Title: Less Is More, but Where? Dynamic Token Compression via LLM-Guided Keyframe Prior
- Authors: Yulin Li, Haokun Gui, Ziyang Fan, Junjie Wang, Bin Kang, Bin Chen, Zhuotao Tian
- Abstract summary: We propose Dynamic Token compression via LLM-guided Keyframe prior (DyToK). Our analysis reveals that VLLM attention layers naturally encode query-conditioned keyframe priors, which DyToK uses to dynamically adjust per-frame token retention ratios. Experiments demonstrate that DyToK achieves state-of-the-art efficiency-accuracy tradeoffs.
- Score: 31.997025910713077
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent advances in Video Large Language Models (VLLMs) have achieved remarkable video understanding capabilities, yet face critical efficiency bottlenecks due to quadratic computational growth with the lengthy visual token sequences of long videos. While existing keyframe sampling methods can improve temporal modeling efficiency, they introduce additional computational cost before feature encoding, and their binary frame-selection paradigm is suboptimal. Therefore, in this work, we propose Dynamic Token compression via LLM-guided Keyframe prior (DyToK), a training-free paradigm that enables dynamic token compression by harnessing VLLMs' inherent attention mechanisms. Our analysis reveals that VLLM attention layers naturally encode query-conditioned keyframe priors, which DyToK leverages to dynamically adjust per-frame token retention ratios, prioritizing semantically rich frames while suppressing redundancies. Extensive experiments demonstrate that DyToK achieves state-of-the-art efficiency-accuracy tradeoffs. DyToK shows plug-and-play compatibility with existing compression methods, such as VisionZip and FastV, attaining 4.3x faster inference while preserving accuracy across multiple VLLMs, such as LLaVA-OneVision and Qwen2.5-VL. Code is available at https://github.com/yu-lin-li/DyToK.
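The abstract's core mechanism, allocating a global token budget across frames in proportion to a query-conditioned keyframe prior, can be sketched as follows. This is a minimal illustration, not the official DyToK implementation: the function name, the use of a single per-frame attention-mass vector, and the rounding scheme are all assumptions.

```python
import numpy as np

def dytok_retention_sketch(attn_mass, budget, min_tokens=1):
    """Hypothetical sketch of dynamic per-frame token retention.

    attn_mass: (num_frames,) query-conditioned attention mass per frame,
               e.g. aggregated from a VLLM attention layer.
    budget:    total number of visual tokens to keep across all frames.
    Returns an integer token count per frame: attention-heavy (keyframe-like)
    frames keep more tokens, redundant frames keep fewer.
    """
    prior = attn_mass / attn_mass.sum()                 # keyframe prior
    keep = np.round(prior * budget).astype(int)         # proportional budget
    keep = np.maximum(keep, min_tokens)                 # never drop a frame fully
    return keep
```

For example, a frame holding 60% of the attention mass receives roughly 60% of the token budget, rather than an all-or-nothing keyframe decision.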
Related papers
- CoPE-VideoLM: Codec Primitives For Efficient Video Language Models [56.76440182038839]
Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. Current methods rely on sampling, which can miss both macro-level events and micro-level details due to sparse temporal coverage. We propose to leverage video primitives, which encode video redundancy and sparsity without requiring expensive full-image encoding for most frames.
arXiv Detail & Related papers (2026-02-13T18:57:31Z) - LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs [52.24096832965001]
We present LLaVA-UHD v3, an MLLM centered on our proposed Progressive Visual Compression (PVC) method. PVC can be seamlessly integrated into a standard Vision Transformer (ViT) to enable efficient native-resolution encoding. Building upon ViT-UHD, LLaVA-UHD v3 achieves performance competitive with Qwen2-VL while further reducing TTFT by 1.9x.
arXiv Detail & Related papers (2025-11-26T08:11:10Z) - FLoC: Facility Location-Based Efficient Visual Token Compression for Long Video Understanding [55.700832127331324]
FLoC is an efficient visual token compression framework based on the facility location function. It achieves remarkable efficiency gains by swiftly selecting a compact subset of tokens. The approach is training-free, model-agnostic, and query-agnostic, providing a versatile solution.
arXiv Detail & Related papers (2025-10-31T17:29:39Z) - LoViC: Efficient Long Video Generation with Context Compression [68.22069741704158]
We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations.
arXiv Detail & Related papers (2025-07-17T09:46:43Z) - DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs [124.52164183968145]
We present DyMU, an efficient, training-free framework that reduces the computational burden of vision-language models (VLMs). Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity. Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of the full sequence.
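Similarity-based token merging of the kind DToMe describes can be sketched greedily: a token is absorbed into an already-kept token when their cosine similarity exceeds a threshold, otherwise it is kept. This is an illustrative sketch under assumed names and a hypothetical threshold, not the paper's algorithm.

```python
import numpy as np

def merge_similar_tokens(tokens, threshold=0.9):
    """Greedy similarity-based merging sketch (illustrative, not DToMe's API).

    tokens: iterable of embedding vectors. A token whose cosine similarity
    to a kept token exceeds `threshold` is folded into it via a running mean;
    otherwise it starts a new kept token.
    """
    kept, counts = [], []
    for t in tokens:
        t = np.asarray(t, dtype=float)
        merged = False
        for i, k in enumerate(kept):
            cos = t @ k / (np.linalg.norm(t) * np.linalg.norm(k) + 1e-8)
            if cos > threshold:
                kept[i] = (k * counts[i] + t) / (counts[i] + 1)  # running mean
                counts[i] += 1
                merged = True
                break
        if not merged:
            kept.append(t)
            counts.append(1)
    return np.stack(kept)
```

Raising the threshold trades compression for fidelity: near-duplicate tokens collapse, while distinct ones survive untouched.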
arXiv Detail & Related papers (2025-04-23T18:38:18Z) - Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation [20.67434288227437]
ViLAMP is a hierarchical video-language model that processes hour-long videos at "mixed precision". It retains full information in keyframes while reducing non-keyframes to their most salient features, resembling mixed-precision training. Notably, ViLAMP can process ultra-long videos (up to 10K frames) on a single NVIDIA A100 GPU.
arXiv Detail & Related papers (2025-04-03T09:55:09Z) - STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - VLA-Cache: Efficient Vision-Language-Action Manipulation via Adaptive Token Caching [23.52474883720957]
Vision-Language-Action (VLA) models have demonstrated strong multi-modal reasoning capabilities, enabling direct action generation from visual perception and language instructions. This paper introduces VLA-Cache, a training-free inference acceleration method that reduces computational overhead by adaptively caching and reusing static visual tokens across frames.
arXiv Detail & Related papers (2025-02-04T09:48:14Z) - DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models [28.379533608574814]
We present DyCoke, a training-free token compression method that optimizes token representation and accelerates video large language models. DyCoke incorporates a plug-and-play temporal compression module that minimizes temporal redundancy by merging redundant tokens across frames. It ensures high-quality inference by dynamically retaining the critical tokens at each decoding step.
arXiv Detail & Related papers (2024-11-22T15:55:19Z)
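Temporal-redundancy pruning of the kind DyCoke's compression module targets can be sketched as follows: a token in frame t is dropped when it is nearly identical to the same-position token in frame t-1. The function name, position-wise matching, and threshold are assumptions for illustration, not DyCoke's actual mechanism.

```python
import numpy as np

def temporal_prune(frames, threshold=0.95):
    """Illustrative temporal redundancy pruning sketch (hypothetical API).

    frames: list of (num_tokens, dim) arrays, one per video frame, with
    tokens aligned by position. The first frame keeps every token; later
    frames keep only tokens whose cosine similarity to the same-position
    token in the previous frame falls below `threshold` (i.e., changed).
    Returns the kept tokens per frame and the boolean keep-masks.
    """
    kept = [frames[0]]
    masks = [np.ones(frames[0].shape[0], dtype=bool)]
    for prev, cur in zip(frames[:-1], frames[1:]):
        num = np.einsum('nd,nd->n', prev, cur)
        denom = np.linalg.norm(prev, axis=1) * np.linalg.norm(cur, axis=1) + 1e-8
        mask = (num / denom) < threshold   # keep only tokens that changed
        kept.append(cur[mask])
        masks.append(mask)
    return kept, masks
```

Static background tokens are thus paid for once, while moving content continues to be represented in every frame.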