DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding
- URL: http://arxiv.org/abs/2506.03990v1
- Date: Wed, 04 Jun 2025 14:17:42 GMT
- Title: DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding
- Authors: Hongzhi Zhang, Jingyuan Zhang, Xingguang Ji, Qi Wang, Fuzheng Zhang
- Abstract summary: We introduce DynTok, a novel Dynamic video Token compression strategy. Our method reduces the number of tokens to 44.4% of the original size while maintaining comparable performance.
- Score: 17.319420726271876
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Typical video modeling methods, such as LLaVA, represent videos as sequences of visual tokens, which are then processed by the LLM backbone for effective video understanding. However, this approach leads to a massive number of visual tokens, especially for long videos. A practical solution is to first extract relevant visual information from the large visual context before feeding it into the LLM backbone, thereby reducing computational overhead. In this work, we introduce DynTok, a novel \textbf{Dyn}amic video \textbf{Tok}en compression strategy. DynTok adaptively splits visual tokens into groups and merges them within each group, achieving high compression in regions with low information density while preserving essential content. Our method reduces the number of tokens to 44.4% of the original size while maintaining comparable performance. It further benefits from increasing the number of video frames and achieves 65.3% on Video-MME and 72.5% on MLVU. By applying this simple yet effective compression method, we expose the redundancy in video token representations and offer insights for designing more efficient video modeling techniques.
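The abstract describes the mechanism only at a high level, so the snippet below is a minimal PyTorch sketch of the grouping-and-merging idea rather than the authors' released implementation: the cosine-similarity grouping rule, the threshold value, and the function name `dynamic_merge` are all assumptions. Adjacent tokens that are nearly identical fall into the same group and are mean-pooled, so redundant regions compress heavily while distinctive tokens survive.

```python
import torch
import torch.nn.functional as F

def dynamic_merge(tokens: torch.Tensor, sim_threshold: float = 0.9) -> torch.Tensor:
    """Hypothetical DynTok-style compression of one visual token sequence.

    tokens: (N, D) visual tokens in temporal/spatial order. Adjacent tokens
    whose cosine similarity exceeds `sim_threshold` join the same group and
    are mean-pooled; a similarity drop starts a new group, so redundant
    regions shrink aggressively while informative ones are kept.
    """
    normed = F.normalize(tokens, dim=-1)
    sim = (normed[1:] * normed[:-1]).sum(dim=-1)            # (N-1,) neighbor similarity
    new_group = torch.cat([torch.tensor([True]), sim < sim_threshold])
    group_ids = torch.cumsum(new_group.long(), dim=0) - 1   # (N,) group index per token

    num_groups = int(group_ids[-1].item()) + 1
    merged = torch.zeros(num_groups, tokens.size(-1))
    counts = torch.zeros(num_groups, 1)
    merged.index_add_(0, group_ids, tokens)                 # sum tokens per group
    counts.index_add_(0, group_ids, torch.ones(tokens.size(0), 1))
    return merged / counts                                  # (num_groups, D) mean per group

# Toy check: 64 frames whose 16 spatial tokens are nearly identical
# collapse to roughly one merged token per frame.
frames = torch.randn(64, 1024)
video_tokens = frames.repeat_interleave(16, dim=0) + 0.01 * torch.randn(64 * 16, 1024)
print(video_tokens.shape, "->", dynamic_merge(video_tokens).shape)
```

A real implementation would presumably also consider spatial neighborhoods and tune or learn the grouping criterion, but even this greedy temporal grouping shows how low-information regions collapse to a handful of tokens.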
Related papers
- Clapper: Compact Learning and Video Representation in VLMs [15.564506713994406]
Current vision-language models (VLMs) have demonstrated remarkable capabilities across diverse video understanding applications. We propose Clapper, a method that utilizes a slow-fast strategy for video representation and introduces a novel module named TimePerceiver for efficient temporal-spatial encoding.
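The summary names the slow-fast strategy but not its details, so the following is a hedged sketch of how such a split is commonly realized; the frame stride, pooling size, and helper name `slow_fast_tokens` are assumptions, and TimePerceiver is not modeled. A sparse set of frames keeps full spatial token resolution while every frame contributes a heavily pooled, low-cost representation.

```python
import torch
import torch.nn.functional as F

def slow_fast_tokens(frame_tokens: torch.Tensor,
                     slow_stride: int = 8,
                     fast_pool: int = 4) -> torch.Tensor:
    """Illustrative slow-fast token construction (not Clapper's actual code).

    frame_tokens: (T, H, W, D) patch tokens for T frames.
    Slow path: every `slow_stride`-th frame keeps its full H x W tokens.
    Fast path: every frame is average-pooled to (H/fast_pool) x (W/fast_pool),
    giving dense temporal coverage at low spatial cost.
    """
    T, H, W, D = frame_tokens.shape
    slow = frame_tokens[::slow_stride].reshape(-1, D)        # detailed, sparse in time

    per_frame = frame_tokens.permute(0, 3, 1, 2)             # (T, D, H, W)
    fast = F.avg_pool2d(per_frame, kernel_size=fast_pool)    # (T, D, H/p, W/p)
    fast = fast.permute(0, 2, 3, 1).reshape(-1, D)           # coarse, dense in time

    return torch.cat([slow, fast], dim=0)

tokens = torch.randn(32, 24, 24, 1024)       # 32 frames of 24x24 patches
print(slow_fast_tokens(tokens).shape)        # 4*576 slow + 32*36 fast = 3456 tokens
```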
arXiv Detail & Related papers (2025-05-21T13:52:17Z)
- An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes [85.00111442236499]
This paper presents Quicksviewer, an LMM with a new perceiving paradigm that partitions a video of nonuniform density into varying cubes using Gumbel Softmax. We train the model from a language backbone through three progressive stages, each incorporating lengthy videos averaging 420s at 1fps thanks to the perceiving efficiency. With only 0.8M total video-text samples for training, our model outperforms the direct baseline employing a fixed partitioning strategy by a maximum of 8.72 in accuracy.
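Only the Gumbel-Softmax partitioning is named above, so the sketch below illustrates that mechanism in isolation; the `CubePartitioner` module, the binary boundary head, and the straight-through sampling setup are assumptions, and the subsequent cube resampling step is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CubePartitioner(nn.Module):
    """Toy Gumbel-Softmax partitioner (an illustration, not Quicksviewer itself).

    For each frame, a tiny head predicts logits for a binary choice:
    "start a new cube here" vs. "continue the current cube". Sampling with
    F.gumbel_softmax(hard=True) keeps the decision discrete in the forward
    pass while staying differentiable via the straight-through estimator.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.boundary_head = nn.Linear(dim, 2)   # [continue, new-cube] logits

    def forward(self, frame_feats: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        logits = self.boundary_head(frame_feats)               # (T, 2)
        choice = F.gumbel_softmax(logits, tau=tau, hard=True)  # one-hot per frame
        is_boundary = choice[:, 1].clone()                     # (T,) in {0, 1}
        is_boundary[0] = 1.0                                   # first frame opens a cube
        cube_ids = torch.cumsum(is_boundary, dim=0) - 1        # frame -> cube index
        return cube_ids

feats = torch.randn(16, 1024)               # 16 per-frame features
print(CubePartitioner(1024)(feats))         # e.g. tensor([0., 0., 1., 1., 2., ...])
```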
arXiv Detail & Related papers (2025-04-21T17:57:21Z)
- REEF: Relevance-Aware and Efficient LLM Adapter for Video Understanding [2.309018557701645]
Recent methods often compress memory banks to handle untrimmed videos for video-level understanding. To address this, we designed REEF to compress untrimmed videos on a large scale using visual tokens.
arXiv Detail & Related papers (2025-04-07T20:36:34Z)
- InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression [1.8893427856534721]
We propose InternVL-X, which outperforms the InternVL model in both performance and efficiency. By utilizing 20% or fewer visual tokens, InternVL-X achieves state-of-the-art performance on 7 public MLLM benchmarks, and improves the average metric by 2.34% across 12 tasks.
arXiv Detail & Related papers (2025-03-27T09:31:35Z)
- Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding [12.215829700340988]
Video-XL-Pro is an efficient method for extremely long video understanding. Video-XL-Pro can process over 8K frames on a single A100 GPU.
arXiv Detail & Related papers (2025-03-24T09:21:48Z)
- ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding [55.320254859515714]
ReTaKe enables VideoLLMs to process 8 times more frames (up to 2048), outperforming similar-sized models by 3-5% and even rivaling much larger ones on VideoMME, MLVU, LongVideoBench, and LVBench. Our code is available at https://github.com/SCZwangxiao/video-ReTaKe.
arXiv Detail & Related papers (2024-12-29T15:42:24Z)
- PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models [64.9366388601049]
Visual token compression is leveraged to reduce the considerable token length of visual inputs. We introduce a unified token compression strategy called Progressive Visual Token Compression. Our model achieves state-of-the-art performance across various video understanding benchmarks.
arXiv Detail & Related papers (2024-12-12T18:59:40Z)
- Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM [28.64108439552772]
We introduce a large-scale synthetic dataset created from proprietary models. We also explore a dynamic visual token compression architecture that strikes a balance between computational efficiency and performance. Our proposed model achieves state-of-the-art results across various video tasks and shows impressive generalization.
arXiv Detail & Related papers (2024-12-12T18:20:41Z)
- TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models [52.590072198551944]
Recent advances in multimodal Large Language Models (LLMs) have shown great success in understanding multi-modal contents.
For video understanding tasks, training-based video LLMs are difficult to build due to the scarcity of high-quality, curated video-text paired data.
In this work, we explore the limitations of the existing compression strategies for building a training-free video LLM.
arXiv Detail & Related papers (2024-11-17T13:08:29Z)
- VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach to reduce vision compute by letting redundant vision tokens skip layers rather than decreasing the number of vision tokens.
Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
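As a rough illustration of the mixture-of-depths idea described above (not VideoLLM-MoD's actual architecture; the linear router, the capacity fraction, and the stand-in transformer block are assumptions), a per-layer router can send only the highest-scoring vision tokens through the layer while the rest ride the residual path.

```python
import torch
import torch.nn as nn

class MoDVisionLayer(nn.Module):
    """Sketch of mixture-of-depths routing for vision tokens (illustrative only).

    A linear router scores every vision token; only the top `capacity` fraction
    is processed by this layer's (stand-in) transformer block, and the rest
    skip the layer through the residual connection, saving compute.
    """
    def __init__(self, dim: int, capacity: float = 0.3):
        super().__init__()
        self.router = nn.Linear(dim, 1)
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.capacity = capacity

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        B, N, D = vision_tokens.shape
        k = max(1, int(N * self.capacity))
        scores = self.router(vision_tokens).squeeze(-1)          # (B, N)
        top_idx = scores.topk(k, dim=-1).indices                 # tokens worth full compute
        idx = top_idx.unsqueeze(-1).expand(-1, -1, D)

        selected = torch.gather(vision_tokens, 1, idx)           # (B, k, D)
        processed = self.block(selected)                         # heavy compute on k tokens only

        out = vision_tokens.clone()                              # everyone else skips the layer
        out.scatter_(1, idx, processed)
        return out

x = torch.randn(2, 1024, 768)          # 2 clips x 1024 vision tokens each
print(MoDVisionLayer(768)(x).shape)    # torch.Size([2, 1024, 768])
```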
arXiv Detail & Related papers (2024-08-29T17:21:58Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of its spatiotemporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is capable of both comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- Towards Scalable Neural Representation for Diverse Videos [68.73612099741956]
Implicit neural representations (INR) have gained increasing attention in representing 3D scenes and images.
Existing INR-based methods are limited to encoding a handful of short videos with redundant visual content.
This paper focuses on developing neural representations for encoding long and/or a large number of videos with diverse visual content.
arXiv Detail & Related papers (2023-03-24T16:32:19Z)