FastVID: Dynamic Density Pruning for Fast Video Large Language Models
- URL: http://arxiv.org/abs/2503.11187v1
- Date: Fri, 14 Mar 2025 08:33:08 GMT
- Title: FastVID: Dynamic Density Pruning for Fast Video Large Language Models
- Authors: Leqi Shen, Guoqiang Gong, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, Guiguang Ding
- Abstract summary: We propose Dynamic Density Pruning for Fast Video LLMs termed FastVID. FastVID partitions videos into temporally ordered segments to preserve temporal structure and applies a density-based token pruning strategy to maintain essential visual information. Our method significantly reduces computational overhead while maintaining temporal and visual integrity.
- Score: 38.267065642416554
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Large Language Models have shown impressive capabilities in video comprehension, yet their practical deployment is hindered by substantial inference costs caused by redundant video tokens. Existing pruning techniques fail to fully exploit the spatiotemporal redundancy inherent in video data. To bridge this gap, we perform a systematic analysis of video redundancy from two perspectives: temporal context and visual context. Leveraging this insight, we propose Dynamic Density Pruning for Fast Video LLMs termed FastVID. Specifically, FastVID dynamically partitions videos into temporally ordered segments to preserve temporal structure and applies a density-based token pruning strategy to maintain essential visual information. Our method significantly reduces computational overhead while maintaining temporal and visual integrity. Extensive evaluations show that FastVID achieves state-of-the-art performance across various short- and long-video benchmarks on leading Video LLMs, including LLaVA-OneVision and LLaVA-Video. Notably, FastVID effectively prunes 90% of video tokens while retaining 98.0% of LLaVA-OneVision's original performance. The code is available at https://github.com/LunarShen/FastVID.
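Below is a minimal sketch of the two ideas the abstract describes: temporal segmentation followed by density-based token pruning. The boundary rule (a drop in cosine similarity between adjacent frame features), the density proxy (a token's mean similarity to the other tokens in its segment), and the 10% keep ratio are assumptions for illustration, not FastVID's exact formulation.

```python
import torch
import torch.nn.functional as F

def segment_video(frame_feats, sim_thresh=0.8):
    """Split frames into temporally ordered segments.

    A new segment starts whenever the cosine similarity between consecutive
    frame features drops below `sim_thresh` (an assumed boundary rule, not
    FastVID's exact criterion). frame_feats: (T, D), one pooled feature per frame.
    """
    feats = F.normalize(frame_feats, dim=-1)
    sim = (feats[1:] * feats[:-1]).sum(-1)  # (T-1,) adjacent-frame similarity
    cuts = (torch.nonzero(sim < sim_thresh).squeeze(-1) + 1).tolist()
    bounds = [0] + cuts + [len(feats)]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

def prune_segment(tokens, keep_ratio=0.1):
    """Keep the highest-density tokens within one segment.

    Density is approximated as a token's mean cosine similarity to the other
    tokens in the segment (an illustrative proxy). tokens: (N, D).
    """
    t = F.normalize(tokens, dim=-1)
    density = (t @ t.T).mean(-1)                  # (N,)
    k = max(1, int(keep_ratio * tokens.shape[0]))
    keep = density.topk(k).indices.sort().values  # preserve original token order
    return tokens[keep]

# Toy usage: 16 frames, 196 visual tokens each, 768-dim features.
frames = torch.randn(16, 196, 768)
frame_feats = frames.mean(dim=1)                  # per-frame pooled feature
segments = segment_video(frame_feats)
pruned = [prune_segment(frames[s:e].flatten(0, 1)) for s, e in segments]
video_tokens = torch.cat(pruned)                  # ~90% fewer tokens at keep_ratio=0.1
print(len(segments), video_tokens.shape)
```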
Related papers
- An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes [85.00111442236499]
This paper presents Quicksviewer, an LMM with a new perceiving paradigm that partitions a video of nonuniform density into varying cubes using Gumbel Softmax.
We train the model from a language backbone through three progressive stages, each incorporating lengthy videos averaging 420s at 1fps thanks to the perceiving efficiency.
With only 0.8M total video-text samples for training, our model outperforms the direct baseline employing a fixed partitioning strategy by a maximum of 8.72 in accuracy.
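As a rough illustration of the cubing idea, the sketch below samples a hard but differentiable "start a new cube here?" decision per frame with Gumbel-Softmax. The boundary head, its size, and the cube-id bookkeeping are assumptions for illustration, not Quicksviewer's actual module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CubePartitioner(nn.Module):
    """Illustrative cubing head: per-frame Gumbel-Softmax boundary decisions
    yield variable-size cubes. A sketch of the idea only."""

    def __init__(self, dim=768):
        super().__init__()
        self.boundary_head = nn.Linear(dim, 2)  # logits for [continue, new cube]

    def forward(self, frame_feats, tau=1.0):
        logits = self.boundary_head(frame_feats)             # (T, 2)
        hard = F.gumbel_softmax(logits, tau=tau, hard=True)  # (T, 2) one-hot, straight-through
        is_boundary = hard[:, 1]                             # (T,) 1.0 where a new cube starts
        cube_id = torch.cumsum(is_boundary, dim=0)           # frames between boundaries share an id
        return cube_id

feats = torch.randn(32, 768)           # 32 frame features
cube_id = CubePartitioner()(feats)
print(cube_id)                          # e.g. 0,0,1,1,1,2,... variable-size cubes
```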
arXiv Detail & Related papers (2025-04-21T17:57:21Z)
- Slow-Fast Architecture for Video Multi-Modal Large Language Models [42.3957835391319]
Existing methods compress video representations using predefined rules before feeding them into the multi-modal large language model.
We propose a novel slow-fast architecture that naturally circumvents this trade-off, enabling the use of more input frames while preserving spatial details.
Our model significantly outperforms self-attention-only baselines, extending the input capacity from 16 to 128 frames with just a 3% increase in computation.
arXiv Detail & Related papers (2025-04-02T03:24:58Z)
- VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers [23.541896057977745]
VideoScan is an efficient vision-language model (VLM) inference framework for real-time video interaction. VideoScan employs a single semantic carrier token to represent each frame.
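A minimal sketch of the frame-to-single-token idea, using plain mean pooling over a frame's patch tokens as an assumed aggregation; VideoScan's actual carrier mechanism may differ.

```python
import torch

def frame_carrier_tokens(patch_tokens):
    """Collapse each frame's patch tokens into one 'semantic carrier' token.

    patch_tokens: (T, N, D) patch tokens per frame.
    Returns (T, D): one token per frame. Mean pooling is an assumed
    aggregation used purely for illustration.
    """
    return patch_tokens.mean(dim=1)

video = torch.randn(64, 196, 1024)      # 64 frames x 196 patches
carriers = frame_carrier_tokens(video)  # (64, 1024): 196x fewer tokens per frame
print(carriers.shape)
```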
arXiv Detail & Related papers (2025-03-12T13:30:40Z)
- Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
- ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding [55.320254859515714]
We introduce a training-free method, ReTaKe, to reduce both temporal visual redundancy and knowledge redundancy for long video understanding. DPSelect identifies keyframes with local maximum peak distance based on their visual features, which are closely aligned with human video perception. PivotKV employs the selected keyframes as pivots and conducts KV-Cache compression for the non-pivot tokens with low attention scores.
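The sketch below gives one illustrative reading of the keyframe-selection step: use 1 minus the cosine similarity between consecutive frame features as a change signal and keep frames at its local peaks. Smoothing, thresholds, and the KV-Cache compression step of the actual method are omitted.

```python
import torch
import torch.nn.functional as F

def select_peak_frames(frame_feats):
    """Pick frames whose distance to the previous frame is a local maximum.

    frame_feats: (T, D). The distance signal and peak rule are assumptions
    for illustration, not the paper's exact DPSelect procedure.
    """
    f = F.normalize(frame_feats, dim=-1)
    dist = 1.0 - (f[1:] * f[:-1]).sum(-1)                      # (T-1,) change signal
    # local maxima: strictly larger than both neighbours in the signal
    peak = (dist[1:-1] > dist[:-2]) & (dist[1:-1] > dist[2:])  # (T-3,)
    keyframes = torch.nonzero(peak).squeeze(-1) + 2            # shift back to frame indices
    return torch.cat([torch.tensor([0]), keyframes])           # always keep the first frame

feats = torch.randn(128, 768)
print(select_peak_frames(feats))
```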
arXiv Detail & Related papers (2024-12-29T15:42:24Z)
- SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation [153.46240555355408]
SlowFast-VGen is a novel dual-speed learning system for action-driven long video generation.
Our approach incorporates a conditional video diffusion model for the slow learning of world dynamics.
We propose a slow-fast learning loop algorithm that seamlessly integrates the inner fast learning loop into the outer slow learning loop.
arXiv Detail & Related papers (2024-10-30T17:55:52Z)
- LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos.
We leverage DINOv2 features to remove redundant frames that exhibit high similarity.
We perform spatial token reduction across frames based on their temporal dependencies.
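A hedged sketch of the redundant-frame-removal step only: greedily drop a frame when its feature is highly similar to the last kept frame. Random features stand in for DINOv2 embeddings here, and the greedy rule and 0.95 threshold are assumptions; the spatial token reduction step is not shown.

```python
import torch
import torch.nn.functional as F

def drop_redundant_frames(frame_feats, sim_thresh=0.95):
    """Keep a frame only if it is not too similar to the last kept frame.

    frame_feats: (T, D) frame-level features (e.g. DINOv2 embeddings in
    LongVU; random features stand in for them here).
    """
    f = F.normalize(frame_feats, dim=-1)
    kept = [0]
    for t in range(1, f.shape[0]):
        if (f[t] @ f[kept[-1]]).item() < sim_thresh:
            kept.append(t)
    return kept

feats = torch.randn(256, 1024)   # stand-in for per-frame DINOv2 features
print(len(drop_redundant_frames(feats)))
```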
arXiv Detail & Related papers (2024-10-22T21:21:37Z)
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models [51.712700398020075]
We propose a training-free video large language model (LLM) that can jointly capture detailed spatial semantics and long-range temporal context.
This is realized by using a two-stream SlowFast design of inputs for Video LLMs to aggregate features from sampled frames in an effective way.
Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks.
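A rough sketch of a two-stream input design: a slow path keeps full spatial detail on a few frames, and a fast path keeps many frames but pools their tokens heavily before both are concatenated for the LLM. The stride and pooling sizes below are illustrative, not SF-LLaVA's exact configuration.

```python
import torch
import torch.nn.functional as F

def slowfast_tokens(frames, slow_stride=8, fast_pool=4):
    """Two-stream input: 'slow' keeps full spatial tokens on sparse frames,
    'fast' keeps all frames but average-pools their token grids.
    frames: (T, N, D) visual tokens per frame, N a square number.
    """
    T, N, D = frames.shape
    h = int(N ** 0.5)
    slow = frames[::slow_stride].flatten(0, 1)            # (T/stride * N, D)
    grid = frames.transpose(1, 2).reshape(T, D, h, h)     # (T, D, h, h)
    fast = F.avg_pool2d(grid, fast_pool)                  # (T, D, h/4, h/4)
    fast = fast.flatten(2).transpose(1, 2).flatten(0, 1)  # (T * (h/4)^2, D)
    return torch.cat([slow, fast], dim=0)

tokens = slowfast_tokens(torch.randn(64, 576, 1024))      # 64 frames, 24x24 patches
print(tokens.shape)                                       # far fewer than 64*576 tokens
```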
arXiv Detail & Related papers (2024-07-22T17:58:04Z)