Think-Clip-Sample: Slow-Fast Frame Selection for Video Understanding
- URL: http://arxiv.org/abs/2601.11359v1
- Date: Fri, 16 Jan 2026 15:14:04 GMT
- Title: Think-Clip-Sample: Slow-Fast Frame Selection for Video Understanding
- Authors: Wenhui Tan, Ruihua Song, Jiaze Li, Jianzhong Ju, Zhenbo Luo,
- Abstract summary: Think-Clip-Sample (TCS) is a training-free framework that enhances long video understanding through two key components.<n>Extensive experiments on MLVU, LongVideoBench, and VideoMME demonstrate that TCS consistently improves performance across different MLLMs.
- Score: 18.742640666790827
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent progress in multi-modal large language models (MLLMs) has significantly advanced video understanding. However, their performance on long-form videos remains limited by computational constraints and suboptimal frame selection. We present Think-Clip-Sample (TCS), a training-free framework that enhances long video understanding through two key components: (i) Multi-Query Reasoning, which generates multiple queries to capture complementary aspects of the question and video; and (ii) Clip-level Slow-Fast Sampling, which adaptively balances dense local details and sparse global context. Extensive experiments on MLVU, LongVideoBench, and VideoMME demonstrate that TCS consistently improves performance across different MLLMs, boosting up to 6.9% accuracy, and is capable of achieving comparable accuracy with 50% fewer inference time cost, highlighting both efficiency and efficacy of TCS on long video understanding.
Related papers
- Towards Effective and Efficient Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval [57.88666884515147]
We propose One-shot video-Clip based Retrieval AuGmentation (OneClip-RAG)<n>OneClip-RAG makes full use of the merits of video clips for augmented video understanding.<n>It is also equipped with a novel query-guided video chunking algorithm.
arXiv Detail & Related papers (2025-12-09T09:40:20Z) - FLoC: Facility Location-Based Efficient Visual Token Compression for Long Video Understanding [55.700832127331324]
FLoC is an efficient visual token compression framework based on the facility location function.<n>Our method achieves remarkable efficiency gains by swiftly selecting a compact subset of tokens.<n>Our approach is training-free, model-agnostic, and query-agnostic, providing a versatile solution.
arXiv Detail & Related papers (2025-10-31T17:29:39Z) - From Frames to Clips: Efficient Key Clip Selection for Long-Form Video Understanding [43.82717677801915]
Video Large Language Models (VLMs) have achieved remarkable results on a variety of vision language tasks.<n>Their practical use is limited by the "needle in a haystack" problem: the massive number of visual tokens produced from raw video frames exhausts the model's context window.<n>We show that extending selection from isolated key frames to key clips, which are short, temporally coherent segments, improves video understanding.
arXiv Detail & Related papers (2025-10-02T17:43:01Z) - TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding [25.675553077419274]
Multimodal Language Models (MLLMs) have demonstrated significant progress in vision tasks, yet they still face challenges when processing long-duration inputs.<n>We propose Temporal Sampling Policy Optimization (TSPO), advancing MLLMs' long-form video-language understanding via reinforcement learning.<n>Our TSPO state-of-the-art performance across multiple long video understanding benchmarks, and shows transferable ability across different cutting-edge Video-MLLMs.
arXiv Detail & Related papers (2025-08-06T12:03:36Z) - BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding [51.49345400300556]
Large video-language models (VLMs) have demonstrated promising progress in various video understanding tasks.<n>Traditional approaches, such as uniform frame sampling, often inevitably allocate resources to irrelevant content.<n>We introduce BOLT, a method to BOost Large VLMs without additional Training through a comprehensive study of frame selection strategies.
arXiv Detail & Related papers (2025-03-27T13:18:40Z) - STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs.<n>We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling [56.130911402831906]
This paper aims to improve the performance of video large language models (LM) via long and rich context (LRC) modeling.<n>We develop a new version of InternVideo2.5 with focus on enhancing the original MLLMs' ability to perceive fine-grained details in videos.<n> Experimental results demonstrate this unique designML LRC greatly improves the results of video MLLM in mainstream understanding benchmarks.
arXiv Detail & Related papers (2025-01-21T18:59:00Z) - LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos.
We leverage DINOv2 features to remove redundant frames that exhibit high similarity.
We perform spatial token reduction across frames based on their temporal dependencies.
arXiv Detail & Related papers (2024-10-22T21:21:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.