Clapper: Compact Learning and Video Representation in VLMs
- URL: http://arxiv.org/abs/2505.15529v1
- Date: Wed, 21 May 2025 13:52:17 GMT
- Title: Clapper: Compact Learning and Video Representation in VLMs
- Authors: Lingyu Kong, Hongzhi Zhang, Jingyuan Zhang, Jianzhao Huang, Kunze Li, Qi Wang, Fuzheng Zhang
- Abstract summary: Current vision-language models (VLMs) have demonstrated remarkable capabilities across diverse video understanding applications. We propose Clapper, a method that utilizes a slow-fast strategy for video representation and introduces a novel module named TimePerceiver for efficient temporal-spatial encoding.
- Score: 15.564506713994406
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current vision-language models (VLMs) have demonstrated remarkable capabilities across diverse video understanding applications. Designing VLMs for video inputs requires effectively modeling the temporal dimension (i.e., capturing dependencies across frames) and balancing the processing of short and long videos. Specifically, short videos demand preservation of fine-grained detail, whereas long videos require strategic compression of visual information to handle extensive temporal contexts efficiently. However, our empirical analysis reveals a critical limitation: most existing VLMs suffer severe performance degradation in long video understanding tasks when visual tokens are compressed to below a quarter of their original count. To enable more effective modeling of both short and long video inputs, we propose Clapper, a method that utilizes a slow-fast strategy for video representation and introduces a novel module named TimePerceiver for efficient temporal-spatial encoding within existing VLM backbones. With our method, we achieve 13x compression of visual tokens per frame (averaging 61 tokens/frame) without compromising QA accuracy. In our experiments, Clapper achieves 62.0% on VideoMME, 69.8% on MLVU, and 67.4% on TempCompass, all with fewer than 6,000 visual tokens per video. The code will be publicly available on the homepage.
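The abstract names two mechanisms, a slow-fast video representation and the TimePerceiver encoding module, but specifies neither in detail. The PyTorch sketch below illustrates only the general slow-fast token-budgeting idea under stated assumptions; the function name, stride, and pooling factor are hypothetical choices for illustration, not Clapper's actual design.

```python
# Hypothetical sketch of a slow-fast token budget for video frames.
# Not Clapper's implementation: stride and pooling factor are assumptions.
import torch
import torch.nn.functional as F

def slow_fast_tokens(frame_feats: torch.Tensor, slow_stride: int = 8,
                     fast_pool: int = 4) -> list[torch.Tensor]:
    """Compress per-frame vision tokens with a slow-fast split.

    frame_feats: (T, H, W, C) patch features from a vision encoder.
    Every `slow_stride`-th frame is kept at full resolution (the "slow"
    pathway, preserving fine detail); all other frames are average-pooled
    by `fast_pool` in each spatial dimension (the "fast" pathway, keeping
    coarse temporal context at low token cost).
    """
    T, H, W, C = frame_feats.shape
    out = []
    for t in range(T):
        feat = frame_feats[t]                        # (H, W, C)
        if t % slow_stride == 0:
            tokens = feat.reshape(H * W, C)          # full detail
        else:
            pooled = F.avg_pool2d(
                feat.permute(2, 0, 1).unsqueeze(0),  # (1, C, H, W)
                kernel_size=fast_pool,
            )
            tokens = pooled.squeeze(0).flatten(1).T  # (H*W/fast_pool^2, C)
        out.append(tokens)
    return out
```

For scale: at 24x24 = 576 tokens per frame, a stride of 8 with 4x4 pooling averages (576 + 7 * 36) / 8 = 103.5 tokens/frame, so the paper's reported average of 61 tokens/frame implies a more aggressive configuration than this sketch.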
Related papers
- DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding [17.319420726271876]
We introduce DynTok, a novel Dynamic video Token compression strategy. Our method reduces the number of tokens to 44.4% of the original size while maintaining comparable performance.
arXiv Detail & Related papers (2025-06-04T14:17:42Z)
- LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding [29.719450799231705]
Vision-Language Models (VLMs) obtain frame-level understanding capabilities through multi-frame input. Video Large Language Models (Video-LLMs) capture temporal relationships within visual features but are limited by the scarcity of high-quality video-text datasets. We propose Lightweight Video Compression (LVC), a novel method featuring the Query-Attention Video Compression mechanism.
arXiv Detail & Related papers (2025-04-09T12:51:10Z)
- AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding [55.320254859515714]
Multimodal Large Language Models (MLLMs) have revolutionized video understanding, yet are still limited by context length when processing long videos. We propose AdaReTaKe, a training-free method that flexibly reduces visual redundancy by allocating compression ratios among time and layers with theoretical guarantees. Experiments on the VideoMME, MLVU, LongVideoBench, and LVBench datasets demonstrate that AdaReTaKe outperforms existing methods by 2.3% and 2.8% for 7B and 72B models, respectively.
arXiv Detail & Related papers (2025-03-16T16:14:52Z) - HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models [63.65066762436074]
HiTVideo aims to address the potential limitations of existing video tokenizers in text-to-video generation tasks. It utilizes a 3D causal VAE with a multi-layer discrete token framework, encoding video content into hierarchically structured codebooks.
arXiv Detail & Related papers (2025-03-14T15:36:39Z) - Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling [56.130911402831906]
This paper aims to improve the performance of video multimodal large language models (MLLMs) via long and rich context (LRC) modeling. We develop a new version of InternVideo2.5 with a focus on enhancing the original MLLMs' ability to perceive fine-grained details in videos. Experimental results demonstrate that this unique design of LRC greatly improves the results of video MLLMs in mainstream understanding benchmarks.
arXiv Detail & Related papers (2025-01-21T18:59:00Z) - ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding [55.320254859515714]
ReTaKe enables VideoLLMs to process 8 times more frames (up to 2048), outperforming similar-sized models by 3-5% and even rivaling much larger ones on VideoMME, MLVU, LongVideoBench, and LVBench. Our code is available at https://github.com/SCZwangxiao/video-ReTaKe.
arXiv Detail & Related papers (2024-12-29T15:42:24Z) - LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos.
We leverage DINOv2 features to remove redundant frames that exhibit high similarity.
We perform spatial token reduction across frames based on their temporal dependencies.
arXiv Detail & Related papers (2024-10-22T21:21:37Z)
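The LongVU entry above describes its first stage, removing redundant frames whose DINOv2 features are highly similar, only at a high level. Below is a minimal sketch of similarity-based frame pruning under stated assumptions: the cosine threshold, the mean-pooled per-frame embedding, and the greedy comparison against the last kept frame are illustrative choices, not the paper's exact procedure.

```python
# Hypothetical sketch of similarity-based redundant-frame pruning in the
# spirit of LongVU's first stage; threshold and embedding are assumptions.
import torch
import torch.nn.functional as F

def prune_redundant_frames(frame_embs: torch.Tensor,
                           sim_threshold: float = 0.95) -> list[int]:
    """Keep a frame only if it differs enough from the last kept frame.

    frame_embs: (T, D) one embedding per frame, e.g. mean-pooled DINOv2
    patch features. Returns the indices of the frames to keep.
    """
    keep = [0]  # always keep the first frame
    for t in range(1, frame_embs.shape[0]):
        sim = F.cosine_similarity(frame_embs[t], frame_embs[keep[-1]], dim=0)
        if sim < sim_threshold:  # frame adds new visual content
            keep.append(t)
    return keep
```

The second stage the entry mentions (spatial token reduction guided by temporal dependencies) would then operate only on the kept frames.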