Related papers: ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding

URL: http://arxiv.org/abs/2412.20504v2
Date: Sun, 05 Jan 2025 14:11:48 GMT
Title: ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding
Authors: Xiao Wang, Qingyi Si, Jianlong Wu, Shiyu Zhu, Li Cao, Liqiang Nie
Abstract summary: We introduce a training-free method, ReTaKe, to reduce both temporal visual redundancy and knowledge redundancy for long video understanding. DPSelect identifies keyframes with local maximum peak distance based on their visual features, which are closely aligned with human video perception. PivotKV employs the obtained keyframes as pivots and conducts KV-Cache compression for the non-pivot tokens with low attention scores.
Score: 55.320254859515714
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video Large Language Models (VideoLLMs) have achieved remarkable progress in video understanding. However, existing VideoLLMs often inherit the limitations of their backbone LLMs in handling long sequences, leading to challenges for long video understanding. Common solutions either uniformly sample video frames or compress visual tokens; both focus primarily on low-level temporal visual redundancy and overlook high-level knowledge redundancy. This limits the compression rate achievable with minimal loss. To this end, we introduce a training-free method, ReTaKe, containing two novel modules, DPSelect and PivotKV, to jointly model and reduce both temporal visual redundancy and knowledge redundancy for long video understanding. Specifically, DPSelect identifies keyframes with local maximum peak distance based on their visual features, which are closely aligned with human video perception. PivotKV employs the obtained keyframes as pivots and conducts KV-Cache compression for the non-pivot tokens with low attention scores, which are derived from the learned prior knowledge of LLMs. Experiments on the VideoMME, MLVU, and LVBench benchmarks show that ReTaKe can support 4x longer video sequences with minimal performance loss (<1%) and outperform all similar-size VideoLLMs by 3%-5%, even surpassing or on par with much larger ones. Our code is available at https://github.com/SCZwangxiao/video-ReTaKe
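
The two ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). It assumes cosine distance between consecutive frame features as the "distance", a simple strict-left/non-strict-right rule for local maxima, and a fixed keep ratio for non-pivot tokens. The function names select_keyframes and compress_kv, and the parameters min_distance and keep_ratio, are hypothetical.

```python
import math


def frame_distances(features):
    """Cosine distance between each frame feature and its predecessor.

    Entry i of the result is the distance between frame i and frame i + 1.
    Cosine distance is an assumption; the abstract does not name the metric.
    """
    dists = []
    for prev, cur in zip(features, features[1:]):
        dot = sum(a * b for a, b in zip(prev, cur))
        norm = math.sqrt(sum(a * a for a in prev)) * math.sqrt(sum(b * b for b in cur))
        dists.append(1.0 - dot / norm)
    return dists


def select_keyframes(features, min_distance=0.0):
    """Return indices of frames that sit at a local maximum of frame distance.

    Hypothetical stand-in for DPSelect: frame i + 1 is a keyframe when the
    distance from frame i to frame i + 1 is a local peak above min_distance.
    """
    d = frame_distances(features)
    keyframes = []
    for i, value in enumerate(d):
        left = d[i - 1] if i > 0 else -math.inf
        right = d[i + 1] if i + 1 < len(d) else -math.inf
        if value > min_distance and value > left and value >= right:
            keyframes.append(i + 1)
    return keyframes


def compress_kv(attention_scores, pivot_tokens, keep_ratio):
    """Return sorted indices of KV-cache entries to keep.

    Hypothetical stand-in for PivotKV: every pivot token is kept, and only
    the highest-attention fraction (keep_ratio) of non-pivot tokens survives.
    """
    pivots = set(pivot_tokens)
    non_pivot = [i for i in range(len(attention_scores)) if i not in pivots]
    budget = int(len(non_pivot) * keep_ratio)
    ranked = sorted(non_pivot, key=lambda i: attention_scores[i], reverse=True)
    return sorted(pivots | set(ranked[:budget]))


# Toy usage: five 2-D "frame features" with two abrupt scene changes.
toy_features = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0], [1.0, 0.0]]
toy_keyframes = select_keyframes(toy_features)

# Toy usage: seven tokens, token 3 is a pivot, half of the rest are kept.
toy_scores = [0.9, 0.1, 0.5, 0.05, 0.3, 0.2, 0.4]
toy_kept = compress_kv(toy_scores, pivot_tokens=[3], keep_ratio=0.5)
```

In this sketch the pivot token is retained even though its attention score is the lowest, which mirrors the abstract's description of compressing only non-pivot tokens.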

Related papers

Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory [50.30283773196725]
Existing approaches rely on key-value caching to accumulate frame-level details over time, but use a limited number of tokens per frame. We propose scaling the token budget to enable more granular-temporal understanding and reasoning.
arXiv Detail & Related papers (2026-02-20T18:59:50Z)
MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding [13.02027465520324]
We propose MARC, which integrates structured retrieval and RL-based distillation. MARC achieves near-baseline accuracy using only one frame's tokens. This demonstrates its potential for efficient, real-time video understanding in resource-constrained settings.
arXiv Detail & Related papers (2025-10-09T08:07:19Z)
Clapper: Compact Learning and Video Representation in VLMs [15.564506713994406]
Current vision-language models (VLMs) have demonstrated remarkable capabilities across diverse video understanding applications. We propose Clapper, a method that utilizes a slow-fast strategy for video representation and introduces a novel module named TimePerceiver for efficient temporal-spatial encoding.
arXiv Detail & Related papers (2025-05-21T13:52:17Z)
AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding [55.320254859515714]
Multimodal Large Language Models (MLLMs) have revolutionized video understanding, yet are still limited by context length when processing long videos. We propose AdaReTaKe, a training-free method that flexibly reduces visual redundancy by allocating compression ratios among time and layers with theoretical guarantees. Experiments on VideoMME, MLVU, LongVideoBench, and LVBench datasets demonstrate that AdaReTaKe outperforms existing methods by 2.3% and 2.8% for 7B and 72B models, respectively.
arXiv Detail & Related papers (2025-03-16T16:14:52Z)
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering [46.199493246921435]
Video Question Answering (VQA) in long videos poses the key challenge of extracting relevant information. We introduce BIMBA, an efficient state-space model to handle long-form videos.
arXiv Detail & Related papers (2025-03-12T17:57:32Z)
Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
AdaCM$^2$: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction [10.579335027350263]
AdaCM$^2$ is an adaptive cross-modality memory reduction approach to video-text alignment on video streams. It achieves a 4.5% improvement across multiple tasks in the LVU dataset with a GPU memory consumption reduction of up to 65%.
arXiv Detail & Related papers (2024-11-19T18:04:13Z)
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos. We leverage DINOv2 features to remove redundant frames that exhibit high similarity. We perform spatial token reduction across frames based on their temporal dependencies.
arXiv Detail & Related papers (2024-10-22T21:21:37Z)
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding [25.61734041983714]
Video-XL is a novel approach that leverages MLLMs' inherent key-value sparsification capacity to condense the visual input. Video-XL's effectiveness is verified from three aspects. First, it achieves a superior long-video understanding capability, outperforming state-of-the-art models of comparable sizes.
arXiv Detail & Related papers (2024-09-22T15:13:31Z)
Koala: Key frame-conditioned long video-LLM [70.52369588364992]
We propose a lightweight and self-supervised long video-LLM (Koala) to adapt pretrained vLLMs for generalizing to longer videos. Our approach outperforms state-of-the-art large models by 3-6% in absolute accuracy across all tasks. Surprisingly, we also empirically show that our approach not only helps a pretrained vLLM to understand long videos but also improves its accuracy on short-term action recognition.
arXiv Detail & Related papers (2024-04-05T18:33:04Z)
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [57.758863967770594]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion. We expose two limitations to the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
Use Your Head: Improving Long-Tail Video Recognition [28.506807977493434]
We demonstrate that, unlike naturally-collected video datasets and existing long-tail image benchmarks, current video benchmarks fall short on multiple long-tailed properties. We propose new video benchmarks that better assess long-tail recognition, by sampling subsets from two datasets: SSv2 and VideoLT.
arXiv Detail & Related papers (2023-04-03T17:09:47Z)
Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning [44.412145665354736]
We introduce a novel contrastive action representation learning framework to learn frame-wise action representations. Inspired by the recent progress of self-supervised learning, we present a novel sequence contrastive loss (SCL) applied on two correlated views. Our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks.
arXiv Detail & Related papers (2022-03-28T17:59:54Z)
Learning for Video Compression with Recurrent Auto-Encoder and Recurrent Probability Model [164.7489982837475]
This paper proposes a Recurrent Learned Video Compression (RLVC) approach with the Recurrent Auto-Encoder (RAE) and Recurrent Probability Model (RPM). The RAE employs recurrent cells in both the encoder and decoder to exploit the temporal correlation among video frames. Our approach achieves state-of-the-art learned video compression performance in terms of both PSNR and MS-SSIM.
arXiv Detail & Related papers (2020-06-24T08:46:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.