Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs
- URL: http://arxiv.org/abs/2510.17364v1
- Date: Mon, 20 Oct 2025 10:04:49 GMT
- Title: Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs
- Authors: Vaggelis Dorovatas, Soroush Seifi, Gunshi Gupta, Rahaf Aljundi
- Abstract summary: We propose a training-free approach compatible with standard Video-LLMs. Our attention-based selection allows us to discard up to 95% of unimportant visual tokens with minimal performance loss. Our method achieves state-of-the-art performance on streaming video benchmarks.
- Score: 7.06290511446344
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video Large Language Models (Video-LLMs) excel at understanding videos in-context, provided they have full access to the video when answering queries. However, these models face challenges in streaming scenarios where hour-long videos must be processed online, and questions need timely responses. In this work, we propose a training-free approach compatible with standard Video-LLMs, leveraging three key concepts: 1) LLM-informed selection of visual tokens to identify those that the LLM has attended to and contributed to its understanding of each short clip. Our attention-based selection allows us to discard up to ~95% of unimportant visual tokens with minimal performance loss; 2) Recurrent processing of past selected tokens to generate temporally coherent understanding of each processed clip; 3) Caption-based question answering for lightweight and accurate responses. Our method achieves state-of-the-art performance on streaming video benchmarks, striking a balance between efficiency and effectiveness.
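As a rough illustration of the first concept, LLM-informed token selection can be sketched as a top-k filter over the attention mass each visual token receives from the text tokens. This is a hypothetical sketch, not the paper's implementation: the function name, attention-matrix shape, and scoring rule (summing attention over text tokens) are all assumptions.

```python
import numpy as np

def select_visual_tokens(attn, keep_ratio=0.05):
    """Keep the visual tokens that received the most LLM attention.

    attn: (num_text_tokens, num_visual_tokens) attention weights,
          e.g. averaged over heads and layers (an assumption here).
    keep_ratio: fraction of visual tokens to retain; the paper reports
          discarding up to ~95%, i.e. keeping roughly 5%.
    Returns indices of the retained visual tokens, in temporal order.
    """
    # Score each visual token by the total attention it received.
    scores = attn.sum(axis=0)
    k = max(1, int(keep_ratio * scores.shape[0]))
    # Take the k highest-scoring tokens, then restore temporal order.
    top = np.argsort(scores)[-k:]
    return np.sort(top)

# Toy example: 4 text tokens attending over 20 visual tokens.
rng = np.random.default_rng(0)
attn = rng.random((4, 20))
kept = select_visual_tokens(attn, keep_ratio=0.25)
print(len(kept))  # 5 tokens retained
```

The retained tokens would then be carried forward recurrently when processing the next clip, so each clip is interpreted in the context of what the model already attended to.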
Related papers
- FocusGraph: Graph-Structured Frame Selection for Embodied Long Video Question Answering [0.6107667071306521]
We develop FocusGraph, a framework for question answering over long egocentric videos. We use a lightweight trainable Scene-Caption LLM Selector that selects query-relevant clips based on their graph-based captions. We then design a training-free Patch-wise Sparse-Flow Retention (PSFR) method to select tokens from the resulting sequence of clips.
arXiv Detail & Related papers (2026-03-04T18:14:00Z) - An Empirical Study for Representations of Videos in Video Question Answering via MLLMs [4.726627693005334]
Multimodal large language models have recently achieved remarkable progress in video question answering. It remains unclear which video representations are most effective for MLLMs, and how different modalities balance task accuracy against computational efficiency.
arXiv Detail & Related papers (2025-10-14T09:02:22Z) - From Frames to Clips: Efficient Key Clip Selection for Long-Form Video Understanding [43.82717677801915]
Video Large Language Models (VLMs) have achieved remarkable results on a variety of vision language tasks. Their practical use is limited by the "needle in a haystack" problem: the massive number of visual tokens produced from raw video frames exhausts the model's context window. We show that extending selection from isolated key frames to key clips, which are short, temporally coherent segments, improves video understanding.
arXiv Detail & Related papers (2025-10-02T17:43:01Z) - ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning [68.76048244253582]
We introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in video understanding. ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error. ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks.
arXiv Detail & Related papers (2025-05-21T12:29:40Z) - QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension [86.0749609778104]
We propose QuoTA, an ante-hoc training-free module that extends existing large video-language models. QuoTA strategically allocates frame-level importance scores based on query relevance. We decouple the query through Chain-of-Thought reasoning to facilitate more precise LVLM-based frame importance scoring.
arXiv Detail & Related papers (2025-03-11T17:59:57Z) - Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs [56.040198387038025]
We present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs.
Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks.
arXiv Detail & Related papers (2024-10-14T12:35:12Z) - Koala: Key frame-conditioned long video-LLM [70.52369588364992]
We propose a lightweight and self-supervised long video-LLM (Koala) to adapt pretrained vLLMs for generalizing to longer videos.
Our approach outperforms state-of-the-art large models by 3 - 6% in absolute accuracy across all tasks.
Surprisingly, we also empirically show that our approach not only helps a pretrained vLLM to understand long videos but also improves its accuracy on short-term action recognition.
arXiv Detail & Related papers (2024-04-05T18:33:04Z) - Long Video Understanding with Learnable Retrieval in Video-Language Models [48.3525267216256]
We introduce a learnable retrieval-based video-language model (R-VLM) for efficient long video understanding. Specifically, given a question (Query) and a long video, our model identifies and selects the most relevant K video chunks. This effectively reduces the number of video tokens, eliminates noise interference, and enhances system performance.
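The chunk-retrieval step described above can be sketched as a similarity ranking between a question embedding and per-chunk video embeddings. Note this is a hypothetical cosine-similarity version for illustration; R-VLM's actual retrieval module is learned, and the function name and shapes below are assumptions.

```python
import numpy as np

def retrieve_top_k_chunks(query_emb, chunk_embs, k=3):
    """Select the K video chunks most similar to the question embedding.

    query_emb: (d,) embedding of the question.
    chunk_embs: (num_chunks, d) one embedding per video chunk.
    Returns chunk indices sorted by descending similarity.
    """
    # Cosine similarity: normalize both sides, then take dot products.
    q = query_emb / np.linalg.norm(query_emb)
    c = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    sims = c @ q
    return np.argsort(sims)[::-1][:k]
```

Only the tokens of the retrieved chunks are then passed to the language model, which is what shrinks the token budget for long videos.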
arXiv Detail & Related papers (2023-12-08T09:48:36Z) - VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
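The CLIP-score-guided sampling mentioned above can be sketched as ranking frames by image-text similarity and keeping the top scorers instead of a uniform stride. This is an illustrative sketch only; VaQuitA's exact ranking procedure may differ, and the function name and feature shapes are assumptions.

```python
import numpy as np

def sample_frames_by_clip_score(frame_feats, text_feat, num_frames=8):
    """Rank frames by CLIP-style image-text similarity; keep the top ones.

    frame_feats: (T, d) image embeddings, one per video frame.
    text_feat: (d,) text embedding of the question.
    Returns indices of the selected frames in temporal order.
    """
    # Cosine similarity between each frame and the question.
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    scores = f @ t
    # Keep the num_frames best frames, restoring temporal order.
    top = np.argsort(scores)[-num_frames:]
    return np.sort(top)
```

Compared with uniform sampling, this concentrates the frame budget on content that is semantically related to the query, at the cost of one similarity pass over the frame features.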
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.