Streaming Video Question-Answering with In-context Video KV-Cache Retrieval
- URL: http://arxiv.org/abs/2503.00540v1
- Date: Sat, 01 Mar 2025 15:53:33 GMT
- Title: Streaming Video Question-Answering with In-context Video KV-Cache Retrieval
- Authors: Shangzhe Di, Zhelun Yu, Guanghao Zhang, Haoyuan Li, Tao Zhong, Hao Cheng, Bolin Li, Wanggui He, Fangxun Shu, Hao Jiang,
- Abstract summary: We propose ReKV, a training-free approach that enables efficient streaming video question-answering (StreamingVQA). Our approach analyzes long videos in a streaming manner, allowing for prompt responses as soon as user queries are received.
- Score: 10.990431921021585
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose ReKV, a novel training-free approach that enables efficient streaming video question-answering (StreamingVQA), by seamlessly integrating with existing Video Large Language Models (Video-LLMs). Traditional VideoQA systems struggle with long videos, as they must process entire videos before responding to queries, and repeat this process for each new question. In contrast, our approach analyzes long videos in a streaming manner, allowing for prompt responses as soon as user queries are received. Building on a common Video-LLM, we first incorporate a sliding-window attention mechanism, ensuring that input frames attend to a limited number of preceding frames, thereby reducing computational overhead. To prevent information loss, we store processed video key-value caches (KV-Caches) in RAM and disk, reloading them into GPU memory as needed. Additionally, we introduce a retrieval method that leverages an external retriever or the parameters within Video-LLMs to retrieve only query-relevant KV-Caches, ensuring both efficiency and accuracy in question answering. ReKV enables the separation of video encoding and question-answering across different processes and GPUs, significantly enhancing the efficiency of StreamingVQA. Through comprehensive experimentation, we validate the efficacy and practicality of our approach, which significantly boosts efficiency and enhances applicability over existing VideoQA models.
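To make the mechanism concrete, below is a minimal, hedged sketch of the KV-cache retrieval idea described in the abstract: video is encoded chunk by chunk with bounded (sliding-window) attention, each chunk's KV-cache is kept off-GPU, and at question time only query-relevant caches are reloaded. The `VideoLLM`-style interface (`encode_window`, `answer_with_kv`) and the shared `embed` function are illustrative assumptions, not the paper's actual API.

```python
"""Hedged sketch of query-relevant KV-cache retrieval in the spirit of ReKV.

Assumptions (not from the paper): a model wrapper exposing `encode_window`
(returns a KV-cache for a window of frames) and `answer_with_kv` (decodes an
answer given a list of KV-caches), plus an `embed` function mapping frames or
text into a shared vector space (e.g., a pooled CLIP-style embedding).
"""
import numpy as np


class KVCacheStore:
    """Holds per-window KV-caches in RAM/disk, indexed by a pooled embedding."""

    def __init__(self):
        self.keys = []     # one embedding vector per encoded window
        self.caches = []   # opaque KV-cache blobs kept off-GPU

    def add(self, window_embedding, kv_cache):
        self.keys.append(np.asarray(window_embedding, dtype=np.float32))
        self.caches.append(kv_cache)

    def retrieve(self, query_embedding, top_k=8):
        """Return the KV-caches of the top_k windows most similar to the query."""
        keys = np.stack(self.keys)
        q = np.asarray(query_embedding, dtype=np.float32)
        sims = keys @ q / (np.linalg.norm(keys, axis=1) * np.linalg.norm(q) + 1e-8)
        best = np.argsort(-sims)[:top_k]
        return [self.caches[i] for i in sorted(best)]  # preserve temporal order


def stream_and_answer(frames, questions, model, embed, window=16):
    """questions: list of (arrival_frame_index, question_text) pairs."""
    store, answers, pending = KVCacheStore(), [], list(questions)
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        # Sliding-window attention: each chunk attends only to a bounded local
        # context, so per-chunk encoding cost stays roughly constant.
        kv_cache = model.encode_window(chunk)
        store.add(embed(chunk), kv_cache)
        # Answer any question that has arrived by the end of this chunk.
        ready = [q for q in pending if q[0] < start + window]
        pending = [q for q in pending if q[0] >= start + window]
        for _, text in ready:
            relevant = store.retrieve(embed(text))
            answers.append(model.answer_with_kv(text, relevant))
    return answers
```

Because the store only needs embeddings and opaque cache blobs, video encoding and question answering can run in separate processes or on separate GPUs, mirroring the separation described in the abstract.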
Related papers
- Long-VMNet: Accelerating Long-Form Video Understanding via Fixed Memory [5.311777874655448]
Long-Video Memory Network, Long-VMNet, is a novel video understanding method.
Long-VMNet achieves improved efficiency by leveraging a neural sampler that identifies discriminative tokens.
Our results on the Rest-ADL dataset demonstrate an 18x -- 75x improvement in inference times for long-form video retrieval and answering questions.
arXiv Detail & Related papers (2025-03-17T20:25:41Z)
- Adaptive Caching for Faster Video Generation with Diffusion Transformers [52.73348147077075]
Diffusion Transformers (DiTs) rely on larger models and heavier attention mechanisms, resulting in slower inference speeds.
We introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache).
We also introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, controlling the compute allocation based on motion content.
arXiv Detail & Related papers (2024-11-04T18:59:44Z)
- Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams [78.72965584414368]
We present Flash-VStream, a video-language model that simulates human memory mechanisms.
Compared to existing models, Flash-VStream achieves significant reductions in inference latency and VRAM consumption.
We propose VStream-QA, a novel question answering benchmark specifically designed for online video streaming understanding.
arXiv Detail & Related papers (2024-06-12T11:07:55Z)
- Streaming Long Video Understanding with Large Language Models [83.11094441893435]
VideoStreaming is an advanced vision-language large model (VLLM) for video understanding.
It capably understands arbitrary-length video with a constant number of video streaming tokens, encoded in a streaming fashion and adaptively selected.
Our model achieves superior performance and higher efficiency on long video benchmarks.
arXiv Detail & Related papers (2024-05-25T02:22:09Z)
- Deep Unsupervised Key Frame Extraction for Efficient Video Classification [63.25852915237032]
This work presents an unsupervised method to retrieve key frames, combining a Convolutional Neural Network (CNN) with Temporal Segment Density Peaks Clustering (TSDPC).
The proposed TSDPC is a generic and powerful framework with two advantages over previous works; one is that it determines the number of key frames automatically (see the hedged sketch after this list).
Furthermore, a Long Short-Term Memory network (LSTM) is added on top of the CNN to further improve classification performance.
arXiv Detail & Related papers (2022-11-12T20:45:35Z)
- Speeding Up Action Recognition Using Dynamic Accumulation of Residuals in Compressed Domain [2.062593640149623]
Temporal redundancy and the sheer size of raw videos are two of the most common problems for video processing algorithms.
This paper presents an approach that uses residual data available directly in compressed videos, obtainable through a lightweight partial decoding procedure.
Applying neural networks exclusively to accumulated residuals in the compressed domain accelerates inference, while classification results remain highly competitive with raw-video approaches.
arXiv Detail & Related papers (2022-09-29T13:08:49Z)
- Self-Conditioned Probabilistic Learning of Video Rescaling [70.10092286301997]
We propose a self-conditioned probabilistic framework for video rescaling to learn the paired downscaling and upscaling procedures simultaneously.
We decrease the entropy of the information lost during downscaling by maximizing its conditional probability given strong spatio-temporal prior information.
We extend the framework to a lossy video compression system, in which a gradient estimator for non-differentiable industrial lossy codecs is proposed.
arXiv Detail & Related papers (2021-07-24T15:57:15Z)
- Self-supervised Video Retrieval Transformer Network [10.456881328982586]
We propose SVRTN, which applies self-supervised training to learn video representations from unlabeled data.
It exploits a transformer structure to aggregate frame-level features into clip-level features, reducing both storage space and search complexity.
It learns complementary and discriminative information from interactions among clip frames, and acquires invariance to frame permutation and missing frames, supporting more flexible retrieval.
arXiv Detail & Related papers (2021-04-16T09:43:45Z)
- Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)
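As referenced in the key-frame extraction entry above, here is a minimal sketch of density-peaks clustering applied to per-frame features. The feature extractor, the distance-cutoff heuristic, and the automatic selection rule are illustrative assumptions, not the TSDPC paper's exact procedure.

```python
"""Hedged sketch of density-peaks key-frame selection over per-frame features."""
import numpy as np


def density_peaks_keyframes(features, cutoff_percentile=2.0):
    """features: (num_frames, dim) array of per-frame embeddings (e.g., CNN features)."""
    features = np.asarray(features, dtype=np.float32)
    n = len(features)

    # Pairwise Euclidean distances between frame features.
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))

    # Cutoff distance d_c: a small percentile of all pairwise distances.
    triu = dist[np.triu_indices(n, k=1)]
    d_c = max(np.percentile(triu, cutoff_percentile), 1e-8)

    # Local density rho_i: Gaussian-kernel neighbour count (self-term removed).
    rho = np.exp(-(dist / d_c) ** 2).sum(1) - 1.0

    # delta_i: distance to the nearest frame with higher density.
    delta = np.zeros(n)
    order = np.argsort(-rho)
    delta[order[0]] = dist[order[0]].max()
    for rank in range(1, n):
        i = order[rank]
        delta[i] = dist[i, order[:rank]].min()

    # Frames with outlying gamma = rho * delta become key frames; the
    # threshold below is an illustrative automatic selection rule.
    gamma = rho * delta
    threshold = gamma.mean() + gamma.std()
    return np.sort(np.where(gamma > threshold)[0])


# Usage: feats = per_frame_embedding(frames); key_idx = density_peaks_keyframes(feats)
```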