FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding
- URL: http://arxiv.org/abs/2504.17447v1
- Date: Thu, 24 Apr 2025 11:19:18 GMT
- Title: FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding
- Authors: De-An Huang, Subhashree Radhakrishnan, Zhiding Yu, Jan Kautz
- Abstract summary: We propose Frame Selection Augmented Generation (FRAG) to process long inputs without long context LMMs. The core of the selection process is done by scoring each frame independently, which does not require long context processing. We show that FRAG consistently improves the performance and achieves state-of-the-art performances for both long video and long document understanding.
- Score: 70.56829394569938
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There has been impressive progress in Large Multimodal Models (LMMs). Recent works extend these models to long inputs, including multi-page documents and long videos. However, the model size and performance of these long context models are still limited due to the computational cost in both training and inference. In this work, we explore an orthogonal direction and process long inputs without long context LMMs. We propose Frame Selection Augmented Generation (FRAG), where the model first selects relevant frames within the input, and then only generates the final outputs based on the selected frames. The core of the selection process is done by scoring each frame independently, which does not require long context processing. The frames with the highest scores are then selected by a simple Top-K selection. We show that this frustratingly simple framework is applicable to both long videos and multi-page documents using existing LMMs without any fine-tuning. We consider two models, LLaVA-OneVision and InternVL2, in our experiments and show that FRAG consistently improves the performance and achieves state-of-the-art performances for both long video and long document understanding. For videos, FRAG substantially improves InternVL2-76B by 5.8% on MLVU and 3.7% on Video-MME. For documents, FRAG achieves over 20% improvements on MP-DocVQA compared with recent LMMs specialized in long document understanding. Code is available at: https://github.com/NVlabs/FRAG
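The selection-then-generation procedure described in the abstract is simple enough to sketch in a few lines. Below is a minimal illustration of the idea, assuming a generic per-frame scoring call and a generic answer-generation call; `score_frame` and `generate_answer` are hypothetical placeholders for an underlying LMM (e.g. LLaVA-OneVision or InternVL2), not the released NVlabs/FRAG API.

```python
from typing import Callable, List, Sequence


def frag_answer(
    frames: Sequence,                               # decoded video frames or document pages
    query: str,                                     # user question
    score_frame: Callable[[object, str], float],    # LMM scores one frame against the query
    generate_answer: Callable[[List, str], str],    # LMM answers from a few selected frames
    top_k: int = 8,
) -> str:
    """Score each frame independently, keep the Top-K, then generate."""
    # 1. Independent per-frame scoring: no long-context processing is needed,
    #    so the cost grows linearly with the number of frames or pages.
    scores = [score_frame(frame, query) for frame in frames]

    # 2. Simple Top-K selection; re-sorting the chosen indices keeps the
    #    original temporal (or page) order for the generation step.
    top_by_score = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)[:top_k]
    selected = [frames[i] for i in sorted(top_by_score)]

    # 3. Generate the final output from the selected frames only.
    return generate_answer(selected, query)
```

Because each frame is scored independently, the selection step scales linearly with input length and is trivially parallelizable, which is what lets the approach handle long videos and multi-page documents without a long-context model and without fine-tuning the underlying LMM.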
Related papers
- BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding [51.49345400300556]
Large video-language models (VLMs) have demonstrated promising progress in various video understanding tasks.
Traditional approaches, such as uniform frame sampling, often inevitably allocate resources to irrelevant content.
We introduce BOLT, a method to BOost Large VLMs without additional Training through a comprehensive study of frame selection strategies.
arXiv Detail & Related papers (2025-03-27T13:18:40Z)
- M-LLM Based Video Frame Selection for Efficient Video Understanding [60.93714759178143]
We propose a light-weight M-LLM-based frame selection method that adaptively selects the frames most relevant to users' queries. The selected frames are then digested by a frozen downstream video M-LLM for visual reasoning and question answering.
arXiv Detail & Related papers (2025-02-27T01:44:13Z)
- Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing [52.050036778325094]
Video-Ma$^2$mba is a novel architecture that incorporates State Space Models (SSMs) within the Mamba-2 framework. Our approach significantly reduces the memory footprint compared to standard gradient checkpointing. By maintaining a detailed capture of temporal dynamics, our model improves the accuracy and relevance of responses in long video understanding tasks.
arXiv Detail & Related papers (2024-11-29T04:12:13Z)
- VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges [42.555895949250704]
VideoLLaMB is a novel framework that utilizes temporal memory tokens within bridge layers to allow for the encoding of entire video sequences.
The SceneTilling algorithm segments videos into independent semantic units to preserve semantic integrity.
In terms of efficiency, VideoLLaMB, trained on 16 frames, supports up to 320 frames on a single Nvidia A100 GPU.
arXiv Detail & Related papers (2024-09-02T08:52:58Z)
- Long Context Transfer from Language to Vision [74.78422371545716]
Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos.
In this paper, we approach this problem from the perspective of the language model.
By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training.
arXiv Detail & Related papers (2024-06-24T17:58:06Z)
- Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA [40.21221568678641]
Long-form videos that span wide temporal intervals are highly information-redundant. All the information necessary to generate a correct response can often be contained within a small subset of frames. Recent literature explores the use of large language models in LVQA benchmarks, achieving exceptional performance.
arXiv Detail & Related papers (2024-06-13T17:59:16Z)
- MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding [66.56100008577134]
This study focuses on designing an efficient and effective model for long-term video understanding.
We propose to process videos in an online manner and store past video information in a memory bank.
Our model achieves state-of-the-art performance across multiple datasets.
arXiv Detail & Related papers (2024-04-08T17:59:24Z)