Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs
- URL: http://arxiv.org/abs/2506.22139v3
- Date: Tue, 22 Jul 2025 07:42:31 GMT
- Title: Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs
- Authors: Shaojie Zhang, Jiahui Yang, Jianqin Yin, Zhenbo Luo, Jian Luan
- Abstract summary: We introduce Q-Frame, a novel approach for adaptive frame selection and multi-resolution scaling. Q-Frame employs a training-free, plug-and-play strategy generated by a text-image matching network like CLIP, utilizing the Gumbel-Max trick for efficient frame selection. We demonstrate Q-Frame's effectiveness through extensive experiments on benchmark datasets.
- Score: 13.306662159600677
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant success in visual understanding tasks. However, challenges persist in adapting these models for video comprehension due to the large volume of data and temporal complexity. Existing Video-LLMs using uniform frame sampling often struggle to capture the query-related crucial spatiotemporal clues of videos effectively. In this paper, we introduce Q-Frame, a novel approach for adaptive frame selection and multi-resolution scaling tailored to the video's content and the specific query. Q-Frame employs a training-free, plug-and-play strategy generated by a text-image matching network like CLIP, utilizing the Gumbel-Max trick for efficient frame selection. Q-Frame allows Video-LLMs to process more frames without exceeding computational limits, thereby preserving critical temporal and spatial information. We demonstrate Q-Frame's effectiveness through extensive experiments on benchmark datasets, including MLVU, LongVideoBench, and Video-MME, illustrating its superiority over existing methods and its applicability across various video understanding tasks.
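The abstract names two concrete mechanisms: scoring frames against the text query with a CLIP-like text-image matcher, and using the Gumbel-Max trick to select frames from those scores. The sketch below shows one plausible way to combine them; the function names, the assumption of precomputed CLIP embeddings, and the Gumbel-Top-k variant for picking k frames are illustrative assumptions, not the authors' implementation (the multi-resolution scaling step is omitted).

```python
# Minimal sketch of query-aware frame selection in the spirit of Q-Frame.
# Assumes CLIP frame/query embeddings are already computed; names are illustrative.
import torch
import torch.nn.functional as F

def clip_query_scores(frame_feats: torch.Tensor, query_feat: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between each frame embedding (num_frames, dim)
    and the query embedding (dim,)."""
    frame_feats = F.normalize(frame_feats, dim=-1)
    query_feat = F.normalize(query_feat, dim=-1)
    return frame_feats @ query_feat  # (num_frames,)

def gumbel_topk_select(scores: torch.Tensor, k: int, tau: float = 1.0) -> torch.Tensor:
    """Select k frame indices without replacement via the Gumbel-Max
    (Gumbel-Top-k) trick: perturb the scaled scores with i.i.d. Gumbel
    noise and keep the top-k."""
    u = torch.rand_like(scores).clamp(1e-10, 1.0 - 1e-10)
    gumbel = -torch.log(-torch.log(u))
    return torch.topk(scores / tau + gumbel, k).indices

if __name__ == "__main__":
    num_frames, dim, k = 256, 512, 32
    frame_feats = torch.randn(num_frames, dim)  # stand-in for CLIP image embeddings
    query_feat = torch.randn(dim)               # stand-in for CLIP text embedding
    scores = clip_query_scores(frame_feats, query_feat)
    keep = gumbel_topk_select(scores, k).sort().values
    print("selected frame indices:", keep.tolist())
```

Because the selection is stochastic, repeated runs sample different but query-biased subsets; in the paper's setting the chosen frames would then be passed to the Video-LLM together with the multi-resolution scaling described in the abstract, which this sketch leaves out.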
Related papers
- Enhancing Long Video Question Answering with Scene-Localized Frame Grouping [19.83545369186771]
Current Multimodal Large Language Models (MLLMs) often perform poorly in long video understanding. We propose a new scenario under the video question-answering task, SceneQA. We introduce a novel method called SLFG to combine individual frames into semantically coherent scene frames.
arXiv Detail & Related papers (2025-08-05T02:28:58Z) - Moment Sampling in Video LLMs for Long-Form Video QA [22.638644170177013]
"moment sampling" is a model-agnostic approach that enables the model to select the most relevant frames according to the context of the question.<n>By focusing on the frames most pertinent to the given question, our method enhances long-form VideoQA performance in Video LLMs.
arXiv Detail & Related papers (2025-06-18T03:23:56Z) - HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding [14.464718780172582]
We introduce HierarQ, a task-aware hierarchical Q-Former-based framework that sequentially processes frames to bypass the need for frame sampling. We introduce a lightweight two-stream language-guided feature modulator to incorporate task awareness in video understanding. Extensive evaluations on 10 video benchmarks across video understanding, question answering, and captioning tasks demonstrate HierarQ's state-of-the-art performance.
arXiv Detail & Related papers (2025-03-11T16:21:23Z) - M-LLM Based Video Frame Selection for Efficient Video Understanding [60.93714759178143]
We propose a lightweight M-LLM-based frame selection method that adaptively selects frames that are more relevant to users' queries. The selected frames are then digested by a frozen downstream video M-LLM for visual reasoning and question answering.
arXiv Detail & Related papers (2025-02-27T01:44:13Z) - Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries [50.47265863322891]
Video Question Answering (Video QA) is a challenging video understanding task that requires models to comprehend entire videos. Recent advancements in Multimodal Large Language Models (MLLMs) have transformed video QA by leveraging their exceptional commonsense reasoning capabilities. We propose T-Former, a novel temporal modeling method that creates a question-guided temporal bridge between frame-wise visual perception and the reasoning capabilities of LLMs.
arXiv Detail & Related papers (2024-12-26T17:53:14Z) - VidCtx: Context-aware Video Question Answering with Image Models [15.1350316858766]
We introduce VidCtx, a novel training-free VideoQA framework which integrates both visual information from input frames and textual descriptions of other frames. Experiments show that VidCtx achieves competitive performance among approaches that rely on open models.
arXiv Detail & Related papers (2024-12-23T09:26:38Z) - VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [61.54044967253421]
We introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence.
Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o.
We propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM.
arXiv Detail & Related papers (2024-11-22T08:33:36Z) - Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA [40.21221568678641]
Long-form videos that span wide temporal intervals are highly information-redundant. All information necessary to generate a correct response can often be contained within a small subset of frames. Recent literature explores the use of large language models in LVQA benchmarks, achieving exceptional performance.
arXiv Detail & Related papers (2024-06-13T17:59:16Z) - MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
arXiv Detail & Related papers (2022-12-19T15:05:40Z) - MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization [61.69587867308656]
We propose a multimodal hierarchical shot-aware convolutional network, denoted as MHSCNet, to enhance the frame-wise representation.
Based on the learned shot-aware representations, MHSCNet can predict the frame-level importance score in the local and global view of the video.
arXiv Detail & Related papers (2022-04-18T14:53:33Z) - DeepQAMVS: Query-Aware Hierarchical Pointer Networks for Multi-Video Summarization [127.16984421969529]
We introduce a novel Query-Aware Hierarchical Pointer Network for Multi-Video Summarization, termed DeepQAMVS.
DeepQAMVS is trained with reinforcement learning, incorporating rewards that capture representativeness, diversity, query-adaptability, and temporal coherence (see the illustrative sketch after this list).
We achieve state-of-the-art results on the MVS1K dataset, with inference time scaling linearly with the number of input video frames.
arXiv Detail & Related papers (2021-05-13T17:33:26Z)
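The DeepQAMVS entry above lists four reward terms (representativeness, diversity, query-adaptability, temporal coherence) used during reinforcement-learning training. As a rough illustration of how such terms could be folded into a single scalar reward, the sketch below uses cosine-similarity stand-ins and a fixed weighting; the term definitions and weights are assumptions for illustration, not the paper's formulation.

```python
# Illustrative composite reward in the spirit of DeepQAMVS' reward terms.
# All term definitions and weights below are assumptions, not the paper's.
import torch
import torch.nn.functional as F

def composite_reward(selected_feats: torch.Tensor,   # (k, d) chosen summary frames
                     all_feats: torch.Tensor,        # (n, d) all candidate frames
                     query_feat: torch.Tensor,       # (d,)   user query embedding
                     selected_times: torch.Tensor,   # (k,)   sorted timestamps in [0, 1]
                     weights=(0.25, 0.25, 0.25, 0.25)) -> torch.Tensor:
    sel = F.normalize(selected_feats, dim=-1)
    allf = F.normalize(all_feats, dim=-1)
    q = F.normalize(query_feat, dim=-1)
    k = sel.shape[0]  # assumes k > 1

    # Representativeness: every candidate frame should have a similar selected frame.
    representativeness = (allf @ sel.T).max(dim=1).values.mean()
    # Diversity: penalize redundancy among selected frames (mean off-diagonal similarity).
    pairwise = sel @ sel.T
    diversity = 1.0 - (pairwise.sum() - pairwise.trace()) / (k * (k - 1))
    # Query-adaptability: selected frames should match the query.
    query_adapt = (sel @ q).mean()
    # Temporal coherence: discourage large jumps between consecutive selections.
    coherence = 1.0 - selected_times.diff().abs().mean()

    w = torch.tensor(weights)
    return w[0] * representativeness + w[1] * diversity + w[2] * query_adapt + w[3] * coherence
```

In an RL setup this scalar would serve as the return for a policy-gradient update of the pointer network; the sketch covers only the reward shaping, not the training loop.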