Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration
- URL: http://arxiv.org/abs/2508.03337v2
- Date: Wed, 06 Aug 2025 07:41:10 GMT
- Title: Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration
- Authors: Shaoguang Wang, Jianxiang He, Yijie Xu, Ziyang Chen, Weiyu Guo, Hui Xiong
- Abstract summary: We observe a "less is more" phenomenon where excessive frames can paradoxically degrade performance due to context dilution. State-of-the-art keyframe selection methods still yield significant temporal redundancy, which we term "visual echoes". AFP employs an adaptive hierarchical clustering algorithm on a fused ResNet-50 and CLIP feature space to identify and merge these echoes into single representatives. Our full approach reduces required frames by up to 86.9% and total input tokens by up to 83.2%.
- Score: 21.69452489173625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The practical application of Multimodal Large Language Models (MLLMs) to Video Question Answering (Video-QA) is severely hindered by the high token cost of processing numerous video frames. While increasing the number of sampled frames is a common strategy, we observe a "less is more" phenomenon where excessive frames can paradoxically degrade performance due to context dilution. Concurrently, state-of-the-art keyframe selection methods, while effective, still yield significant temporal redundancy, which we term 'visual echoes'. To address these dual challenges, we propose Adaptive Frame-Pruning (AFP), a novel post-processing method that intelligently prunes the selected keyframes. AFP employs an adaptive hierarchical clustering algorithm on a fused ResNet-50 and CLIP feature space to identify and merge these echoes into single representatives. To compensate for information loss, we then introduce a lightweight, text-based semantic graph that provides critical context with minimal token overhead. In extensive experiments on the LongVideoBench and VideoMME benchmarks across multiple leading MLLMs, our full approach reduces required frames by up to 86.9% and total input tokens by up to 83.2%. Crucially, by providing a concise, high-quality set of frames, our method not only enhances efficiency but often improves accuracy over baselines that use more frames. The code will be released upon publication.
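The pruning step lends itself to a short illustration. Below is a minimal sketch, in the spirit of AFP, of clustering-based keyframe pruning: agglomerative clustering over per-frame embeddings, keeping one medoid frame per cluster. It assumes the fused ResNet-50 + CLIP features have already been extracted and L2-normalized upstream; the `prune_visual_echoes` helper, the fixed `distance_threshold` knob (the paper chooses its cut adaptively), and the medoid heuristic are illustrative assumptions, not the authors' implementation, and the text-based semantic graph is not modeled here.

```python
# Sketch: merge near-duplicate keyframes ("visual echoes") via hierarchical
# clustering over fused per-frame features, keeping one representative per cluster.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

def prune_visual_echoes(features: np.ndarray, distance_threshold: float = 0.15) -> list[int]:
    """Return indices of representative frames after pruning redundant ones.

    features: (num_frames, dim) array of L2-normalized fused frame embeddings
        (assumed to come from an upstream ResNet-50 + CLIP fusion step).
    distance_threshold: cosine-distance cut for the clustering; fixed here,
        chosen adaptively in the paper.
    """
    if len(features) <= 1:
        return list(range(len(features)))

    # Pairwise cosine distances between frames (condensed form).
    dists = pdist(features, metric="cosine")
    # Average-linkage agglomerative clustering, cut at the distance threshold.
    Z = linkage(dists, method="average")
    labels = fcluster(Z, t=distance_threshold, criterion="distance")

    square = squareform(dists)
    keep = []
    for cluster_id in np.unique(labels):
        members = np.where(labels == cluster_id)[0]
        # Representative = medoid: the frame closest to all others in its cluster.
        sub = square[np.ix_(members, members)]
        keep.append(int(members[sub.sum(axis=1).argmin()]))
    return sorted(keep)

# Example: 32 sampled keyframes with 1024-d fused features -> pruned subset.
frames = np.random.randn(32, 1024).astype(np.float32)
frames /= np.linalg.norm(frames, axis=1, keepdims=True)
print(prune_visual_echoes(frames))
```

The retained indices would then subsample the keyframe set before it is passed, together with the semantic-graph text, to the MLLM.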
Related papers
- E-VRAG: Enhancing Long Video Understanding with Resource-Efficient Retrieval Augmented Generation [8.441615871480858]
We propose E-VRAG, a novel and efficient video RAG framework for video understanding. We first apply a frame pre-filtering method based on hierarchical query decomposition to eliminate irrelevant frames. We then employ a lightweight VLM for frame scoring, further reducing computational costs at the model level.
arXiv Detail & Related papers (2025-08-03T02:09:54Z) - Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs [13.306662159600677]
We introduce Q-Frame, a novel approach for adaptive frame selection and multi-resolution adaptation. Q-Frame employs a training-free, plug-and-play strategy built on a text-image matching network such as CLIP. We demonstrate Q-Frame's effectiveness through extensive experiments on benchmark datasets. (A minimal, hedged sketch of this kind of CLIP-based frame scoring appears after this list.)
arXiv Detail & Related papers (2025-06-27T11:30:51Z) - Threading Keyframe with Narratives: MLLMs as Strong Long Video Comprehenders [62.58375366359421]
Long video understanding with Multimodal Large Language Models (MLLMs) remains a challenging problem. Traditional uniform sampling leads to the selection of irrelevant content. Post-training MLLMs on thousands of frames imposes a substantial computational burden. We propose threading keyframes with narratives (Nar-KFC) to facilitate effective and efficient long video perception.
arXiv Detail & Related papers (2025-05-30T03:04:28Z) - PMQ-VE: Progressive Multi-Frame Quantization for Video Enhancement [83.89668902758243]
Multi-frame video enhancement tasks aim to improve the spatial and temporal resolution and quality of video sequences. We propose Progressive Multi-Frame Quantization for Video Enhancement (PMQ-VE). This framework features a coarse-to-fine two-stage process: Backtracking-based Multi-Frame Quantization (BMFQ) and Progressive Multi-Teacher Distillation (PMTD).
arXiv Detail & Related papers (2025-05-18T07:10:40Z) - FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding [17.71123451197036]
The complexity of video data and contextual processing limitations still hinder long-video comprehension. We propose FiLA-Video, a novel framework that integrates multiple frames into a single representation. FiLA-Video achieves superior efficiency and accuracy in long-video comprehension compared to existing methods.
arXiv Detail & Related papers (2025-04-29T03:09:46Z) - Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the video LLM. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - Adaptive Keyframe Sampling for Long Video Understanding [75.7837692594814]
This paper presents Adaptive Keyframe Sampling (AKS), a simple yet effective plug-and-play module that aims to maximize the useful information carried by a fixed number of video tokens. Experiments on two long video understanding benchmarks validate that AKS improves video QA accuracy by selecting informative frames.
arXiv Detail & Related papers (2025-02-28T17:46:29Z) - A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [57.758863967770594]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion. We expose two limitations of this approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z) - Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models [41.12711820047315]
Video understanding models usually randomly sample a set of frames or clips, regardless of internal correlations between their visual contents or their relevance to the problem.
We propose two frame sampling strategies, namely most dominant frames (MDF) and most implied frames (MIF), to maximally preserve those frames that are most likely vital to the given questions.
arXiv Detail & Related papers (2023-07-09T14:54:30Z) - All at Once: Temporally Adaptive Multi-Frame Interpolation with Advanced Motion Modeling [52.425236515695914]
State-of-the-art methods are iterative solutions that interpolate one frame at a time.
This work introduces a true multi-frame interpolator.
It utilizes a pyramidal style network in the temporal domain to complete the multi-frame task in one shot.
arXiv Detail & Related papers (2020-07-23T02:34:39Z)
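Several of the selection methods listed above, such as Q-Frame and Adaptive Keyframe Sampling, score candidate frames against the question with a text-image matching model like CLIP and keep the most relevant ones. Below is a hedged sketch of that generic recipe, assuming the Hugging Face transformers CLIP API; the checkpoint name, the top-k heuristic, and the `top_k_frames` helper are illustrative rather than any paper's exact pipeline, and Q-Frame's multi-resolution adaptation step is omitted.

```python
# Sketch: query-aware frame selection by CLIP text-image similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint choice; any CLIP-style text-image matching model works.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def top_k_frames(question: str, frames: list[Image.Image], k: int = 8) -> list[int]:
    """Return indices (in temporal order) of the k frames most similar to the question."""
    inputs = processor(text=[question], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        # logits_per_image: (num_frames, 1) similarity scores against the single query.
        sims = model(**inputs).logits_per_image.squeeze(-1)
    keep = sims.topk(min(k, len(frames))).indices.tolist()
    return sorted(keep)
```

The selected frames (and, in AFP's case, the pruned representatives) are then the only visual tokens handed to the MLLM, which is where the frame and token savings reported above come from.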