Adaptive Keyframe Sampling for Long Video Understanding
- URL: http://arxiv.org/abs/2502.21271v1
- Date: Fri, 28 Feb 2025 17:46:29 GMT
- Title: Adaptive Keyframe Sampling for Long Video Understanding
- Authors: Xi Tang, Jihao Qiu, Lingxi Xie, Yunjie Tian, Jianbin Jiao, Qixiang Ye
- Abstract summary: This paper presents a simple yet effective algorithm named Adaptive Keyframe Sampling (AKS). It inserts a plug-and-play keyframe-selection module that aims to maximize the useful information carried by a fixed number of video tokens. Experiments on two long video understanding benchmarks validate that AKS improves video QA accuracy by selecting informative keyframes.
- Score: 75.7837692594814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal large language models (MLLMs) have enabled open-world visual understanding by injecting visual input as extra tokens into large language models (LLMs) as contexts. However, when the visual input changes from a single image to a long video, the above paradigm encounters difficulty because the vast amount of video tokens has significantly exceeded the maximal capacity of MLLMs. Therefore, existing video-based MLLMs are mostly established upon sampling a small portion of tokens from input data, which can cause key information to be lost and thus produce incorrect answers. This paper presents a simple yet effective algorithm named Adaptive Keyframe Sampling (AKS). It inserts a plug-and-play module known as keyframe selection, which aims to maximize the useful information with a fixed number of video tokens. We formulate keyframe selection as an optimization involving (1) the relevance between the keyframes and the prompt, and (2) the coverage of the keyframes over the video, and present an adaptive algorithm to approximate the best solution. Experiments on two long video understanding benchmarks validate that Adaptive Keyframe Sampling improves video QA accuracy (beyond strong baselines) upon selecting informative keyframes. Our study reveals the importance of information pre-filtering in video-based MLLMs. Code is available at https://github.com/ncTimTang/AKS.
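The abstract formulates keyframe selection as a trade-off between (1) relevance of each frame to the prompt and (2) temporal coverage of the video. As a rough illustration of that trade-off (not the authors' exact adaptive algorithm; see the linked code for the real implementation), a greedy selector can score each candidate frame by its relevance plus a bonus for its distance to the nearest already-selected frame:

```python
import numpy as np

def select_keyframes(relevance, num_frames, coverage_weight=0.5):
    """Greedily pick frame indices balancing prompt relevance and
    temporal coverage. `relevance` is a 1-D sequence of per-frame
    relevance scores (e.g. frame-prompt similarity from any
    vision-language encoder); `coverage_weight` is a hypothetical knob.
    """
    n = len(relevance)
    positions = np.arange(n) / max(n - 1, 1)  # normalized timestamps in [0, 1]
    selected = []
    for _ in range(num_frames):
        best_idx, best_score = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            # Coverage term: distance to the nearest already-selected frame,
            # so picks spread out over the video instead of clustering.
            if selected:
                coverage = min(abs(positions[i] - positions[j]) for j in selected)
            else:
                coverage = 1.0
            score = relevance[i] + coverage_weight * coverage
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return sorted(selected)
```

With purely relevance-based selection the top-scoring frames often sit in one short segment; the coverage bonus forces the budgeted frames to also span the timeline, which is the intuition behind the paper's formulation.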
Related papers
- FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding [17.71123451197036]
complexity of video data and contextual processing limitations still hinder long-video comprehension.
We propose FiLA-Video, a novel framework that integrates multiple frames into a single representation.
FiLA-Video achieves superior efficiency and accuracy in long-video comprehension compared to existing methods.
arXiv Detail & Related papers (2025-04-29T03:09:46Z) - Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs.
We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - M-LLM Based Video Frame Selection for Efficient Video Understanding [60.93714759178143]
We propose a light-weight M-LLM-based frame selection method that adaptively selects frames more relevant to users' queries. The selected frames are then digested by a frozen downstream video M-LLM for visual reasoning and question answering.
arXiv Detail & Related papers (2025-02-27T01:44:13Z) - The Devil is in Temporal Token: High Quality Video Reasoning Segmentation [68.33080352141653]
Methods for video reasoning segmentation rely heavily on a single special token to represent the object in the video. We propose VRS-HQ, an end-to-end video reasoning segmentation approach. Our results highlight the strong temporal reasoning and segmentation capabilities of our method.
arXiv Detail & Related papers (2025-01-15T03:17:24Z) - Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering [11.244643114253773]
Video Question Answering (VideoQA) aims to answer natural language questions based on the information observed in videos.
We propose a novel weakly supervised framework to enforce the LMMs to reason out the answers with question-critical moments as visual inputs.
arXiv Detail & Related papers (2024-01-19T14:21:46Z) - A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [57.758863967770594]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion. We expose two limitations of the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z) - VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
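VaQuitA's data-level idea, ranking frames by CLIP text-frame similarity instead of sampling uniformly, reduces to a top-k selection once similarity scores are available. A minimal sketch, assuming the CLIP scores have already been computed elsewhere (the real system computes them with a CLIP encoder):

```python
import numpy as np

def clip_guided_sample(frame_scores, k):
    """Keep the k frames with the highest precomputed CLIP
    text-frame similarity, returned in temporal (index) order.
    `frame_scores` is a hypothetical per-frame score sequence."""
    top = np.argsort(frame_scores)[-k:]  # indices of the k largest scores
    return sorted(top.tolist())
```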
arXiv Detail & Related papers (2023-12-04T19:48:02Z) - Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models [41.12711820047315]
Video understanding models usually sample a set of frames or clips at random, regardless of the internal correlations between their visual contents or their relevance to the problem.
We propose two frame sampling strategies, namely most dominant frames (MDF) and most implied frames (MIF), to maximally preserve the frames that are most likely vital to the given questions.
arXiv Detail & Related papers (2023-07-09T14:54:30Z) - Key Frame Extraction with Attention Based Deep Neural Networks [0.0]
We propose a deep learning-based approach for key frame detection using a deep auto-encoder model with an attention layer.
The proposed method first extracts features from the video frames using the encoder part of the autoencoder, then clusters these features with the k-means algorithm to group similar frames together.
The method was evaluated on the TVSUM clustering video dataset and achieved a classification accuracy of 0.77, indicating a higher success rate than many existing methods.
arXiv Detail & Related papers (2023-06-21T15:09:37Z)
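The clustering step in the last entry above (features in, one representative frame per cluster out) can be sketched without the autoencoder, using a small Lloyd's k-means over whatever per-frame feature vectors are available; the feature extractor and iteration count here are stand-ins, not the paper's configuration:

```python
import numpy as np

def keyframes_by_clustering(features, k, iters=10, seed=0):
    """Cluster per-frame feature vectors (one row per frame) with a
    tiny Lloyd's k-means, then return the index of the frame nearest
    each centroid as the keyframe set, in temporal order."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # assign each frame to its nearest centroid
        dists = np.linalg.norm(features[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned frames
        for c in range(k):
            if (labels == c).any():
                centroids[c] = features[labels == c].mean(axis=0)
    dists = np.linalg.norm(features[:, None] - centroids[None], axis=2)
    return sorted({int(dists[:, c].argmin()) for c in range(k)})
```

Picking the frame nearest each centroid, rather than the centroid itself, guarantees the output is a set of real frames that can be displayed or fed downstream.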
This list is automatically generated from the titles and abstracts of the papers in this site.