Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs
- URL: http://arxiv.org/abs/2410.10441v2
- Date: Wed, 16 Oct 2024 09:45:06 GMT
- Title: Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs
- Authors: Kai Han, Jianyuan Guo, Yehui Tang, Wei He, Enhua Wu, Yunhe Wang
- Abstract summary: We present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs.
Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks.
- Score: 56.040198387038025
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language large models have achieved remarkable success in various multi-modal tasks, yet applying them to video understanding remains challenging due to the inherent complexity and computational demands of video data. While training-based video-LLMs deliver high performance, they often require substantial resources for training and inference. Conversely, training-free approaches offer a more efficient alternative by adapting pre-trained image LLMs for video tasks without additional training, but they face inference efficiency bottlenecks due to the large number of visual tokens generated from video frames. In this work, we present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs. The proposed framework decouples the spatial and temporal dimensions, performing temporal frame sampling and spatial RoI cropping based on task-specific prompts. Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks. Extensive experiments demonstrate that our approach achieves competitive results with significantly fewer tokens, offering an optimal trade-off between accuracy and computational efficiency compared to state-of-the-art video LLMs. The code will be available at https://github.com/contrastive/FreeVideoLLM.
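The abstract does not spell out how the task-specific prompt drives frame selection or the RoI, so the following is only a minimal illustrative sketch: it assumes per-frame embeddings from a frozen image-text encoder for the temporal step, and uses a plain center crop as a stand-in for the prompt-derived spatial region. All names (`prompt_guided_reduction`, `num_keep`, `roi_frac`) are hypothetical, not from the paper.

```python
import numpy as np

def prompt_guided_reduction(frames, frame_embs, prompt_emb, num_keep=4, roi_frac=0.5):
    """Illustrative sketch: sample prompt-relevant frames, then crop a RoI.

    frames:     (T, H, W, C) array of video frames
    frame_embs: (T, D) per-frame embeddings from a frozen image encoder
    prompt_emb: (D,) embedding of the task-specific prompt
    """
    # Temporal frame sampling: keep the frames whose embeddings have the
    # highest cosine similarity to the prompt embedding.
    sims = frame_embs @ prompt_emb
    sims = sims / (np.linalg.norm(frame_embs, axis=1) * np.linalg.norm(prompt_emb) + 1e-8)
    keep = np.sort(np.argsort(sims)[-num_keep:])  # restore temporal order

    # Spatial RoI cropping: a center crop stands in for the prompt-derived
    # region of interest; the actual method infers the RoI from the prompt.
    _, H, W, _ = frames.shape
    h, w = int(H * roi_frac), int(W * roi_frac)
    top, left = (H - h) // 2, (W - w) // 2
    return frames[keep, top:top + h, left:left + w, :]

# Toy usage: 32 frames of 224x224 video reduced to 4 cropped frames.
rng = np.random.default_rng(0)
frames = rng.random((32, 224, 224, 3), dtype=np.float32)
frame_embs = rng.standard_normal((32, 512)).astype(np.float32)
prompt_emb = rng.standard_normal(512).astype(np.float32)
print(prompt_guided_reduction(frames, frame_embs, prompt_emb).shape)  # (4, 112, 112, 3)
```

Going from 32 frames to 4 and halving each spatial side reduces the visual token count by roughly 32x before the LLM ever sees the video, which is the accuracy/efficiency trade-off the abstract describes.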
Related papers
- VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach that reduces vision compute by letting redundant vision tokens skip layers rather than decreasing the number of vision tokens.
Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video (a sketch of the routing idea follows this list).
arXiv Detail & Related papers (2024-08-29T17:21:58Z)
- Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models [29.825619120260164]
This paper addresses the challenge by leveraging the visual commonalities between images and videos to evolve image-LVLMs into video-LVLMs.
We present a cost-effective video-LVLM that enhances model architecture, introduces innovative training strategies, and identifies the most effective types of video instruction data.
arXiv Detail & Related papers (2024-06-12T09:22:45Z)
- ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
How to effectively encode and understand videos in video-based dialogue systems remains an open problem.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning [15.998149438353133]
We propose a two-stage retrieval architecture for text-to-video retrieval.
In the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning.
In the retrieval phase, we use coarse-grained video representations for fast recall of top-k candidates, which are then reranked by fine-grained video representations (a sketch of this two-stage flow follows this list).
arXiv Detail & Related papers (2024-01-01T08:54:18Z)
- Harvest Video Foundation Models via Efficient Post-Pretraining [67.30842563833185]
We propose an efficient framework to harvest video foundation models from image ones.
Our method is intuitively simple: we randomly drop input video patches and mask out input text during the post-pretraining procedure (sketched after this list).
Our method achieves state-of-the-art performances, which are comparable to some heavily pretrained video foundation models.
arXiv Detail & Related papers (2023-10-30T14:06:16Z)
- VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
VidCoM is a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools.
Its InsOVER algorithm locates the corresponding video events via an efficient Hungarian matching between decompositions of linguistic instructions and video events (sketched after this list).
arXiv Detail & Related papers (2023-10-16T17:05:56Z)
- MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling [7.737755720567113]
This paper proposes MuLTI, a highly accurate and efficient video-and-language understanding model.
We design a Text-Guided MultiWay-Sampler based on adapt-pooling residual mapping and self-attention modules.
We also propose a new pretraining task named Multiple Choice Modeling.
arXiv Detail & Related papers (2023-03-10T05:22:39Z)
- Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning, a framework for directly training high-quality video recognition models on top of frozen CLIP features.
arXiv Detail & Related papers (2022-08-06T17:38:25Z)
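For VideoLLM-MoD's "skipping layers" idea above, here is a minimal mixture-of-depths-style sketch: a router scores the vision tokens and only a top fraction enters the layer's computation, while the rest pass through on the residual path. The router weights, capacity ratio, and stand-in layer are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def mod_layer(tokens, router_w, layer_fn, capacity=0.25):
    """Mixture-of-depths-style sketch: only top-scoring tokens are computed.

    tokens:   (N, D) vision-token hidden states
    router_w: (D,) learned router weights (random here)
    layer_fn: the expensive per-layer computation, e.g. attention + MLP
    """
    n_keep = max(1, int(len(tokens) * capacity))
    scores = tokens @ router_w                     # per-token routing score
    picked = np.argsort(scores)[-n_keep:]          # tokens that get computed
    out = tokens.copy()                            # the rest skip the layer
    out[picked] = layer_fn(tokens[picked])
    return out

rng = np.random.default_rng(0)
tokens = rng.standard_normal((1024, 64))
router_w = rng.standard_normal(64)
# A stand-in "layer": in a real model this would be self-attention + MLP.
updated = mod_layer(tokens, router_w, layer_fn=lambda x: x + 0.1 * np.tanh(x))
print(updated.shape)  # (1024, 64), but only 256 tokens paid for compute
```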
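For the coarse-to-fine retrieval entry above, a sketch of the two-stage flow under simplifying assumptions: one coarse and one fine embedding per video (the paper's fine-grained representations are richer than a single vector), with names like `two_stage_retrieval` invented for illustration.

```python
import numpy as np

def two_stage_retrieval(query, coarse_db, fine_db, k=100):
    """Sketch: fast coarse recall of top-k candidates, then fine reranking.

    query:     (D,) text embedding
    coarse_db: (N, D) cheap per-video embeddings for fast recall
    fine_db:   (N, D) expensive fine-grained embeddings for reranking
    """
    # Stage 1: coarse recall over the whole corpus.
    candidates = np.argsort(coarse_db @ query)[-k:]
    # Stage 2: rerank only the k recalled candidates with fine-grained scores.
    fine_scores = fine_db[candidates] @ query
    return candidates[np.argsort(fine_scores)[::-1]]

rng = np.random.default_rng(0)
query = rng.standard_normal(256)
coarse_db = rng.standard_normal((10_000, 256))
fine_db = rng.standard_normal((10_000, 256))
print(two_stage_retrieval(query, coarse_db, fine_db, k=50)[:5])  # top-5 video ids
```

The point of the design is that the expensive fine-grained scoring touches only k candidates instead of the full corpus.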
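For the post-pretraining entry above, a sketch of the corruption step alone; the dropping and masking ratios (75% and 15%) and all names are assumed for illustration, and the paper's training losses are not reproduced.

```python
import numpy as np

def corrupt_batch(video_patches, text_ids, mask_id, drop_p=0.75, mask_p=0.15, rng=None):
    """Sketch: randomly drop video patch tokens and mask paired text tokens.

    video_patches: (P, D) patch tokens for one clip
    text_ids:      (L,) token ids of the paired caption
    """
    rng = rng or np.random.default_rng()
    keep = rng.random(len(video_patches)) > drop_p   # drop most patches
    masked_ids = text_ids.copy()
    mask = rng.random(len(text_ids)) < mask_p        # mask some text tokens
    masked_ids[mask] = mask_id
    return video_patches[keep], masked_ids

rng = np.random.default_rng(0)
patches = rng.standard_normal((196, 768))
text_ids = rng.integers(0, 30_000, size=32)
kept, masked = corrupt_batch(patches, text_ids, mask_id=103, rng=rng)
print(kept.shape, int((masked == 103).sum()))
```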
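For VidCoM's InsOVER matching above, a sketch of just the Hungarian step, using scipy's `linear_sum_assignment` on a negative-cosine cost matrix; the instruction decomposition and event detection that produce these embeddings are assumed inputs.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_instructions_to_events(instr_embs, event_embs):
    """Sketch: Hungarian matching between instruction clauses and video events.

    instr_embs: (M, D) embeddings of decomposed instruction clauses
    event_embs: (N, D) embeddings of detected video events
    Returns (clause_idx, event_idx) pairs minimizing total matching cost.
    """
    a = instr_embs / np.linalg.norm(instr_embs, axis=1, keepdims=True)
    b = event_embs / np.linalg.norm(event_embs, axis=1, keepdims=True)
    cost = -(a @ b.T)  # negative cosine similarity as assignment cost
    return linear_sum_assignment(cost)

rng = np.random.default_rng(0)
rows, cols = match_instructions_to_events(
    rng.standard_normal((3, 128)), rng.standard_normal((5, 128)))
print(list(zip(rows, cols)))  # each clause paired with one event
```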