How Important are Videos for Training Video LLMs?
- URL: http://arxiv.org/abs/2506.06928v1
- Date: Sat, 07 Jun 2025 21:32:19 GMT
- Title: How Important are Videos for Training Video LLMs?
- Authors: George Lydakis, Alexander Hermans, Ali Athar, Daan de Geus, Bastian Leibe
- Abstract summary: We present findings indicating that Video LLMs are more capable of temporal reasoning after image-only training than one would assume. We introduce a simple finetuning scheme involving sequences of annotated images and questions targeting temporal capabilities. This suggests that current models make suboptimal use of the rich temporal features found in real video.
- Score: 55.965474658745315
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Research into Video Large Language Models (LLMs) has progressed rapidly, with numerous models and benchmarks emerging in just a few years. Typically, these models are initialized with a pretrained text-only LLM and finetuned on both image- and video-caption datasets. In this paper, we present findings indicating that Video LLMs are more capable of temporal reasoning after image-only training than one would assume, and that improvements from video-specific training are surprisingly small. Specifically, we show that image-trained versions of two LLMs trained with the recent LongVU algorithm perform significantly above chance level on TVBench, a temporal reasoning benchmark. Additionally, we introduce a simple finetuning scheme involving sequences of annotated images and questions targeting temporal capabilities. This baseline results in temporal reasoning performance close to, and occasionally higher than, what is achieved by video-trained LLMs. This suggests suboptimal utilization of rich temporal features found in real video by current models. Our analysis motivates further research into the mechanisms that allow image-trained LLMs to perform temporal reasoning, as well as into the bottlenecks that render current video training schemes inefficient.
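The finetuning scheme mentioned in the abstract is only described at a high level. As a rough illustration of what training on "sequences of annotated images and questions targeting temporal capabilities" could look like, the following is a minimal, hypothetical data-construction sketch; the sample format, field names, and question template are assumptions for illustration, not the authors' actual pipeline.

```python
import random

# Hypothetical sketch: build a temporal-reasoning finetuning sample from a
# sequence of individually annotated images (e.g., per-frame captions).
def build_temporal_sample(frames):
    """frames: ordered list of dicts like {"image_path": str, "caption": str}."""
    # Pick two distinct frames and ask which of their annotated events came first.
    i, j = sorted(random.sample(range(len(frames)), 2))
    event_a, event_b = frames[i]["caption"], frames[j]["caption"]
    # Randomize the order in which the two events are mentioned in the question,
    # so the answer is not trivially the first-mentioned option.
    if random.random() < 0.5:
        question = f"Which happens first: (A) {event_a} or (B) {event_b}?"
        answer = "A"
    else:
        question = f"Which happens first: (A) {event_b} or (B) {event_a}?"
        answer = "B"
    return {
        "images": [f["image_path"] for f in frames],  # the full ordered sequence
        "question": question,
        "answer": answer,
    }

# Example usage with dummy annotations:
frames = [
    {"image_path": "frame_000.jpg", "caption": "a person opens the fridge"},
    {"image_path": "frame_001.jpg", "caption": "the person pours a glass of milk"},
    {"image_path": "frame_002.jpg", "caption": "the person drinks the milk"},
]
print(build_temporal_sample(frames))
```

Samples of this kind expose temporal order purely through image sequences, which is the kind of baseline the paper compares against video-trained models.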
Related papers
- TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding [26.463523465270097]
Multimodal Large Language Models (MLLMs) have demonstrated significant progress in vision-based tasks, yet they still face challenges when processing long-duration video inputs. We propose Temporal Sampling Policy Optimization (TSPO), advancing MLLMs' long-form video-language understanding via reinforcement learning. Our TSPO achieves state-of-the-art performance across multiple long video understanding benchmarks, and transfers across different cutting-edge Video-MLLMs.
arXiv Detail & Related papers (2025-08-06T12:03:36Z)
- Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency [56.475612147721264]
We propose a dual-reward formulation that supervises both semantic and temporal reasoning through discrete and continuous reward signals. We evaluate our approach across eight representative video understanding tasks, including VideoQA, Temporal Video Grounding, and Grounded VideoQA. Results underscore the importance of reward design and data selection in advancing reasoning-centric video understanding with MLLMs. (A rough sketch of such a dual reward follows this entry.)
arXiv Detail & Related papers (2025-06-02T17:28:26Z)
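The dual-reward idea above pairs a discrete signal (is the answer semantically correct?) with a continuous signal (how well does a predicted time span align with the ground truth?). The snippet below is a minimal sketch under those assumptions; the exact reward terms, weighting, and temporal-IoU formulation are illustrative guesses, not the paper's actual definitions.

```python
def temporal_iou(pred_span, gt_span):
    """Continuous reward component: IoU between predicted and ground-truth
    (start, end) spans in seconds. Returns a value in [0, 1]."""
    inter = max(0.0, min(pred_span[1], gt_span[1]) - max(pred_span[0], gt_span[0]))
    union = max(pred_span[1], gt_span[1]) - min(pred_span[0], gt_span[0])
    return inter / union if union > 0 else 0.0

def dual_reward(pred_answer, gt_answer, pred_span, gt_span, alpha=0.5):
    """Hypothetical combination of a discrete semantic reward (exact-match
    answer correctness) and a continuous temporal reward (span IoU)."""
    semantic = 1.0 if pred_answer.strip().lower() == gt_answer.strip().lower() else 0.0
    temporal = temporal_iou(pred_span, gt_span)
    return alpha * semantic + (1.0 - alpha) * temporal

# Example: correct answer, partially overlapping span -> reward 0.8.
print(dual_reward("a dog", "A dog", pred_span=(2.0, 6.0), gt_span=(3.0, 7.0)))
```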
- MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding [55.32878803528196]
Video temporal understanding is crucial for multimodal large language models (MLLMs) to reason over events in videos. We propose MUSEG, a novel RL-based method that enhances temporal understanding by introducing timestamp-aware multi-segment grounding. To facilitate effective learning, we design a customized RL training recipe with phased rewards that progressively guides the model toward temporally grounded reasoning.
arXiv Detail & Related papers (2025-05-27T04:50:07Z)
- Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture that incorporates a dedicated temporal encoder between the image encoder and the Video-LLM. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks. (A rough sketch of such a temporal encoder follows this entry.)
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
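As context for the "temporal encoder between the image encoder and the Video-LLM" design, here is a minimal PyTorch-style sketch of one way such a module could look: temporal self-attention over per-frame visual tokens before they are passed to the language model. The layer sizes and structure are assumptions for illustration, not STORM's actual architecture.

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Hypothetical temporal module placed between an image encoder and an LLM:
    it lets per-frame visual tokens exchange information across time."""

    def __init__(self, dim=1024, num_heads=8, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.temporal_layers = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_tokens):
        # frame_tokens: (batch, num_frames, tokens_per_frame, dim)
        b, t, n, d = frame_tokens.shape
        # Attend across frames independently for each spatial token position.
        x = frame_tokens.permute(0, 2, 1, 3).reshape(b * n, t, d)
        x = self.temporal_layers(x)
        return x.reshape(b, n, t, d).permute(0, 2, 1, 3)  # back to (b, t, n, d)

# Example: 2 videos, 8 frames, 16 visual tokens per frame, 1024-dim features.
tokens = torch.randn(2, 8, 16, 1024)
print(TemporalEncoder()(tokens).shape)  # torch.Size([2, 8, 16, 1024])
```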
- Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs [56.040198387038025]
We present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs.
Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks.
arXiv Detail & Related papers (2024-10-14T12:35:12Z)
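Free Video-LLM's token reduction is only summarized above. One common way to realize "prompt-guided" reduction is to keep the visual tokens most similar to a pooled text-prompt embedding; the sketch below illustrates that generic idea and is an assumption for illustration, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def prune_visual_tokens(visual_tokens, prompt_embedding, keep_ratio=0.25):
    """Hypothetical prompt-guided pruning: keep the fraction of visual tokens
    whose features are most similar (cosine) to the prompt embedding.

    visual_tokens:    (num_tokens, dim) visual features for a frame or video
    prompt_embedding: (dim,) pooled text embedding of the user prompt
    """
    sims = F.cosine_similarity(visual_tokens, prompt_embedding.unsqueeze(0), dim=-1)
    k = max(1, int(keep_ratio * visual_tokens.shape[0]))
    top_idx = sims.topk(k).indices.sort().values  # keep original token order
    return visual_tokens[top_idx]

# Example: 576 visual tokens of dimension 1024, keep the top 25%.
tokens = torch.randn(576, 1024)
prompt = torch.randn(1024)
print(prune_visual_tokens(tokens, prompt).shape)  # torch.Size([144, 1024])
```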
- From Image to Video, what do we need in multimodal LLMs? [17.847011311716077]
This paper introduces RED-VILLM, a Resource-Efficient Development pipeline for building robust Video LLMs. We devise a temporal adaptation plug-and-play structure, endowing the backbone Image LLM with the capability to grasp temporal information. Our approach highlights the potential for a more cost-effective and scalable advancement in multimodal models.
arXiv Detail & Related papers (2024-04-18T02:43:37Z)
- ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
How to effectively encode and understand videos in video-based dialogue systems remains to be solved.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z)
- LLM4VG: Large Language Models Evaluation for Video Grounding [39.40610479454726]
This paper systematically evaluates the performance of different LLMs on video grounding tasks.
We propose prompt methods to integrate the instruction of VG and description from different kinds of generators.
Our experimental evaluations show that existing VidLLMs are still far from achieving satisfactory video grounding performance, and that more time-related video tasks should be included to further fine-tune these models.
arXiv Detail & Related papers (2023-12-21T08:15:02Z)
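LLM4VG's "prompt methods" combine a video grounding instruction with textual descriptions produced by different generators (e.g., captioners). As a rough illustration of that general idea, the following prompt-assembly sketch uses an assumed description format and instruction wording; it is not the paper's actual prompt.

```python
def build_grounding_prompt(descriptions, query):
    """Hypothetical prompt assembly for evaluating an LLM on video grounding:
    timestamped textual descriptions of the video stand in for visual input.

    descriptions: list of (start_sec, end_sec, text) from some caption generator
    query: the event to localize, e.g. "the person opens the door"
    """
    context = "\n".join(
        f"[{start:.1f}s - {end:.1f}s] {text}" for start, end, text in descriptions
    )
    return (
        "You are given timestamped descriptions of a video.\n"
        f"{context}\n"
        f"Question: During which time span does the following event occur: \"{query}\"?\n"
        "Answer with a start and end time in seconds."
    )

# Example with dummy caption output:
descs = [(0.0, 4.0, "a man walks into the kitchen"),
         (4.0, 9.0, "the man opens the fridge"),
         (9.0, 15.0, "the man pours juice into a glass")]
print(build_grounding_prompt(descs, "the man opens the fridge"))
```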
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.