VideoLLM: Modeling Video Sequence with Large Language Models
- URL: http://arxiv.org/abs/2305.13292v2
- Date: Tue, 23 May 2023 07:48:15 GMT
- Title: VideoLLM: Modeling Video Sequence with Large Language Models
- Authors: Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting
Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, Limin Wang
- Abstract summary: Existing video understanding models are often task-specific and lack a comprehensive capability of handling diverse tasks.
We propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs.
VideoLLM incorporates a carefully designed Modality Encoder and Semantic Translator, which convert inputs from various modalities into a unified token sequence.
- Score: 70.32832021713864
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: With the exponential growth of video data, there is an urgent need for
automated technology to analyze and comprehend video content. However, existing
video understanding models are often task-specific and lack a comprehensive
capability of handling diverse tasks. The success of large language models
(LLMs) like GPT has demonstrated their impressive abilities in sequence causal
reasoning. Building upon this insight, we propose a novel framework called
VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs
from natural language processing (NLP) for video sequence understanding.
VideoLLM incorporates a carefully designed Modality Encoder and Semantic
Translator, which convert inputs from various modalities into a unified token
sequence. This token sequence is then fed into a decoder-only LLM.
Subsequently, with the aid of a simple task head, our VideoLLM yields an
effective unified framework for different kinds of video understanding tasks.
To evaluate the efficacy of VideoLLM, we conduct extensive experiments using
multiple LLMs and fine-tuning methods. We evaluate our VideoLLM on eight tasks
sourced from four different datasets. The experimental results demonstrate that
the understanding and reasoning capabilities of LLMs can be effectively
transferred to video understanding tasks. We release the code at
https://github.com/cg1177/VideoLLM.
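The pipeline described in the abstract can be read as a thin adapter stack around a frozen language backbone. Below is a minimal PyTorch sketch under that reading; the module names (`visual_encoder`, `translator`, `task_head`), the HuggingFace-style `inputs_embeds` call, and all shapes are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn


class VideoLLMSketch(nn.Module):
    """Minimal sketch of the pipeline: modality encoder -> semantic
    translator -> frozen decoder-only LLM -> simple task head.
    Module names and shapes are illustrative, not the released code."""

    def __init__(self, visual_encoder: nn.Module, llm: nn.Module,
                 visual_dim: int, llm_dim: int, num_classes: int):
        super().__init__()
        self.visual_encoder = visual_encoder              # e.g. a frozen per-frame image encoder
        self.translator = nn.Linear(visual_dim, llm_dim)  # semantic translator into the LLM token space
        self.llm = llm                                    # decoder-only LLM used as a sequence reasoner
        for p in self.llm.parameters():                   # keep the language backbone frozen
            p.requires_grad = False
        self.task_head = nn.Linear(llm_dim, num_classes)  # lightweight per-timestep task head

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, height, width)
        b, t = frames.shape[:2]
        feats = self.visual_encoder(frames.flatten(0, 1))           # (b*t, visual_dim)
        tokens = self.translator(feats).view(b, t, -1)              # unified token sequence (b, t, llm_dim)
        hidden = self.llm(inputs_embeds=tokens).last_hidden_state   # assumes a HuggingFace-style decoder
        return self.task_head(hidden)                               # (b, t, num_classes), e.g. per-frame labels
```

Only the translator and task head are trained here, which matches the abstract's framing of the LLM as a reusable sequence reasoner rather than a component to be retrained per task.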
Related papers
- TRACE: Temporal Grounding Video LLM via Causal Event Modeling [6.596327795743185]
Video Temporal Grounding (VTG) is a crucial capability for video understanding models and plays a vital role in downstream tasks such as video browsing and editing.
Current video LLMs rely exclusively on natural language generation, lacking the ability to model the clear structure inherent in videos.
This paper introduces a causal event modeling framework that represents video LLM outputs as sequences of events and predicts the current event from previous events, video inputs, and textual instructions.
We propose a novel task-interleaved video LLM called TRACE to effectively implement the causal event modeling framework in practice.
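Read literally, the causal event modeling above factorizes the output into a chain of events, each conditioned on the video, the instruction, and all earlier events. A minimal sketch of that generation loop, with a hypothetical `predict_next` standing in for the actual model call:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Event:
    start: float    # event start time in seconds
    end: float      # event end time in seconds
    caption: str    # textual description of the event


def generate_events(video_feats: object,
                    instruction: str,
                    predict_next: Callable[[object, str, List[Event]], Optional[Event]],
                    max_events: int = 32) -> List[Event]:
    """Causal event modeling, sketched: each event is predicted from the
    video, the textual instruction, and all previously generated events.
    `predict_next` is a placeholder, not the TRACE model interface."""
    events: List[Event] = []
    for _ in range(max_events):
        nxt = predict_next(video_feats, instruction, events)
        if nxt is None:          # the model signals end of the event sequence
            break
        events.append(nxt)
    return events
```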
arXiv Detail & Related papers (2024-10-08T02:46:30Z)
- ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
How to effectively encode and understand videos in video-based dialogue systems remains an open problem.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z)
- VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding [65.12464615430036]
This paper introduces a Video Understanding and Reasoning Framework (VURF) based on the reasoning power of Large Language Models (LLMs).
It is a novel approach that extends the utility of LLMs to video tasks.
We harness their contextual learning capabilities to generate executable visual programs for video understanding.
arXiv Detail & Related papers (2024-03-21T18:00:00Z)
- Video Understanding with Large Language Models: A Survey [97.29126722004949]
Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding.
The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their capacity for open-ended multi-granularity reasoning.
This survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs.
arXiv Detail & Related papers (2023-12-29T01:56:17Z)
- Retrieval-based Video Language Model for Efficient Long Video Question Answering [39.474247695753725]
We introduce a retrieval-based video language model (R-VLM) for efficient and interpretable long video QA.
Specifically, given a question (query) and a long video, our model identifies and selects the most relevant $K$ video chunks.
Our experimental results validate the effectiveness of our framework for comprehending long videos.
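A minimal sketch of the chunk-selection step summarized above, assuming precomputed question and chunk embeddings from a shared text-video encoder; the cosine-similarity scoring and function name are assumptions, not necessarily the paper's choices.

```python
import numpy as np


def select_top_k_chunks(question_emb: np.ndarray,
                        chunk_embs: np.ndarray,
                        k: int = 5) -> np.ndarray:
    """Return indices of the k video chunks most similar to the question.
    question_emb: (dim,), chunk_embs: (num_chunks, dim)."""
    q = question_emb / np.linalg.norm(question_emb)
    c = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    scores = c @ q                        # cosine similarity per chunk
    return np.argsort(scores)[::-1][:k]   # indices of the highest-scoring chunks
```

Only the selected chunks are then passed to the language model, which is what keeps long-video QA tractable and interpretable in this setup.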
arXiv Detail & Related papers (2023-12-08T09:48:36Z)
- MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [63.14000659130736]
We introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench.
We first introduce a novel static-to-dynamic method to define these temporal-related tasks.
Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task.
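One way to picture the annotation-to-QA conversion mentioned above, sketched with hypothetical field names; MVBench's actual pipeline may differ.

```python
import random
from typing import Dict, List


def to_multiple_choice(question: str, correct: str,
                       distractors: List[str], seed: int = 0) -> Dict:
    """Turn an existing annotation into a multiple-choice QA item by mixing
    the ground-truth answer in with distractor options (illustrative only)."""
    rng = random.Random(seed)
    options = distractors + [correct]
    rng.shuffle(options)
    return {
        "question": question,
        "options": options,
        "answer_index": options.index(correct),
    }
```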
arXiv Detail & Related papers (2023-11-28T17:59:04Z)
- Frozen Transformers in Language Models Are Effective Visual Encoder Layers [26.759544759745648]
Large language models (LLMs) are surprisingly strong encoders for purely visual tasks in the absence of language.
Our work pushes the boundaries of leveraging LLMs for computer vision tasks.
We propose the information filtering hypothesis to explain the effectiveness of pre-trained LLMs in visual encoding.
arXiv Detail & Related papers (2023-10-19T17:59:05Z)
- Valley: Video Assistant with Large Language model Enhanced abilitY [41.79449203718827]
We introduce Valley, a Video Assistant with Large Language model Enhanced abilitY.
To empower Valley with video comprehension and instruction-following capabilities, we construct a video instruction dataset.
We employ ChatGPT to facilitate the construction of task-oriented conversation data.
arXiv Detail & Related papers (2023-06-12T16:11:10Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset for persuasion strategy identification, a crucial task in computational social science.
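The verbalize-then-understand idea above is a two-stage, text-only pipeline. A minimal sketch under that reading, where both callables are placeholders for a real frame captioner and a text-only LLM:

```python
from typing import Callable, List


def verbalize_and_answer(frames: List,
                         caption_frame: Callable[[object], str],
                         ask_llm: Callable[[str], str],
                         task_prompt: str) -> str:
    """Two-stage zero-shot pipeline sketched from the summary above:
    (1) verbalize sampled frames into a natural-language story,
    (2) run the downstream task on that story with a text-only LLM.
    Both callables are hypothetical stand-ins for real backends."""
    story = " ".join(caption_frame(f) for f in frames)   # frame captions -> story
    prompt = f"{task_prompt}\n\nStory:\n{story}"
    return ask_llm(prompt)
```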
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
- LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling [102.42424022921243]
Masked Language Modeling (MLM) is used as the common interface for all pre-training and downstream tasks.
Experiments show that this unified framework achieves competitive performance on 14 VidL benchmarks.
arXiv Detail & Related papers (2022-06-14T20:43:25Z)