VideoLLM: Modeling Video Sequence with Large Language Models
- URL: http://arxiv.org/abs/2305.13292v2
- Date: Tue, 23 May 2023 07:48:15 GMT
- Title: VideoLLM: Modeling Video Sequence with Large Language Models
- Authors: Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting
Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, Limin Wang
- Abstract summary: Existing video understanding models are often task-specific and lack a comprehensive capability of handling diverse tasks.
We propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs.
VideoLLM incorporates a carefully designed Modality Encoder and Semantic Translator, which convert inputs from various modalities into a unified token sequence.
- Score: 70.32832021713864
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: With the exponential growth of video data, there is an urgent need for
automated technology to analyze and comprehend video content. However, existing
video understanding models are often task-specific and lack a comprehensive
capability of handling diverse tasks. The success of large language models
(LLMs) like GPT has demonstrated their impressive abilities in sequence causal
reasoning. Building upon this insight, we propose a novel framework called
VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs
from natural language processing (NLP) for video sequence understanding.
VideoLLM incorporates a carefully designed Modality Encoder and Semantic
Translator, which convert inputs from various modalities into a unified token
sequence. This token sequence is then fed into a decoder-only LLM.
Subsequently, with the aid of a simple task head, our VideoLLM yields an
effective unified framework for different kinds of video understanding tasks.
To evaluate the efficacy of VideoLLM, we conduct extensive experiments using
multiple LLMs and fine-tuning methods. We evaluate our VideoLLM on eight tasks
sourced from four different datasets. The experimental results demonstrate that
the understanding and reasoning capabilities of LLMs can be effectively
transferred to video understanding tasks. We release the code at
https://github.com/cg1177/VideoLLM.
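The pipeline described in the abstract can be read as a thin adapter stack around a frozen language backbone. Below is a minimal PyTorch sketch under that reading; the module names (`visual_encoder`, `translator`, `task_head`), the HuggingFace-style `inputs_embeds` call, and all shapes are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn


class VideoLLMSketch(nn.Module):
    """Minimal sketch of the pipeline: modality encoder -> semantic
    translator -> frozen decoder-only LLM -> simple task head.
    Module names and shapes are illustrative, not the released code."""

    def __init__(self, visual_encoder: nn.Module, llm: nn.Module,
                 visual_dim: int, llm_dim: int, num_classes: int):
        super().__init__()
        self.visual_encoder = visual_encoder              # e.g. a frozen per-frame image encoder
        self.translator = nn.Linear(visual_dim, llm_dim)  # semantic translator into the LLM token space
        self.llm = llm                                    # decoder-only LLM used as a sequence reasoner
        for p in self.llm.parameters():                   # keep the language backbone frozen
            p.requires_grad = False
        self.task_head = nn.Linear(llm_dim, num_classes)  # lightweight per-timestep task head

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, height, width)
        b, t = frames.shape[:2]
        feats = self.visual_encoder(frames.flatten(0, 1))           # (b*t, visual_dim)
        tokens = self.translator(feats).view(b, t, -1)              # unified token sequence (b, t, llm_dim)
        hidden = self.llm(inputs_embeds=tokens).last_hidden_state   # assumes a HuggingFace-style decoder
        return self.task_head(hidden)                               # (b, t, num_classes), e.g. per-frame labels
```

Only the translator and task head are trained here, which matches the abstract's framing of the LLM as a reusable sequence reasoner rather than a component to be retrained per task.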
Related papers
- TRACE: Temporal Grounding Video LLM via Causal Event Modeling [6.596327795743185]
Video Temporal Grounding (VTG) is a crucial capability for video understanding models and plays a vital role in downstream tasks such as video browsing and editing.
Current video LLMs rely exclusively on natural language generation, lacking the ability to model the clear structure inherent in videos.
This paper introduces a causal event modeling framework that represents video LLM outputs as sequences of events and predicts the current event from previous events, video inputs, and textual instructions.
We propose a novel task-interleaved video LLM called TRACE to effectively implement the causal event modeling framework in practice.
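Read literally, the causal event modeling above factorizes the output into a chain of events, each conditioned on the video, the instruction, and all earlier events. A minimal sketch of that generation loop, with a hypothetical `predict_next` standing in for the actual model call:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Event:
    start: float    # event start time in seconds
    end: float      # event end time in seconds
    caption: str    # textual description of the event


def generate_events(video_feats: object,
                    instruction: str,
                    predict_next: Callable[[object, str, List[Event]], Optional[Event]],
                    max_events: int = 32) -> List[Event]:
    """Causal event modeling, sketched: each event is predicted from the
    video, the textual instruction, and all previously generated events.
    `predict_next` is a placeholder, not the TRACE model interface."""
    events: List[Event] = []
    for _ in range(max_events):
        nxt = predict_next(video_feats, instruction, events)
        if nxt is None:          # the model signals end of the event sequence
            break
        events.append(nxt)
    return events
```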
arXiv Detail & Related papers (2024-10-08T02:46:30Z)
- ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
How to effectively encode and understand videos in video-based dialogue systems remains an open problem.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z)
- VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding [65.12464615430036]
This paper introduces a Video Understanding and Reasoning Framework (VURF) based on the reasoning power of Large Language Models (LLMs).
It is a novel approach that extends the utility of LLMs to video tasks.
We harness their contextual learning capabilities to generate executable visual programs for video understanding.
arXiv Detail & Related papers (2024-03-21T18:00:00Z)
- Video Understanding with Large Language Models: A Survey [97.29126722004949]
Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding.
The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their capacity for open-ended multi-granularity reasoning.
This survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs.
arXiv Detail & Related papers (2023-12-29T01:56:17Z)
- Retrieval-based Video Language Model for Efficient Long Video Question Answering [39.474247695753725]
We introduce a retrieval-based video language model (R-VLM) for efficient and interpretable long video QA.
Specifically, given a question (query) and a long video, our model identifies and selects the most relevant $K$ video chunks.
Our experimental results validate the effectiveness of our framework for comprehending long videos.
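A minimal sketch of the chunk-selection step summarized above, assuming precomputed question and chunk embeddings from a shared text-video encoder; the cosine-similarity scoring and function name are assumptions, not necessarily the paper's choices.

```python
import numpy as np


def select_top_k_chunks(question_emb: np.ndarray,
                        chunk_embs: np.ndarray,
                        k: int = 5) -> np.ndarray:
    """Return indices of the k video chunks most similar to the question.
    question_emb: (dim,), chunk_embs: (num_chunks, dim)."""
    q = question_emb / np.linalg.norm(question_emb)
    c = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    scores = c @ q                        # cosine similarity per chunk
    return np.argsort(scores)[::-1][:k]   # indices of the highest-scoring chunks
```

Only the selected chunks are then passed to the language model, which is what keeps long-video QA tractable and interpretable in this setup.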
arXiv Detail & Related papers (2023-12-08T09:48:36Z)
- MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [63.14000659130736]
We introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench.
We first introduce a novel static-to-dynamic method to define these temporal-related tasks.
Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task.
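One way to picture the annotation-to-QA conversion mentioned above, sketched with hypothetical field names; MVBench's actual pipeline may differ.

```python
import random
from typing import Dict, List


def to_multiple_choice(question: str, correct: str,
                       distractors: List[str], seed: int = 0) -> Dict:
    """Turn an existing annotation into a multiple-choice QA item by mixing
    the ground-truth answer in with distractor options (illustrative only)."""
    rng = random.Random(seed)
    options = distractors + [correct]
    rng.shuffle(options)
    return {
        "question": question,
        "options": options,
        "answer_index": options.index(correct),
    }
```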
arXiv Detail & Related papers (2023-11-28T17:59:04Z)
- Frozen Transformers in Language Models Are Effective Visual Encoder Layers [26.759544759745648]
Large language models (LLMs) are surprisingly strong encoders for purely visual tasks in the absence of language.
Our work pushes the boundaries of leveraging LLMs for computer vision tasks.
We propose the information filtering hypothesis to explain the effectiveness of pre-trained LLMs in visual encoding.
arXiv Detail & Related papers (2023-10-19T17:59:05Z)
- Valley: Video Assistant with Large Language model Enhanced abilitY [41.79449203718827]
We introduce Valley, a Video Assistant with Large Language model Enhanced abilitY.
To empower Valley with video comprehension and instruction-following capabilities, we construct a video instruction dataset.
We employ ChatGPT to facilitate the construction of task-oriented conversation data.
arXiv Detail & Related papers (2023-06-12T16:11:10Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset for persuasion strategy identification, a crucial task in computational social science.
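The verbalize-then-understand idea above is a two-stage, text-only pipeline. A minimal sketch under that reading, where both callables are placeholders for a real frame captioner and a text-only LLM:

```python
from typing import Callable, List


def verbalize_and_answer(frames: List,
                         caption_frame: Callable[[object], str],
                         ask_llm: Callable[[str], str],
                         task_prompt: str) -> str:
    """Two-stage zero-shot pipeline sketched from the summary above:
    (1) verbalize sampled frames into a natural-language story,
    (2) run the downstream task on that story with a text-only LLM.
    Both callables are hypothetical stand-ins for real backends."""
    story = " ".join(caption_frame(f) for f in frames)   # frame captions -> story
    prompt = f"{task_prompt}\n\nStory:\n{story}"
    return ask_llm(prompt)
```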
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
- LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling [102.42424022921243]
Masked Language Modeling (MLM) is used as the common interface for all pre-training and downstream tasks.
Experiments show that this unified framework achieves competitive performance on 14 VidL benchmarks.
arXiv Detail & Related papers (2022-06-14T20:43:25Z)