VTimeLLM: Empower LLM to Grasp Video Moments
- URL: http://arxiv.org/abs/2311.18445v1
- Date: Thu, 30 Nov 2023 10:49:56 GMT
- Title: VTimeLLM: Empower LLM to Grasp Video Moments
- Authors: Bin Huang, Xin Wang, Hong Chen, Zihan Song, Wenwu Zhu
- Abstract summary: Large language models (LLMs) have shown remarkable text understanding capabilities.
Video LLMs can only provide a coarse description of the entire video.
We propose VTimeLLM, a novel Video LLM for fine-grained video moment understanding.
- Score: 43.51980030572101
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) have shown remarkable text understanding
capabilities, which have been extended as Video LLMs to handle video data for
comprehending visual details. However, existing Video LLMs can only provide a
coarse description of the entire video, failing to capture the precise start
and end time boundary of specific events. In this paper, we solve this issue
via proposing VTimeLLM, a novel Video LLM designed for fine-grained video
moment understanding and reasoning with respect to time boundary. Specifically,
our VTimeLLM adopts a boundary-aware three-stage training strategy, which
respectively utilizes image-text pairs for feature alignment, multiple-event
videos to increase temporal-boundary awareness, and high-quality
video-instruction tuning to further improve temporal understanding ability as
well as align with human intents. Extensive experiments demonstrate that in
fine-grained time-related comprehension tasks for videos such as Temporal Video
Grounding and Dense Video Captioning, VTimeLLM significantly outperforms
existing Video LLMs. Besides, benefits from the fine-grained temporal
understanding of the videos further enable VTimeLLM to beat existing Video LLMs
in video dialogue benchmark, showing its superior cross-modal understanding and
reasoning abilities.
Related papers
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches with a fraction of the prominent video content in the foreground with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z) - LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding.
We encode video representations that incorporate both local and global information.
Our model produces more precise responses for long video understanding.
arXiv Detail & Related papers (2024-04-04T11:33:29Z) - HawkEye: Training Video-Text LLMs for Grounding Text in Videos [44.870165050047355]
We propose HawkEye, one of the first video-text LLMs that can perform temporal video grounding in a fully text-to-text manner.
To collect training data that is applicable for temporal video grounding, we construct InternVid-G, a large-scale video-text corpus with segment-level captions and negative spans.
We also propose a coarse-grained method of representing segments in videos, which is more robust and easier for LLMs to learn and follow than other alternatives.
arXiv Detail & Related papers (2024-03-15T11:58:18Z) - LLMs Meet Long Video: Advancing Long Video Comprehension with An
Interactive Visual Adapter in LLMs [24.79384819644494]
Long video understanding is a significant and ongoing challenge in the intersection of multimedia and artificial intelligence.
We present an Interactive Visual Adapter (IVA) within large language models (LLMs) to enhance interaction with fine-grained visual elements.
arXiv Detail & Related papers (2024-02-21T05:56:52Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of its dynamics video.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - Video Understanding with Large Language Models: A Survey [97.29126722004949]
Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding.
The emergent capabilities Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity reasoning.
This survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs.
arXiv Detail & Related papers (2023-12-29T01:56:17Z) - VideoLLM: Modeling Video Sequence with Large Language Models [70.32832021713864]
Existing video understanding models are often task-specific and lack a comprehensive capability of handling diverse tasks.
We propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs.
VideoLLM incorporates a carefully designed Modality and Semantic Translator, which convert inputs from various modalities into a unified token sequence.
arXiv Detail & Related papers (2023-05-22T17:51:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.