VTimeLLM: Empower LLM to Grasp Video Moments
- URL: http://arxiv.org/abs/2311.18445v1
- Date: Thu, 30 Nov 2023 10:49:56 GMT
- Title: VTimeLLM: Empower LLM to Grasp Video Moments
- Authors: Bin Huang, Xin Wang, Hong Chen, Zihan Song, Wenwu Zhu
- Abstract summary: Large language models (LLMs) have shown remarkable text understanding capabilities.
Video LLMs can only provide a coarse description of the entire video.
We propose VTimeLLM, a novel Video LLM for fine-grained video moment understanding.
- Score: 43.51980030572101
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) have shown remarkable text understanding
capabilities, which have been extended as Video LLMs to handle video data for
comprehending visual details. However, existing Video LLMs can only provide a
coarse description of the entire video, failing to capture the precise start
and end time boundaries of specific events. In this paper, we address this issue
by proposing VTimeLLM, a novel Video LLM designed for fine-grained video
moment understanding and reasoning with respect to time boundaries. Specifically,
our VTimeLLM adopts a boundary-aware three-stage training strategy, which
respectively utilizes image-text pairs for feature alignment, multiple-event
videos to increase temporal-boundary awareness, and high-quality
video-instruction tuning to further improve temporal understanding ability as
well as align with human intents. Extensive experiments demonstrate that in
fine-grained time-related comprehension tasks for videos such as Temporal Video
Grounding and Dense Video Captioning, VTimeLLM significantly outperforms
existing Video LLMs. Moreover, the fine-grained temporal understanding of videos
further enables VTimeLLM to beat existing Video LLMs on video dialogue
benchmarks, showing its superior cross-modal understanding and reasoning
abilities.
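The abstract describes a boundary-aware three-stage training strategy: feature alignment on image-text pairs, temporal-boundary training on multi-event videos, and video-instruction tuning. The following is a minimal sketch of how such a staged schedule could be organized; the placeholder model, parameter grouping, data, and hyperparameters are assumptions for illustration, not VTimeLLM's actual implementation.

```python
# Hypothetical sketch of a boundary-aware three-stage training schedule,
# loosely following the staging described in the VTimeLLM abstract.
# Module names, data sources, and hyperparameters are illustrative placeholders.
from dataclasses import dataclass

import torch
from torch import nn


@dataclass
class StageConfig:
    name: str
    data: str                # which corpus the stage draws on
    train_projector: bool    # stage 1: align visual features to the LLM token space
    train_llm_adapter: bool  # stages 2-3: temporal-boundary and instruction tuning
    epochs: int


STAGES = [
    StageConfig("feature_alignment", "image-text pairs", True, False, 1),
    StageConfig("boundary_awareness", "multi-event videos", False, True, 2),
    StageConfig("instruction_tuning", "video-instruction data", False, True, 1),
]


class ToyVideoLLM(nn.Module):
    """Placeholder: a visual projector plus a stand-in for LLM adapter weights."""

    def __init__(self, vis_dim: int = 512, llm_dim: int = 1024) -> None:
        super().__init__()
        self.projector = nn.Linear(vis_dim, llm_dim)    # maps frame features to token space
        self.llm_adapter = nn.Linear(llm_dim, llm_dim)  # stand-in for adapter-style weights

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        return self.llm_adapter(self.projector(frame_feats))


def run_stage(model: ToyVideoLLM, cfg: StageConfig) -> None:
    # Freeze or unfreeze parameter groups according to the stage.
    for p in model.projector.parameters():
        p.requires_grad = cfg.train_projector
    for p in model.llm_adapter.parameters():
        p.requires_grad = cfg.train_llm_adapter

    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=1e-4)

    for _ in range(cfg.epochs):
        # Dummy batch standing in for cfg.data; a real pipeline would stream
        # image-text pairs, multi-event videos, or instruction data here.
        frame_feats = torch.randn(8, 512)
        target = torch.randn(8, 1024)
        loss = nn.functional.mse_loss(model(frame_feats), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"stage={cfg.name} data={cfg.data} loss={loss.item():.4f}")


model = ToyVideoLLM()
for cfg in STAGES:
    run_stage(model, cfg)
```

Each stage here simply toggles which parameter group receives gradients and swaps the data source, which mirrors the staged description in the abstract only at a very coarse level.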
Related papers
- TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning [42.928144657587325]
This paper proposes TimeSuite, a collection of new designs to adapt the existing short-form video MLLMs for long video understanding.
TimeSuite provides a successful solution to enhance the long video understanding capability of short-form MLLMs.
In addition, we introduce the TimePro, a comprehensive grounding-centric instruction dataset composed of 9 tasks and 349k high-quality grounded annotations.
arXiv Detail & Related papers (2024-10-25T17:19:55Z)
- Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models [53.235170710385006]
We introduce Grounded-VideoLLM, a novel Video-LLM adept at perceiving and reasoning over specific video moments in a fine-grained manner.
We sharpen our model by incorporating (1) an additional temporal stream to encode the relationships between frames and (2) discrete temporal tokens enriched with specific time knowledge.
In experiments, Grounded-VideoLLM excels in fine-grained grounding tasks such as temporal sentence grounding, dense video captioning, and grounded VideoQA.
arXiv Detail & Related papers (2024-10-04T10:04:37Z)
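The Grounded-VideoLLM entry above mentions discrete temporal tokens enriched with time knowledge. Below is a minimal sketch of the general idea of quantizing timestamps into a small special-token vocabulary that an LLM can emit and that can be decoded back to seconds; the bin count and token format are assumptions, not the paper's exact scheme.

```python
# Hypothetical sketch of discrete temporal tokens: timestamps are quantized into a
# small vocabulary of special tokens (here <T0> ... <T99>) that a language model
# can generate directly. Bin count and token format are illustrative assumptions.

NUM_BINS = 100
TIME_TOKENS = [f"<T{i}>" for i in range(NUM_BINS)]


def timestamp_to_token(t_sec: float, duration_sec: float) -> str:
    """Quantize an absolute timestamp into one of NUM_BINS relative-time tokens."""
    frac = min(max(t_sec / duration_sec, 0.0), 1.0)
    return TIME_TOKENS[min(int(frac * NUM_BINS), NUM_BINS - 1)]


def token_to_timestamp(token: str, duration_sec: float) -> float:
    """Map a time token back to the centre of its bin, in seconds."""
    idx = int(token.strip("<>").lstrip("T"))
    return (idx + 0.5) / NUM_BINS * duration_sec


# Example: a 120 s video with an event from 10 s to 25 s.
start, end = timestamp_to_token(10, 120), timestamp_to_token(25, 120)
print(start, end)                    # <T8> <T20>
print(token_to_timestamp(end, 120))  # 24.6 s
```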
- Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner [53.671484175063995]
Video-LLMs are pre-trained to process short videos, limiting their broader application for understanding longer video content.
We introduce an alternative video token rearrangement technique that circumvents limitations imposed by the fixed video encoder and alignment projector.
arXiv Detail & Related papers (2024-09-19T17:59:55Z)
- LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding.
We encode video representations that incorporate both local and global information.
Our model produces more precise responses for long video understanding.
arXiv Detail & Related papers (2024-04-04T11:33:29Z)
- HawkEye: Training Video-Text LLMs for Grounding Text in Videos [44.870165050047355]
We propose HawkEye, one of the first video-text LLMs that can perform temporal video grounding in a fully text-to-text manner.
To collect training data that is applicable for temporal video grounding, we construct InternVid-G, a large-scale video-text corpus with segment-level captions and negative spans.
We also propose a coarse-grained method of representing segments in videos, which is more robust and easier for LLMs to learn and follow than other alternatives.
arXiv Detail & Related papers (2024-03-15T11:58:18Z)
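The HawkEye entry above refers to a coarse-grained way of representing segments in videos. The sketch below illustrates one plausible reading of that idea, mapping a normalized time span to a small set of coarse textual labels; the label set and thresholds are assumptions for illustration rather than HawkEye's actual formulation.

```python
# Hypothetical illustration of coarse-grained segment representation: rather than
# emitting exact timestamps, a span is described by the rough portion of the video
# it occupies. Labels and thresholds below are illustrative assumptions.

def coarse_segment_label(start: float, end: float, duration: float) -> str:
    """Map a (start, end) span in seconds to a coarse textual description."""
    s, e = start / duration, end / duration
    if e - s > 0.75:
        return "throughout the video"
    centre = (s + e) / 2
    if centre < 1 / 3:
        return "at the beginning of the video"
    if centre < 2 / 3:
        return "in the middle of the video"
    return "towards the end of the video"


# Example: a 90 s clip with an event from 10 s to 25 s.
print(coarse_segment_label(10, 25, 90))  # at the beginning of the video
```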
- LLMs Meet Long Video: Advancing Long Video Question Answering with An Interactive Visual Adapter in LLMs [22.696090318037925]
Long video understanding is a significant and ongoing challenge at the intersection of multimedia and artificial intelligence.
We present an Interactive Visual Adapter (IVA) within large language models (LLMs) to enhance interaction with fine-grained visual elements.
arXiv Detail & Related papers (2024-02-21T05:56:52Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- Video Understanding with Large Language Models: A Survey [97.29126722004949]
Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding.
The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity reasoning.
This survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs.
arXiv Detail & Related papers (2023-12-29T01:56:17Z)