Vript: A Video Is Worth Thousands of Words
- URL: http://arxiv.org/abs/2406.06040v1
- Date: Mon, 10 Jun 2024 06:17:55 GMT
- Title: Vript: A Video Is Worth Thousands of Words
- Authors: Dongjie Yang, Suyuan Huang, Chengqiang Lu, Xiaodong Han, Haoxin Zhang, Yan Gao, Yao Hu, Hai Zhao,
- Abstract summary: Vript is an annotated corpus of 12K high-resolution videos with detailed, dense, and script-like captions.
Each clip has a caption of 145 words, which is over 10x longer than most video-text datasets.
Vript is a powerful model capable of end-to-end generation of dense and detailed captions for long videos.
- Score: 54.815686588378156
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advancements in multimodal learning, particularly in video understanding and generation, require high-quality video-text datasets for improved model performance. Vript addresses this issue with a meticulously annotated corpus of 12K high-resolution videos, offering detailed, dense, and script-like captions for over 420K clips. Each clip has a caption of ~145 words, which is over 10x longer than most video-text datasets. Unlike captions only documenting static content in previous datasets, we enhance video captioning to video scripting by documenting not just the content, but also the camera operations, which include the shot types (medium shot, close-up, etc) and camera movements (panning, tilting, etc). By utilizing the Vript, we explore three training paradigms of aligning more text with the video modality rather than clip-caption pairs. This results in Vriptor, a top-performing video captioning model among open-source models, comparable to GPT-4V in performance. Vriptor is also a powerful model capable of end-to-end generation of dense and detailed captions for long videos. Moreover, we introduce Vript-Hard, a benchmark consisting of three video understanding tasks that are more challenging than existing benchmarks: Vript-HAL is the first benchmark evaluating action and object hallucinations in video LLMs, Vript-RR combines reasoning with retrieval resolving question ambiguity in long-video QAs, and Vript-ERO is a new task to evaluate the temporal understanding of events in long videos rather than actions in short videos in previous works. All code, models, and datasets are available in https://github.com/mutonix/Vript.
Related papers
- PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance [44.08446730529495]
We propose a novel pooling strategy that simultaneously achieves token compression and instruction-aware visual feature aggregation.
Our model is termed Prompt-guided Pooling LLaVA, or PPLLaVA for short.
arXiv Detail & Related papers (2024-11-04T17:50:36Z) - ShareGPT4Video: Improving Video Understanding and Generation with Better Captions [93.29360532845062]
We present the ShareGPT4Video series, aiming to facilitate the video understanding of large video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) via dense and precise captions.
The series comprises: ShareGPT4Video, 40K GPT4V annotated dense captions of videos with various lengths and sources, developed through carefully designed data filtering and annotating strategy.
We further develop ShareCaptioner-Video, a superior captioner capable of efficiently generating high-quality captions for arbitrary videos.
arXiv Detail & Related papers (2024-06-06T17:58:54Z) - A Video is Worth 10,000 Words: Training and Benchmarking with Diverse
Captions for Better Long Video Retrieval [43.58794386905177]
Existing long video retrieval systems are trained and tested in the paragraph-to-video retrieval regime.
This neglects the richness and variety of possible valid descriptions of a video.
We propose a pipeline that leverages state-of-the-art large language models to carefully generate a diverse set of synthetic captions for long videos.
arXiv Detail & Related papers (2023-11-30T18:59:45Z) - VicTR: Video-conditioned Text Representations for Activity Recognition [73.09929391614266]
We argue that better video-VLMs can be designed by focusing more on augmenting text, rather than visual information.
We introduce Video-conditioned Text Representations (VicTR), a form of text embeddings optimized w.r.t. visual embeddings.
Our model can further make use of freely-available semantic information, in the form of visually-grounded auxiliary text.
arXiv Detail & Related papers (2023-04-05T16:30:36Z) - HierVL: Learning Hierarchical Video-Language Embeddings [108.77600799637172]
HierVL is a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations.
We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level.
Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart as well as a long-term video representation that achieves SotA.
arXiv Detail & Related papers (2023-01-05T21:53:19Z) - Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? [131.300931102986]
In real-world scenarios, online videos are often accompanied by relevant text information such as titles, tags, and even subtitles.
We propose a novel approach to text-video retrieval, where we directly generate associated captions from videos using zero-shot video captioning.
We conduct comprehensive ablation studies to demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2022-12-31T11:50:32Z) - Contrastive Video-Language Learning with Fine-grained Frame Sampling [54.542962813921214]
FineCo is an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames.
It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence.
arXiv Detail & Related papers (2022-10-10T22:48:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.