InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding
and Generation
- URL: http://arxiv.org/abs/2307.06942v2
- Date: Thu, 4 Jan 2024 05:00:34 GMT
- Title: InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding
and Generation
- Authors: Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao
Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Conghui He, Ping Luo, Ziwei Liu,
Yali Wang, Limin Wang, Yu Qiao
- Abstract summary: InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
- Score: 90.71796406228265
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper introduces InternVid, a large-scale video-centric multimodal
dataset that enables learning powerful and transferable video-text
representations for multimodal understanding and generation. The InternVid
dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M
video clips accompanied by detailed descriptions totaling 4.1B words. Our core
contribution is to develop a scalable approach to autonomously build a
high-quality video-text dataset with large language models (LLM), thereby
showcasing its efficacy in learning video-language representation at scale.
Specifically, we utilize a multi-scale approach to generate video-related
descriptions. Furthermore, we introduce ViCLIP, a video-text representation
learning model based on ViT-L. Learned on InternVid via contrastive learning,
this model demonstrates leading zero-shot action recognition and competitive
video retrieval performance. Beyond basic video understanding tasks like
recognition and retrieval, our dataset and model have broad applications. They
are particularly beneficial for generating interleaved video-text data for
learning a video-centric dialogue system, advancing video-to-text and
text-to-video generation research. These proposed resources provide a tool for
researchers and practitioners interested in multimodal video understanding and
generation.
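The headline figures in the abstract imply some per-clip averages. The following is a rough back-of-the-envelope sketch in Python, not a computation from the paper. It uses the rounded values quoted above ("over 7 million", "nearly 760K hours") and assumes the quoted hours correspond to the total clip duration, so the results are approximations only. The constant and function names are illustrative.

```python
# Rounded figures quoted in the abstract; "over" and "nearly" mean these
# are approximations, so every derived value below is approximate too.
NUM_VIDEOS = 7_000_000
TOTAL_HOURS = 760_000
NUM_CLIPS = 234_000_000
TOTAL_WORDS = 4_100_000_000


def derived_stats(num_videos, total_hours, num_clips, total_words):
    """Return rough per-clip and per-video averages from the totals."""
    total_seconds = total_hours * 3600
    return {
        # Assumes the quoted hours are spread evenly over all clips.
        "seconds_per_clip": total_seconds / num_clips,
        "words_per_clip": total_words / num_clips,
        "clips_per_video": num_clips / num_videos,
    }


stats = derived_stats(NUM_VIDEOS, TOTAL_HOURS, NUM_CLIPS, TOTAL_WORDS)
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions, a clip averages roughly 12 seconds with a description of about 17 to 18 words, and each source video yields on the order of 33 clips.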
Related papers
- Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset [4.452729255042396]
A more robust and holistic language-video representation is the key to pushing video understanding forward.
The current plain and simple text descriptions and the visual-only focus for the language-video tasks result in a limited capacity in real-world natural language video retrieval tasks.
This paper introduces a method to automatically enhance video-language datasets, making them more modality and context-aware.
arXiv Detail & Related papers (2024-06-19T20:16:17Z)
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z)
- VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending [78.1399386935455]
Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations.
We propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending.
VLAB transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks.
arXiv Detail & Related papers (2023-05-22T15:54:22Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate a lack of story understanding benchmarks, we publicly release the first dataset on a crucial task in computational social science on persuasion strategy identification.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z)
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study for advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)