Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
- URL: http://arxiv.org/abs/2402.19479v1
- Date: Thu, 29 Feb 2024 18:59:50 GMT
- Title: Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
- Authors: Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina
Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian
Ren, Ming-Hsuan Yang, Sergey Tulyakov
- Abstract summary: We propose an automatic approach to establish a video dataset with high-quality captions.
Specifically, we curate 3.8M high-resolution videos from the publicly available HD-VILA-100M dataset.
We then apply multiple cross-modality teacher models to obtain captions for each video.
In this way, we get 70M videos paired with high-quality text captions.
- Score: 93.65253661843145
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The quality of the data and annotation upper-bounds the quality of a
downstream model. While there exist large text corpora and image-text pairs,
high-quality video-text data is much harder to collect. First of all, manual
labeling is more time-consuming, as it requires an annotator to watch an entire
video. Second, videos have a temporal dimension, consisting of several scenes
stacked together, and showing multiple actions. Accordingly, to establish a
video dataset with high-quality captions, we propose an automatic approach
leveraging multimodal inputs, such as textual video description, subtitles, and
individual video frames. Specifically, we curate 3.8M high-resolution videos
from the publicly available HD-VILA-100M dataset. We then split them into
semantically consistent video clips, and apply multiple cross-modality teacher
models to obtain captions for each video. Next, we finetune a retrieval model
on a small subset where the best caption of each video is manually selected and
then employ the model in the whole dataset to select the best caption as the
annotation. In this way, we get 70M videos paired with high-quality text
captions. We dub the dataset as Panda-70M. We show the value of the proposed
dataset on three downstream tasks: video captioning, video and text retrieval,
and text-driven video generation. The models trained on the proposed data score
substantially better on the majority of metrics across all the tasks.
Related papers
- Vript: A Video Is Worth Thousands of Words [54.815686588378156]
Vript is an annotated corpus of 12K high-resolution videos, offering detailed, dense, and script-like captions for over 420K clips.
Each clip has a caption of 145 words, which is over 10x longer than most video-text datasets.
Vript is a powerful model capable of end-to-end generation of dense and detailed captions for long videos.
arXiv Detail & Related papers (2024-06-10T06:17:55Z) - Distilling Vision-Language Models on Millions of Videos [62.92789440875999]
We fine-tune a video-language model from a strong image-language baseline with synthesized instructional data.
The resulting video model by video-instruction-tuning (VIIT) is then used to auto-label millions of videos to generate high-quality captions.
As a side product, we generate the largest video caption dataset to date.
arXiv Detail & Related papers (2024-01-11T18:59:53Z) - HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [72.69268311756082]
We propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale.
We introduce a prompting method that is able to take into account a longer text of subtitles, allowing us to capture the contextual information beyond one single sentence.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
arXiv Detail & Related papers (2023-10-07T19:32:55Z) - InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding
and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions of total 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z) - Scalable and Accurate Self-supervised Multimodal Representation Learning
without Aligned Video and Text Data [18.479220305684837]
Recent advances in image captioning allow us to pre-train high-quality video models without parallel video-text data.
We show that image captioning pseudolabels work better for pre-training than the existing HowTo100M ASR captions.
arXiv Detail & Related papers (2023-04-04T19:11:05Z) - Language Models with Image Descriptors are Strong Few-Shot
Video-Language Learners [167.0346394848718]
We propose VidIL, a few-shot Video-language Learner via Image and Language models.
We use the image-language models to translate the video content into frame captions, object, attribute, and event phrases.
We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content.
arXiv Detail & Related papers (2022-05-22T05:18:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.