Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for
Pre-training and Benchmarks
- URL: http://arxiv.org/abs/2306.04362v1
- Date: Wed, 7 Jun 2023 11:52:36 GMT
- Title: Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for
Pre-training and Benchmarks
- Authors: Haiyang Xu, Qinghao Ye, Xuan Wu, Ming Yan, Yuan Miao, Jiabo Ye, Guohai
Xu, Anwen Hu, Yaya Shi, Guangwei Xu, Chenliang Li, Qi Qian, Maofei Que, Ji
Zhang, Xiao Zeng, Fei Huang
- Abstract summary: We release the largest public Chinese high-quality video-language dataset named Youku-mPLUG.
Youku-mPLUG contains 10 million Chinese video-text pairs filtered from 400 million raw videos across a wide range of 45 diverse categories for large-scale pre-training.
We build the largest human-annotated Chinese benchmarks covering three popular video-language tasks of cross-modal retrieval, video captioning, and video category classification.
- Score: 63.09588102724274
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: To promote the development of Vision-Language Pre-training (VLP) and
multimodal Large Language Models (LLMs) in the Chinese community, we first
release the largest public Chinese high-quality video-language dataset named
Youku-mPLUG, which is collected from Youku, a well-known Chinese video-sharing
website, with strict criteria of safety, diversity, and quality. Youku-mPLUG
contains 10 million Chinese video-text pairs filtered from 400 million raw
videos across a wide range of 45 diverse categories for large-scale
pre-training. In addition, to facilitate a comprehensive evaluation of
video-language models, we carefully build the largest human-annotated Chinese
benchmarks covering three popular video-language tasks of cross-modal
retrieval, video captioning, and video category classification. Youku-mPLUG can
enable researchers to conduct more in-depth multimodal research and develop
better applications in the future. Furthermore, we release popular
video-language pre-training models, ALPRO and mPLUG-2, and our proposed
modularized decoder-only model mPLUG-video pre-trained on Youku-mPLUG.
Experiments show that models pre-trained on Youku-mPLUG gain up to a 23.1%
improvement in video category classification. Moreover, mPLUG-video achieves
new state-of-the-art results on these benchmarks, with 80.5% top-1 accuracy in
video category classification and a 68.9 CIDEr score in video captioning.
Finally, we scale up mPLUG-video based on the frozen Bloomz with only 1.7%
trainable parameters as a Chinese multimodal LLM, and it demonstrates
impressive instruction-following and video understanding abilities. The zero-shot
instruction understanding experiment indicates that pretraining with
Youku-mPLUG can enhance the ability to comprehend overall and detailed visual
semantics, recognize scene text, and leverage open-domain knowledge.
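The benchmarks above report standard metrics: top-1 accuracy for video category classification, CIDEr for video captioning, and (typically) recall-based scores for cross-modal retrieval. As a rough illustration only, not the released evaluation code, the Python sketch below shows how the first and last of these could be computed; the tensor names, shapes, and the assumption that the i-th caption is paired with the i-th video are assumptions of this sketch.

```python
# Minimal sketch of two benchmark metrics: top-1 accuracy for video category
# classification and text-to-video Recall@1 for cross-modal retrieval.
# All names and shapes are illustrative assumptions, not the official code.
import torch
import torch.nn.functional as F


def top1_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """logits: (N, num_categories); labels: (N,) ground-truth category ids."""
    return (logits.argmax(dim=-1) == labels).float().mean().item()


def recall_at_1(text_emb: torch.Tensor, video_emb: torch.Tensor) -> float:
    """Assumes the i-th text is the ground-truth match for the i-th video."""
    sim = F.normalize(text_emb, dim=-1) @ F.normalize(video_emb, dim=-1).t()
    nearest = sim.argmax(dim=-1)  # best-matching video for each text query
    return (nearest == torch.arange(sim.size(0))).float().mean().item()


if __name__ == "__main__":
    # Toy example over the dataset's 45 categories (mPLUG-video is reported
    # to reach 80.5% top-1 accuracy on the real classification benchmark).
    logits = torch.randn(8, 45)
    labels = torch.randint(0, 45, (8,))
    print(f"top-1 accuracy: {top1_accuracy(logits, labels):.3f}")
```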
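The abstract's claim of scaling mPLUG-video on a frozen Bloomz with only 1.7% trainable parameters amounts to a standard parameter-efficiency recipe: freeze the language decoder and train only a small visual-to-text bridge. Below is a minimal sketch of that recipe, assuming a Hugging Face Bloomz checkpoint; the VisualAbstractor module, its dimensions, and the chosen checkpoint are illustrative assumptions rather than the actual mPLUG-video implementation.

```python
# Sketch: freeze a Bloomz-style decoder and train only a small bridge module.
# The "VisualAbstractor" below is a hypothetical stand-in, not mPLUG-video.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM


class VisualAbstractor(nn.Module):
    """Trainable bridge mapping frame features into the LLM embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 2048, num_queries: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim))
        self.proj = nn.Linear(vision_dim, llm_dim)
        self.attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames * num_patches, vision_dim)
        kv = self.proj(frame_feats)
        q = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)
        return out  # (batch, num_queries, llm_dim) video tokens for the frozen LLM


llm = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-1b7")  # assumed checkpoint
for p in llm.parameters():
    p.requires_grad = False  # freeze the language model entirely

bridge = VisualAbstractor(llm_dim=llm.config.hidden_size)

trainable = sum(p.numel() for p in bridge.parameters())
total = trainable + sum(p.numel() for p in llm.parameters())
# Only the bridge is updated; the paper reports 1.7% trainable parameters
# for its own configuration.
print(f"trainable fraction: {100 * trainable / total:.2f}%")
```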
Related papers
- YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
- Learning Video Representations from Large Language Models [31.11998135196614]
We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs).
We repurpose pre-trained LLMs to be conditioned on visual input, and finetune them to create automatic video narrators.
Our auto-generated narrations offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual information and text, and much higher diversity of text.
arXiv Detail & Related papers (2022-12-08T18:59:59Z)
- Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese [55.95225353842118]
We construct a large-scale dataset of image-text pairs in Chinese, where most data are retrieved from publicly available datasets.
We develop 5 Chinese CLIP models of multiple sizes, spanning from 77 to 958 million parameters.
Our experiments demonstrate that Chinese CLIP can achieve the state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN.
arXiv Detail & Related papers (2022-11-02T17:47:23Z)
- WeLM: A Well-Read Pre-trained Language Model for Chinese [37.68378062625651]
We present WeLM: a well-read pre-trained language model for Chinese.
We show that WeLM is equipped with broad knowledge on various domains and languages.
arXiv Detail & Related papers (2022-09-21T14:05:30Z)
- CREATE: A Benchmark for Chinese Short Video Retrieval and Title Generation [54.7561946475866]
We propose CREATE, the first large-scale Chinese shoRt vidEo retrievAl and Title gEneration benchmark, to facilitate research and application in video titling and video retrieval in Chinese.
CREATE consists of a high-quality labeled 210K dataset and two large-scale 3M/10M pre-training datasets, covering 51 categories, 50K+ tags, 537K manually annotated titles and captions, and 10M+ short videos.
Based on CREATE, we propose a novel model ALWIG which combines video retrieval and video titling tasks to achieve the purpose of multi-modal ALignment WIth Generation.
arXiv Detail & Related papers (2022-03-31T02:39:18Z)
- Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training [79.88705563918413]
We propose a novel video-language understanding framework named VICTOR, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training.
VICTOR is trained on a large-scale Chinese video-language dataset, including over 10 million complete videos with corresponding high-quality textual descriptions.
arXiv Detail & Related papers (2021-04-19T15:58:45Z)
- Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models [144.85290716246533]
We study zero-shot cross-lingual transfer of vision-language models.
We propose a Transformer-based model that learns contextualized multilingual multimodal embeddings.
arXiv Detail & Related papers (2021-03-16T04:37:40Z)