CREATE: A Benchmark for Chinese Short Video Retrieval and Title
Generation
- URL: http://arxiv.org/abs/2203.16763v1
- Date: Thu, 31 Mar 2022 02:39:18 GMT
- Title: CREATE: A Benchmark for Chinese Short Video Retrieval and Title
Generation
- Authors: Ziqi Zhang, Yuxin Chen, Zongyang Ma, Zhongang Qi, Chunfeng Yuan, Bing
Li, Ying Shan, Weiming Hu
- Abstract summary: We propose CREATE, the first large-scale Chinese shoRt vidEo retrievAl and Title gEneration benchmark, to facilitate research and application in video titling and video retrieval in Chinese.
CREATE consists of a high-quality labeled 210K dataset and two large-scale 3M/10M pre-training datasets, covering 51 categories, 50K+ tags, 537K manually annotated titles and captions, and 10M+ short videos.
Based on CREATE, we propose a novel model, ALWIG, which combines the video retrieval and video titling tasks to achieve multi-modal ALignment WIth Generation.
- Score: 54.7561946475866
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Previous work on video captioning aims to objectively describe a video's
actual content; such descriptions lack subjective and attractive expression, which
limits their practical application scenarios. Video titling is intended to achieve
this goal, but a proper benchmark has been lacking. In this paper, we propose
CREATE, the first large-scale Chinese shoRt vidEo retrievAl and Title
gEneration benchmark, to facilitate research and application in video titling
and video retrieval in Chinese. CREATE consists of a high-quality labeled 210K
dataset and two large-scale 3M/10M pre-training datasets, covering 51
categories, 50K+ tags, 537K manually annotated titles and captions, and 10M+
short videos. Based on CREATE, we propose a novel model, ALWIG, which combines
the video retrieval and video titling tasks to achieve multi-modal ALignment WIth
Generation with the help of video tags and a pre-trained GPT model. CREATE opens
new directions for future research and applications on video titling and video
retrieval in the field of Chinese short videos.
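Based on the abstract, ALWIG pairs a retrieval-style video-text alignment objective with a tag-conditioned, GPT-style title decoder. The snippet below is a minimal PyTorch sketch of such an alignment-with-generation setup; the module names, dimensions, and exact losses are illustrative assumptions and do not reproduce the authors' released implementation.

```python
# Illustrative sketch of joint alignment (retrieval) and generation (titling).
# All names and sizes are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AlignmentWithGeneration(nn.Module):
    def __init__(self, vocab_size=21128, dim=768, num_decoder_layers=6):
        super().__init__()
        # Projections used for the retrieval (alignment) head.
        self.video_proj = nn.Linear(dim, dim)
        self.text_proj = nn.Linear(dim, dim)
        # GPT-style causal decoder used for the titling head.
        self.token_emb = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_decoder_layers)
        self.lm_head = nn.Linear(dim, vocab_size)

    def contrastive_loss(self, video_feat, text_feat, temperature=0.07):
        # Symmetric InfoNCE over in-batch video/title pairs (retrieval task).
        v = F.normalize(self.video_proj(video_feat), dim=-1)
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        logits = v @ t.T / temperature
        labels = torch.arange(v.size(0), device=v.device)
        return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

    def generation_loss(self, frame_feats, tag_feats, title_ids):
        # Titling task: the causal decoder attends to frame and tag features
        # and predicts the next title token.
        memory = torch.cat([frame_feats, tag_feats], dim=1)
        tgt = self.token_emb(title_ids[:, :-1])
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        logits = self.lm_head(hidden)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               title_ids[:, 1:].reshape(-1))
```

One straightforward way to train both tasks jointly, under these assumptions, is to sum the two losses for each batch so that the aligned video features serve both retrieval and title generation.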
Related papers
- AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark [73.62572976072578]
We propose AuroraCap, a video captioner based on a large multimodal model.
We implement the token merging strategy, reducing the number of input visual tokens (sketched below).
AuroraCap shows superior performance on various video and image captioning benchmarks.
arXiv Detail & Related papers (2024-10-04T00:13:54Z)
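AuroraCap's summary mentions a token merging strategy for shrinking the number of visual tokens fed to the language model. Below is a deliberately simple sketch of one common merging scheme (iteratively averaging the most similar pair of tokens); it is an assumption-laden illustration, not AuroraCap's actual algorithm.

```python
# Simplified token-merging sketch: repeatedly average the most similar
# pair of visual tokens until only `keep` tokens remain.
import torch
import torch.nn.functional as F


def merge_tokens(tokens: torch.Tensor, keep: int) -> torch.Tensor:
    """tokens: (num_tokens, dim) visual tokens; returns (keep, dim)."""
    tokens = tokens.clone()
    while tokens.size(0) > keep:
        normed = F.normalize(tokens, dim=-1)
        sim = normed @ normed.T
        sim.fill_diagonal_(float("-inf"))               # ignore self-similarity
        i, j = divmod(int(sim.argmax()), sim.size(1))   # most similar pair
        merged = (tokens[i] + tokens[j]) / 2            # average the pair
        rest = torch.cat([tokens[:j], tokens[j + 1:]])  # drop token j ...
        rest[i if i < j else i - 1] = merged            # ... and replace token i
        tokens = rest
    return tokens


# Example: reduce 256 patch tokens of dimension 768 down to 64 tokens.
reduced = merge_tokens(torch.randn(256, 768), keep=64)
print(reduced.shape)  # torch.Size([64, 768])
```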
- Vript: A Video Is Worth Thousands of Words [54.815686588378156]
Vript is an annotated corpus of 12K high-resolution videos, offering detailed, dense, and script-like captions for over 420K clips.
Each clip has a caption of roughly 145 words, over 10x longer than the captions in most video-text datasets.
Based on Vript, the authors train a captioning model capable of end-to-end generation of dense and detailed captions for long videos.
arXiv Detail & Related papers (2024-06-10T06:17:55Z)
- Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos [58.13927287437394]
We present a new multi-shot video understanding benchmark Shot2Story20K with detailed shot-level captions and comprehensive video summaries.
Preliminary experiments show that generating a long and comprehensive video summary remains challenging.
arXiv Detail & Related papers (2023-12-16T03:17:30Z)
- StoryBench: A Multifaceted Benchmark for Continuous Story Visualization [42.439670922813434]
We introduce StoryBench: a new, challenging multi-task benchmark to reliably evaluate text-to-video models.
Our benchmark includes three video generation tasks of increasing difficulty: action execution, story continuation, and story generation.
We evaluate small yet strong text-to-video baselines, and show the benefits of training on story-like data algorithmically generated from existing video captions.
arXiv Detail & Related papers (2023-08-22T17:53:55Z)
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z)
- Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks [63.09588102724274]
We release the largest public Chinese high-quality video-language dataset named Youku-mPLUG.
Youku-mPLUG contains 10 million Chinese video-text pairs filtered from 400 million raw videos across a wide range of 45 diverse categories for large-scale pre-training.
We build the largest human-annotated Chinese benchmarks covering three popular video-language tasks of cross-modal retrieval, video captioning, and video category classification.
arXiv Detail & Related papers (2023-06-07T11:52:36Z)