CREATE: A Benchmark for Chinese Short Video Retrieval and Title
Generation
- URL: http://arxiv.org/abs/2203.16763v1
- Date: Thu, 31 Mar 2022 02:39:18 GMT
- Title: CREATE: A Benchmark for Chinese Short Video Retrieval and Title
Generation
- Authors: Ziqi Zhang, Yuxin Chen, Zongyang Ma, Zhongang Qi, Chunfeng Yuan, Bing
Li, Ying Shan, Weiming Hu
- Abstract summary: We propose CREATE, the first large-scale Chinese shoRt vidEo retrievAl and Title gEneration benchmark, to facilitate research and application in video titling and video retrieval in Chinese.
CREATE consists of a high-quality labeled 210K dataset and two large-scale 3M/10M pre-training datasets, covering 51 categories, 50K+ tags, 537K manually annotated titles and captions, and 10M+ short videos.
Based on CREATE, we propose a novel model ALWIG which combines video retrieval and video titling tasks to achieve the purpose of multi-modal ALignment WIth Generation with the help of video tags and a GPT pre-trained model.
- Score: 54.7561946475866
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Previous works on video captioning aim to objectively describe the video's
actual content, which lacks subjective and attractive expression, limiting its
practical application scenarios. Video titling is intended to achieve this
goal, but there is a lack of a proper benchmark. In this paper, we propose
CREATE, the first large-scale Chinese shoRt vidEo retrievAl and Title
gEneration benchmark, to facilitate research and application in video titling
and video retrieval in Chinese. CREATE consists of a high-quality labeled 210K
dataset and two large-scale 3M/10M pre-training datasets, covering 51
categories, 50K+ tags, 537K manually annotated titles and captions, and 10M+
short videos. Based on CREATE, we propose a novel model ALWIG which combines
video retrieval and video titling tasks to achieve the purpose of multi-modal
ALignment WIth Generation with the help of video tags and a GPT pre-trained
model. CREATE opens new directions for facilitating future research and
applications on video titling and video retrieval in the field of Chinese short
videos.
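The abstract describes ALWIG only at a high level. As an illustration of how an "alignment with generation" objective of this kind can be wired up, the following is a minimal, hypothetical PyTorch sketch: a symmetric contrastive loss aligns pooled video and title embeddings for the retrieval task, while a decoder conditioned on the video features is trained to generate the title. The module choices, dimensions, and the plain transformer decoder standing in for a GPT pre-trained model are assumptions, not the authors' implementation.

```python
# Hypothetical alignment-with-generation sketch; not the ALWIG code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AlignWithGeneration(nn.Module):
    def __init__(self, frame_dim=2048, vocab_size=21128, d_model=512, nhead=8):
        super().__init__()
        self.video_proj = nn.Linear(frame_dim, d_model)     # project frame features
        self.token_emb = nn.Embedding(vocab_size, d_model)  # title token embeddings
        self.text_enc = nn.GRU(d_model, d_model, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.log_temp = nn.Parameter(torch.tensor(0.07).log())

    def forward(self, frames, title_ids):
        # frames: (B, T, frame_dim) pre-extracted frame features
        # title_ids: (B, L) tokenized titles
        vid = self.video_proj(frames)                        # (B, T, d_model)
        vid_global = vid.mean(dim=1)                         # pooled video embedding
        txt = self.token_emb(title_ids)                      # (B, L, d_model)
        _, txt_h = self.text_enc(txt)
        txt_global = txt_h.squeeze(0)                        # pooled title embedding

        # Alignment branch: symmetric InfoNCE over in-batch video/title pairs.
        v = F.normalize(vid_global, dim=-1)
        t = F.normalize(txt_global, dim=-1)
        logits = v @ t.T / self.log_temp.exp()
        labels = torch.arange(v.size(0), device=v.device)
        loss_align = (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.T, labels)) / 2

        # Generation branch: decode the title with teacher forcing,
        # cross-attending to the video tokens.
        dec_in = txt[:, :-1]
        L = dec_in.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf"),
                                       device=frames.device), diagonal=1)
        hidden = self.decoder(dec_in, vid, tgt_mask=causal)
        loss_gen = F.cross_entropy(self.lm_head(hidden).transpose(1, 2),
                                   title_ids[:, 1:])
        return loss_align + loss_gen


# Toy usage with random tensors standing in for real features and tokens.
model = AlignWithGeneration()
frames = torch.randn(4, 16, 2048)           # 4 videos, 16 frames each
titles = torch.randint(0, 21128, (4, 20))   # 4 tokenized titles of length 20
model(frames, titles).backward()
```

At inference time, the pooled embeddings would support retrieval by nearest-neighbour search, while the decoder would produce titles autoregressively.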
Related papers
- SCBench: A Sports Commentary Benchmark for Video LLMs [19.13963551534595]
We develop a benchmark for sports video commentary generation for Video Large Language Models (Video LLMs).
SCBench is a six-dimensional metric specifically designed for our task, upon which we propose a GPT-based evaluation method (a generic sketch of such an evaluation follows this entry).
Our results show that InternVL-Chat-2 achieves the best performance with 5.44, surpassing the second-best by 1.04.
arXiv Detail & Related papers (2024-12-23T15:13:56Z)
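SCBench's GPT-based evaluation is only mentioned in the summary above, so here is a generic, hypothetical sketch of the LLM-as-judge pattern such a method typically follows, using the OpenAI Python SDK. The dimension names, prompt wording, and judge model are placeholders, not SCBench's actual six-dimensional metric.

```python
# Hedged sketch of a GPT-based scoring loop for generated sports commentary.
# The six dimension names below are placeholders, NOT SCBench's actual metric.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PLACEHOLDER_DIMENSIONS = [
    "factual accuracy", "event coverage", "fluency",
    "terminology use", "timeliness", "engagement",
]

def score_commentary(video_description: str, commentary: str) -> str:
    """Ask a chat model to rate commentary on each placeholder dimension (1-5)."""
    prompt = (
        "You are grading sports video commentary.\n"
        f"Video description: {video_description}\n"
        f"Commentary: {commentary}\n"
        "Rate the commentary from 1 to 5 on each dimension and explain briefly:\n"
        + "\n".join(f"- {d}" for d in PLACEHOLDER_DIMENSIONS)
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

# Example call (requires a valid API key):
# print(score_commentary("Basket scored at the buzzer.", "What a finish!"))
```

In practice, the returned text would be parsed into per-dimension scores and averaged across the benchmark.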
- Video Repurposing from User Generated Content: A Large-scale Dataset and Benchmark [5.76230561819199]
We propose Repurpose-10K, an extensive dataset comprising over 10,000 videos with more than 120,000 annotated clips.
We propose a two-stage solution to obtain annotations from real-world user-generated content.
We offer a baseline model to address this challenging task by integrating audio, visual, and caption aspects.
arXiv Detail & Related papers (2024-12-12T02:27:46Z)
- AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark [89.73538448786405]
We propose AuroraCap, a video captioner based on a large multimodal model.
We implement the token merging strategy, reducing the number of input visual tokens (a simplified sketch follows this entry).
AuroraCap shows superior performance on various video and image captioning benchmarks.
arXiv Detail & Related papers (2024-10-04T00:13:54Z)
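AuroraCap's summary mentions a token merging strategy without further detail. The sketch below illustrates the general idea behind such strategies (averaging highly similar visual tokens so fewer tokens reach the language model); the alternating split and the greedy merge rule are illustrative assumptions, not AuroraCap's implementation.

```python
# Hypothetical sketch of a token merging step: pairs of highly similar visual
# tokens are averaged so fewer tokens are fed to the LLM. Illustrative only.
import torch

def merge_visual_tokens(tokens: torch.Tensor, r: int) -> torch.Tensor:
    """Merge r token pairs with the highest cosine similarity.

    tokens: (N, D) visual tokens for one video/image; assumes r <= N // 2.
    Returns a (N - r, D) tensor.
    """
    # Split tokens into two alternating groups and score cross-group similarity.
    a, b = tokens[0::2], tokens[1::2]                      # (Na, D), (Nb, D)
    a_n = torch.nn.functional.normalize(a, dim=-1)
    b_n = torch.nn.functional.normalize(b, dim=-1)
    sim = a_n @ b_n.T                                      # (Na, Nb) cosine similarities

    # For each token in `a`, find its best match in `b`; keep the r strongest edges.
    best_sim, best_idx = sim.max(dim=-1)                   # (Na,), (Na,)
    merge_src = best_sim.topk(r).indices                   # indices into `a` to merge away

    merged = b.clone()
    kept_mask = torch.ones(a.size(0), dtype=torch.bool)
    for src in merge_src.tolist():
        dst = best_idx[src].item()
        merged[dst] = (merged[dst] + a[src]) / 2           # average the merged pair
        kept_mask[src] = False

    return torch.cat([a[kept_mask], merged], dim=0)        # (N - r, D)

# Toy usage: reduce 256 visual tokens to 192 before feeding a language model.
tokens = torch.randn(256, 1024)
reduced = merge_visual_tokens(tokens, r=64)
print(reduced.shape)  # torch.Size([192, 1024])
```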
- Vript: A Video Is Worth Thousands of Words [54.815686588378156]
Vript is an annotated corpus of 12K high-resolution videos, offering detailed, dense, and script-like captions for over 420K clips.
Each clip has a caption of about 145 words, over 10x longer than in most video-text datasets.
Alongside the corpus, the paper presents a model capable of end-to-end generation of dense and detailed captions for long videos.
arXiv Detail & Related papers (2024-06-10T06:17:55Z)
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z)
- Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks [63.09588102724274]
We release the largest public Chinese high-quality video-language dataset named Youku-mPLUG.
Youku-mPLUG contains 10 million Chinese video-text pairs filtered from 400 million raw videos across a wide range of 45 diverse categories for large-scale pre-training.
We build the largest human-annotated Chinese benchmarks covering three popular video-language tasks: cross-modal retrieval, video captioning, and video category classification.
arXiv Detail & Related papers (2023-06-07T11:52:36Z)
- VIOLIN: A Large-Scale Dataset for Video-and-Language Inference [103.7457132841367]
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.
Given a video clip with aligned subtitles as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip (a toy formulation is sketched after this entry).
A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips.
arXiv Detail & Related papers (2020-03-25T20:39:05Z)
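The VIOLIN task above reduces to a binary decision over fused video and text representations. The following toy formulation is a hypothetical sketch under assumed feature dimensions and a simple fusion MLP; it is not the VIOLIN baseline.

```python
# Hypothetical video-and-language inference sketch: given pooled video features,
# subtitle text, and a hypothesis, predict entailed (1) vs. contradicted (0).
import torch
import torch.nn as nn

class VideoLanguageInference(nn.Module):
    def __init__(self, video_dim=2048, text_dim=768, hidden=512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(video_dim + 2 * text_dim, hidden),  # video + subtitle + hypothesis
            nn.ReLU(),
            nn.Linear(hidden, 1),                          # entailment logit
        )

    def forward(self, video_feat, subtitle_emb, hypothesis_emb):
        # Each input is a pooled vector per example: (B, dim).
        x = torch.cat([video_feat, subtitle_emb, hypothesis_emb], dim=-1)
        return self.fuse(x).squeeze(-1)                    # (B,) raw logits

# Toy usage: binary cross-entropy against entailed/contradicted labels.
model = VideoLanguageInference()
logits = model(torch.randn(8, 2048), torch.randn(8, 768), torch.randn(8, 768))
loss = nn.functional.binary_cross_entropy_with_logits(
    logits, torch.randint(0, 2, (8,)).float())
```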
This list is automatically generated from the titles and abstracts of the papers listed on this site.