CCMB: A Large-scale Chinese Cross-modal Benchmark
- URL: http://arxiv.org/abs/2205.03860v6
- Date: Wed, 8 Nov 2023 09:45:00 GMT
- Title: CCMB: A Large-scale Chinese Cross-modal Benchmark
- Authors: Chunyu Xie, Heng Cai, Jincheng Li, Fanjing Kong, Xiaoyu Wu, Jianfei
Song, Henrique Morimitsu, Lin Yao, Dexin Wang, Xiangzheng Zhang, Dawei Leng,
Baochang Zhang, Xiangyang Ji, Yafeng Deng
- Abstract summary: We build a large-scale high-quality Chinese Cross-Modal Benchmark named CCMB for the research community.
The pre-training dataset Zero contains 250 million images paired with 750 million text descriptions, and two of the five fine-tuning datasets are also the largest ones for Chinese cross-modal downstream tasks.
- Score: 46.349966178044184
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language pre-training (VLP) on large-scale datasets has shown premier
performance on various downstream tasks. In contrast to the many available
benchmarks with English corpora, large-scale pre-training datasets and
downstream datasets with Chinese corpora remain largely unexplored. In this
work, we build a large-scale, high-quality Chinese Cross-Modal Benchmark named
CCMB for the research community, which contains the currently largest public
pre-training dataset, Zero, and five human-annotated fine-tuning datasets for
downstream tasks. Zero contains 250 million images paired with 750 million text
descriptions, and two of the five fine-tuning datasets are also currently the
largest ones for Chinese cross-modal downstream tasks. Along with CCMB, we
also develop a VLP framework named R2D2, which applies a pre-Ranking + Ranking
strategy to learn powerful vision-language representations and a two-way
distillation method (i.e., target-guided distillation and feature-guided
distillation) to further enhance the learning capability. With Zero and the
R2D2 VLP framework, we achieve state-of-the-art performance on twelve
downstream datasets from five broad categories of tasks, including image-text
retrieval, image-text matching, image captioning, text-to-image generation, and
zero-shot image classification. The datasets, models, and code are available
at https://github.com/yuxie11/R2D2
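As a rough, editorial illustration of the pre-Ranking + Ranking strategy and the two-way distillation mentioned in the abstract, the following PyTorch-style sketch shows one plausible form of the four loss terms. It is a minimal sketch under assumed design choices (a dual-encoder contrastive pre-ranking stage, a cross-encoder matching head, and a momentum teacher); the function names, dimensions, and weights are placeholders, not the released R2D2 implementation.

    # Minimal sketch, not the official R2D2 code. Assumes a dual-encoder
    # contrastive "pre-Ranking" stage, a cross-encoder "Ranking" (image-text
    # matching) head, and a momentum teacher for the two distillation terms.
    import torch
    import torch.nn.functional as F

    def pre_ranking_loss(img_emb, txt_emb, temperature=0.07):
        # InfoNCE-style loss over in-batch image-text pairs (dual encoder).
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        logits = img_emb @ txt_emb.t() / temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

    def ranking_loss(match_logits, match_labels):
        # Binary image-text matching loss from a cross-encoder head; hard
        # negatives would be mined with the pre-ranking similarities.
        return F.cross_entropy(match_logits, match_labels)

    def target_guided_distillation(student_logits, teacher_logits, tau=1.0):
        # KL divergence between student predictions and teacher soft targets.
        teacher_prob = F.softmax(teacher_logits / tau, dim=-1)
        student_logprob = F.log_softmax(student_logits / tau, dim=-1)
        return F.kl_div(student_logprob, teacher_prob, reduction="batchmean")

    def feature_guided_distillation(student_feat, teacher_feat):
        # Align student features with (detached) momentum-teacher features.
        return 1.0 - F.cosine_similarity(student_feat,
                                         teacher_feat.detach(), dim=-1).mean()

In practice the four terms would be summed with task-specific weights; the exact weighting and negative-mining scheme used by R2D2 are described in the paper rather than here.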
Related papers
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
- Vision-Language Dataset Distillation [26.886260846439612]
We design the first vision-language dataset distillation method, building on the idea of trajectory matching.
A key challenge is that vision-language datasets do not have a set of discrete classes.
Our proposed method jointly distills image-text pairs in a contrastive formulation.
arXiv Detail & Related papers (2023-08-15T03:22:40Z)
- Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese [55.95225353842118]
We construct a large-scale dataset of image-text pairs in Chinese, where most data are retrieved from publicly available datasets.
We develop five Chinese CLIP models of multiple sizes, spanning 77 to 958 million parameters.
Our experiments demonstrate that Chinese CLIP can achieve state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN; a generic sketch of the zero-shot classification recipe that such dual encoders support appears after this list.
arXiv Detail & Related papers (2022-11-02T17:47:23Z)
- LAION-5B: An open large-scale dataset for training next generation image-text models [16.129935376579326]
We present LAION-5B, a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32 billion are English-language pairs.
We show successful replication and fine-tuning of foundational models like CLIP, GLIDE and Stable Diffusion using the dataset.
We also provide several nearest-neighbor indices and an improved web interface for dataset exploration and subset generation.
arXiv Detail & Related papers (2022-10-16T00:08:18Z)
- WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models [2.603259641572195]
We introduce a large-scale multi-modal corpus named WuDaoMM, containing more than 650M image-text pairs in total.
About 600 million pairs are collected from webpages in which the image and caption are only weakly correlated.
We also release a base version of WuDaoMM with 5 million strongly correlated image-text pairs, which is sufficient to support common cross-modal model pre-training.
arXiv Detail & Related papers (2022-03-22T06:12:20Z)
- Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment [66.77841319057299]
We propose a novel unsupervised Vision-and-Language pre-training curriculum for non-parallel texts and images.
We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks.
A comprehensive ablation study shows that each granularity helps learn a stronger pre-trained model.
arXiv Detail & Related papers (2022-03-01T05:34:01Z)
- Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework [99.38817546900405]
We release a large-scale Chinese cross-modal dataset named Wukong, containing 100 million Chinese image-text pairs collected from the web, for benchmarking different multi-modal pre-training methods.
arXiv Detail & Related papers (2022-02-14T14:37:15Z)
- Text-Based Person Search with Limited Data [66.26504077270356]
Text-based person search (TBPS) aims at retrieving a target person from an image gallery with a descriptive text query.
We present a framework with two novel components to handle the problems brought by limited data.
arXiv Detail & Related papers (2021-10-20T22:20:47Z)
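Several entries above (Chinese CLIP in particular, and the zero-shot image classification task in the CCMB abstract) rely on CLIP-style dual encoders at inference time. The sketch below shows the standard zero-shot classification recipe with such a model; image_encoder, text_encoder, and tokenize are hypothetical placeholders rather than the API of any specific released model, and the Chinese prompt template is only an example.

    # Generic zero-shot classification with a CLIP-style dual encoder.
    # `image_encoder`, `text_encoder`, and `tokenize` are assumed placeholders.
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def zero_shot_classify(image, class_names, image_encoder, text_encoder, tokenize):
        # One prompt per class, e.g. a simple Chinese template ("a photo of a {name}").
        prompts = ["一张{}的照片".format(name) for name in class_names]
        txt_emb = F.normalize(text_encoder(tokenize(prompts)), dim=-1)    # (C, D)
        img_emb = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)  # (1, D)
        # Cosine similarity between the image and every class prompt,
        # turned into a distribution over classes.
        probs = (img_emb @ txt_emb.t()).softmax(dim=-1).squeeze(0)
        return class_names[int(probs.argmax())], probs

The same image-text similarity underlies the retrieval benchmarks mentioned above (MUGE, Flickr30K-CN, COCO-CN), with captions ranked in place of class prompts.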
This list is automatically generated from the titles and abstracts of the papers on this site.