Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese
- URL: http://arxiv.org/abs/2211.01335v3
- Date: Tue, 23 May 2023 01:28:21 GMT
- Title: Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese
- Authors: An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren
Zhou, Chang Zhou
- Abstract summary: We construct a large-scale dataset of image-text pairs in Chinese, where most data are retrieved from publicly available datasets.
We develop 5 Chinese CLIP models of multiple sizes, spanning from 77 to 958 million parameters.
Our experiments demonstrate that Chinese CLIP can achieve the state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN.
- Score: 55.95225353842118
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The tremendous success of CLIP (Radford et al., 2021) has promoted the
research and application of contrastive learning for vision-language
pretraining. In this work, we construct a large-scale dataset of image-text
pairs in Chinese, where most data are retrieved from publicly available
datasets, and we pretrain Chinese CLIP models on the new dataset. We develop 5
Chinese CLIP models of multiple sizes, spanning from 77 to 958 million
parameters. Furthermore, we propose a two-stage pretraining method, where the
model is first trained with the image encoder frozen and then trained with all
parameters being optimized, to achieve enhanced model performance. Our
comprehensive experiments demonstrate that Chinese CLIP can achieve the
state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN in the setups
of zero-shot learning and finetuning, and it is able to achieve competitive
performance in zero-shot image classification based on the evaluation on the
ELEVATER benchmark (Li et al., 2022). We have released our codes, models, and
demos in https://github.com/OFA-Sys/Chinese-CLIP
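To make the two-stage recipe concrete, below is a minimal PyTorch sketch: stage 1 runs a CLIP-style contrastive objective with the image encoder frozen, and stage 2 unfreezes all parameters. The toy encoders, embedding size, batch size, and learning rate are illustrative assumptions, not the released Chinese-CLIP implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyImageEncoder(nn.Module):
    """Stand-in for a pretrained vision tower (e.g., a ViT); purely illustrative."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(3 * 224 * 224, dim)
    def forward(self, images):
        return self.proj(images.flatten(1))

class ToyTextEncoder(nn.Module):
    """Stand-in for a Chinese text tower (e.g., a RoBERTa); purely illustrative."""
    def __init__(self, vocab_size=21128, dim=512):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, dim)
    def forward(self, token_ids):
        return self.emb(token_ids)

def clip_loss(img_emb, txt_emb, logit_scale):
    """Symmetric InfoNCE over in-batch negatives, as in CLIP."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = logit_scale.exp() * img_emb @ txt_emb.t()
    labels = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

image_enc, text_enc = ToyImageEncoder(), ToyTextEncoder()
logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07), CLIP's usual init

def run_stage(freeze_image_encoder, steps=2, batch_size=8):
    # Stage 1: image tower frozen; stage 2: every parameter is optimized.
    for p in image_enc.parameters():
        p.requires_grad = not freeze_image_encoder
    trainable = [p for p in list(image_enc.parameters()) + list(text_enc.parameters())
                 if p.requires_grad] + [logit_scale]
    opt = torch.optim.AdamW(trainable, lr=1e-4)
    for _ in range(steps):  # random tensors stand in for real image-text batches
        images = torch.randn(batch_size, 3, 224, 224)
        tokens = torch.randint(0, 21128, (batch_size, 16))
        loss = clip_loss(image_enc(images), text_enc(tokens), logit_scale)
        opt.zero_grad()
        loss.backward()
        opt.step()

run_stage(freeze_image_encoder=True)   # stage 1: contrastive tuning with frozen image encoder
run_stage(freeze_image_encoder=False)  # stage 2: all parameters optimized
```

Freezing the image tower first lets the Chinese text encoder adapt to the pretrained visual feature space cheaply, in the spirit of locked-image tuning, before the costlier joint optimization of stage 2.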
Related papers
- A Progressive Framework of Vision-language Knowledge Distillation and Alignment for Multilingual Scene [11.265838907079196]
We propose a conceptually simple yet effective multilingual CLIP Compression framework and train a lightweight multilingual vision-language model, called DC-CLIP, for both Chinese and English context.
In this framework, we collect high-quality Chinese and English text-image pairs and design two training stages: multilingual vision-language feature distillation (sketched below) and alignment.
Comprehensive experiments on zero-shot image classification, conducted on the ELEVATER benchmark, show that DC-CLIP achieves superior performance in the English context.
arXiv Detail & Related papers (2024-04-17T10:56:06Z)
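As a rough illustration of such a feature-distillation stage (not the DC-CLIP implementation), the sketch below pushes a small student projection toward a frozen teacher's embeddings with a cosine objective; the modules, dimensions, and data are hypothetical placeholders, and the subsequent alignment stage would reuse a CLIP-style contrastive loss like the one sketched earlier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def feature_distillation_loss(student_feats, teacher_feats):
    """Push L2-normalized student features toward the frozen teacher's features."""
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    return 1.0 - (s * t).sum(dim=-1).mean()  # mean cosine distance

# Hypothetical towers: a small student projection and a frozen teacher text encoder.
student = nn.Linear(256, 512)         # maps student features into the teacher's space
teacher = nn.Linear(768, 512).eval()  # stands in for a frozen pretrained encoder
for p in teacher.parameters():
    p.requires_grad = False

opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
student_inputs = torch.randn(8, 256)  # toy batch of student-side text features
teacher_inputs = torch.randn(8, 768)  # toy batch of teacher-side features (same captions)

with torch.no_grad():
    target = teacher(teacher_inputs)
loss = feature_distillation_loss(student(student_inputs), target)
opt.zero_grad()
loss.backward()
opt.step()
```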
- Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning [58.92843729869586]
Vision-language pre-trained models (VL-PTMs) have advanced multimodal research in recent years, but their mastery in a few languages like English restricts their applicability in broader communities.
We propose to extend VL-PTMs' language capacity by continual language learning (CLL), where a model needs to update its linguistic knowledge incrementally without suffering from catastrophic forgetting (CF).
We construct a CLL benchmark covering 36 languages based on MSCOCO and XM3600 datasets and then evaluate multilingual image-text retrieval performance.
arXiv Detail & Related papers (2024-01-30T17:14:05Z)
- NLLB-CLIP -- train performant multilingual image retrieval model on a budget [65.268245109828]
We present NLLB-CLIP, a CLIP model with a text encoder from the NLLB model.
We used an automatically created dataset of 106,246 good-quality images with captions in 201 languages.
We show that NLLB-CLIP is comparable in quality to state-of-the-art models and significantly outperforms them on low-resource languages.
arXiv Detail & Related papers (2023-09-04T23:26:11Z)
- Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks [63.09588102724274]
We release the largest public Chinese high-quality video-language dataset named Youku-mPLUG.
Youku-mPLUG contains 10 million Chinese video-text pairs filtered from 400 million raw videos across a wide range of 45 diverse categories for large-scale pre-training.
We build the largest human-annotated Chinese benchmarks covering three popular video-language tasks of cross-modal retrieval, video captioning, and video category classification.
arXiv Detail & Related papers (2023-06-07T11:52:36Z)
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment [146.3128011522151]
We propose CLIP-ViP, an Omni Crossmodal Learning method equipped with a Video Proxy mechanism built on top of CLIP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z)
- CCMB: A Large-scale Chinese Cross-modal Benchmark [46.349966178044184]
We build a large-scale high-quality Chinese Cross-Modal Benchmark named CCMB for the research community.
Its pre-training dataset, Zero, contains 250 million images paired with 750 million text descriptions, and two of the five fine-tuning datasets are the largest of their kind for Chinese cross-modal downstream tasks.
arXiv Detail & Related papers (2022-05-08T13:19:23Z)
- Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework [99.38817546900405]
This paper presents a large-scale Chinese cross-modal dataset for benchmarking different multi-modal pre-training methods.
We release a Large-Scale Chinese Cross-modal dataset named Wukong, containing 100 million Chinese image-text pairs from the web.
arXiv Detail & Related papers (2022-02-14T14:37:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.