CLUECorpus2020: A Large-scale Chinese Corpus for Pre-training Language Model
- URL: http://arxiv.org/abs/2003.01355v2
- Date: Thu, 5 Mar 2020 03:20:33 GMT
- Title: CLUECorpus2020: A Large-scale Chinese Corpus for Pre-training Language Model
- Authors: Liang Xu, Xuanwei Zhang, Qianqian Dong
- Abstract summary: We introduce the Chinese corpus from the CLUE organization, CLUECorpus2020.
It contains 100 GB of raw text with 35 billion Chinese characters retrieved from Common Crawl.
We release a new Chinese vocabulary of 8K tokens, only one-third the size of the vocabulary used in the Chinese BERT released by Google.
- Score: 15.469228003507919
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce the Chinese corpus from the CLUE
organization, CLUECorpus2020, a large-scale corpus that can be used directly
for self-supervised learning, such as pre-training of a language model, or for
language generation. It contains 100 GB of raw text with 35 billion Chinese
characters retrieved from Common Crawl. To better understand this corpus, we
conduct language understanding experiments at both small and large scale, and
the results show that models trained on this corpus achieve excellent
performance on Chinese. We release a new Chinese vocabulary of 8K tokens, only
one-third the size of the vocabulary used in the Chinese BERT released by
Google. It saves computational cost and memory while performing as well as the
original vocabulary. We also release both large and tiny versions of the model
pre-trained on this corpus. The former achieves state-of-the-art results, and
the latter retains most of the precision while accelerating training and
prediction by eight times compared to BERT-base. To facilitate future work on
self-supervised learning for Chinese, we release our dataset, new vocabulary,
code, and pre-trained models on GitHub.
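
The abstract's claim that the 8K vocabulary saves computational cost and memory can be made concrete with a quick parameter count. The sketch below is not from the paper's released code; it assumes the standard bert-base-chinese settings (hidden size 768, a 21,128-token vocabulary) and an illustrative exact size of 8,021 tokens for the new vocabulary, since the abstract only reports "8K".

# Back-of-the-envelope comparison of token-embedding parameters.
# Assumptions (not taken from the paper): hidden size 768 and a 21,128-token
# vocabulary are the standard bert-base-chinese settings; 8,021 is an
# illustrative exact count for the "8K" vocabulary reported in the abstract.

HIDDEN_SIZE = 768        # BERT-base hidden dimension
BERT_ZH_VOCAB = 21_128   # vocabulary size of Google's Chinese BERT
CLUE_VOCAB = 8_021       # illustrative size for the released 8K vocabulary

def embedding_params(vocab_size: int, hidden_size: int = HIDDEN_SIZE) -> int:
    """Parameters in the token-embedding matrix (shared with the MLM output layer when tied)."""
    return vocab_size * hidden_size

bert_params = embedding_params(BERT_ZH_VOCAB)   # ~16.2M parameters
clue_params = embedding_params(CLUE_VOCAB)      # ~6.2M parameters

print(f"Chinese BERT embeddings: {bert_params:,} parameters")
print(f"8K-vocab embeddings:     {clue_params:,} parameters")
print(f"reduction:               {1 - clue_params / bert_params:.0%}")

Under these assumptions the embedding table alone shrinks by roughly 62%, which is consistent with the abstract's claim of lower memory use at comparable quality.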
Related papers
- Large Vocabulary Size Improves Large Language Models [28.83786065307658]
We investigate the relationship between subword vocabulary size and the performance of large language models (LLMs).
Experimental results show that larger vocabulary sizes lead to better performance in LLMs.
We introduce a simple method to use a new vocabulary instead of the pre-defined one.
arXiv Detail & Related papers (2024-06-24T10:27:07Z)
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- Skywork: A More Open Bilingual Foundation Model [55.927396986873816]
We present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts.
We show that our model not only excels on popular benchmarks, but also achieves state-of-the-art performance in Chinese language modeling across diverse domains.
arXiv Detail & Related papers (2023-10-30T08:31:47Z)
- Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese [55.95225353842118]
We construct a large-scale dataset of image-text pairs in Chinese, where most data are retrieved from publicly available datasets.
We develop 5 Chinese CLIP models of multiple sizes, spanning from 77 to 958 million parameters.
Our experiments demonstrate that Chinese CLIP can achieve the state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN.
arXiv Detail & Related papers (2022-11-02T17:47:23Z)
- Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese [4.4681678689625715]
We analyse the effect of pre-training with monolingual data for a low-resource language.
We present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance.
We compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pre-trained multilingual BERT (mBERTu).
arXiv Detail & Related papers (2022-05-21T06:44:59Z)
- YACLC: A Chinese Learner Corpus with Multidimensional Annotation [45.304130762057945]
We construct a large-scale, multidimensional annotated Chinese learner corpus.
By analyzing the original sentences and annotations in the corpus, we found that YACLC has a considerable size and very high annotation quality.
arXiv Detail & Related papers (2021-12-30T13:07:08Z)
- Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training [59.571632468137075]
We find that many languages are under-represented in recent cross-lingual language models due to the limited vocabulary capacity.
We propose an algorithm VoCap to determine the desired vocabulary capacity of each language.
In order to address the issues, we propose k-NN-based target sampling to accelerate the expensive softmax.
arXiv Detail & Related papers (2021-09-15T14:04:16Z)
- Improving Cross-Lingual Reading Comprehension with Self-Training [62.73937175625953]
Current state-of-the-art models even surpass human performance on several benchmarks.
Previous works have revealed the abilities of pre-trained multilingual models for zero-shot cross-lingual reading comprehension.
This paper further utilizes unlabeled data to improve performance.
arXiv Detail & Related papers (2021-05-08T08:04:30Z)
- KR-BERT: A Small-Scale Korean-Specific Language Model [0.0]
We trained a Korean-specific model KR-BERT, utilizing a smaller vocabulary and dataset.
Our model performs comparably to, and in some cases better than, other existing pre-trained models while using a corpus about 1/10 of the size.
arXiv Detail & Related papers (2020-08-10T09:26:00Z)