OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training
- URL: http://arxiv.org/abs/2501.08197v1
- Date: Tue, 14 Jan 2025 15:22:47 GMT
- Title: OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training
- Authors: Yijiong Yu, Ziyun Dai, Zekun Wang, Wei Wang, Ran Chen, Ji Pei
- Abstract summary: The OpenCSG Chinese Corpus is a series of high-quality datasets specifically designed for Chinese language training.
This corpus includes Fineweb-edu-chinese, Fineweb-edu-chinese-v2, Cosmopedia-chinese, and Smoltalk-chinese.
The OpenCSG Chinese Corpus is characterized by its high-quality text, diverse coverage across domains, and scalable, reproducible data curation processes.
- Abstract: Large language models (LLMs) have demonstrated remarkable capabilities, but their success heavily relies on the quality of pretraining corpora. For Chinese LLMs, the scarcity of high-quality Chinese datasets presents a significant challenge, often limiting their performance. To address this issue, we propose the OpenCSG Chinese Corpus, a series of high-quality datasets specifically designed for LLM pretraining, post-training, and fine-tuning. This corpus includes Fineweb-edu-chinese, Fineweb-edu-chinese-v2, Cosmopedia-chinese, and Smoltalk-chinese, each with distinct characteristics: Fineweb-edu datasets focus on filtered, high-quality content derived from diverse Chinese web sources; Cosmopedia-chinese provides synthetic, textbook-style data for knowledge-intensive training; and Smoltalk-chinese emphasizes stylistic and diverse chat-format data. The OpenCSG Chinese Corpus is characterized by its high-quality text, diverse coverage across domains, and scalable, reproducible data curation processes. Additionally, we conducted extensive experimental analyses, including evaluations on smaller parameter models, which demonstrated significant performance improvements in tasks such as C-Eval, showcasing the effectiveness of the corpus for training Chinese LLMs.
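The curation recipe behind Fineweb-edu-style datasets is essentially score-and-filter: a quality classifier assigns each web document an educational-quality score, and only documents above a threshold are kept. The sketch below illustrates that pattern in Python; the scorer, score scale, and threshold are illustrative assumptions, not the exact pipeline used for the OpenCSG corpus.

```python
# Minimal sketch of a score-and-filter curation step in the spirit of
# Fineweb-edu-chinese: score each document for educational quality and keep
# only those above a threshold. The scorer and threshold are illustrative
# assumptions, not the OpenCSG pipeline itself.
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Document:
    text: str
    url: str


def filter_by_quality(
    docs: Iterable[Document],
    score_fn: Callable[[str], float],   # e.g. an educational-quality classifier
    threshold: float = 3.0,             # assumed cutoff on a 0-5 score scale
) -> Iterator[Document]:
    """Yield only documents whose quality score clears the threshold."""
    for doc in docs:
        if score_fn(doc.text) >= threshold:
            yield doc


if __name__ == "__main__":
    # Stand-in scorer; a real pipeline would call a trained classifier here.
    def toy_score(text: str) -> float:
        return 5.0 if "教科书" in text else 1.0

    corpus = [Document("教科书式的讲解……", "https://example.com/a"),
              Document("广告内容……", "https://example.com/b")]
    kept = list(filter_by_quality(corpus, toy_score))
    print(len(kept))  # -> 1
```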
Related papers
- ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information [29.57708536491853]
We propose a new tool-chain called MDFG-tool for constructing large-scale and high-quality Chinese datasets with multi-dimensional and fine-grained information.
We release ChineseWebText 2.0, the largest high-quality and fine-grained Chinese web text dataset, comprising 3.8 TB of text in which each document is associated with a quality score, domain labels, a toxicity label, and a toxicity score.
arXiv Detail & Related papers (2024-11-29T12:48:49Z)
- CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation [49.41531871253317]
We present a new Chinese Vision-Language Understanding Evaluation benchmark dataset.
The selection of object categories and images is entirely driven by Chinese native speakers.
We find that fine-tuning on Chinese culture-related VL datasets effectively enhances VLMs' understanding of Chinese culture.
arXiv Detail & Related papers (2024-07-01T08:35:37Z)
- SCAR: Efficient Instruction-Tuning for Large Language Models via Style Consistency-Aware Response Ranking [56.93151679231602]
This research identifies two key stylistic elements in responses: linguistic form and instructional surprisal.
We introduce Style Consistency-Aware Response Ranking (SCAR), which automatically prioritizes instruction-response pairs in the training set based on their response stylistic consistency.
arXiv Detail & Related papers (2024-06-16T10:10:37Z)
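SCAR's core idea as summarized above, selecting instruction-response pairs whose responses are stylistically consistent, can be pictured as a rank-and-take-top-k step over the training set. The sketch below is an illustrative reading of that idea, not the authors' implementation; the consistency scorer is a stand-in.

```python
# Illustrative sketch of style-consistency-aware ranking in the spirit of SCAR:
# score each instruction-response pair for stylistic consistency, then keep the
# top-k pairs for fine-tuning. The scorer is a stand-in, not the paper's model.
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (instruction, response)


def select_top_k(
    pairs: List[Pair],
    consistency_fn: Callable[[str, str], float],  # higher = more stylistically consistent
    k: int,
) -> List[Pair]:
    ranked = sorted(pairs, key=lambda p: consistency_fn(p[0], p[1]), reverse=True)
    return ranked[:k]
```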
- COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning [37.843051974342124]
We introduce COIG-CQIA, a new Chinese instruction tuning dataset derived from various real-world resources and subjected to rigorous human verification.
We conduct extensive experiments on COIG-CQIA and compare models trained on it with strong baseline models and datasets.
The experimental results show that models trained on COIG-CQIA achieve highly competitive performance in diverse benchmarks.
arXiv Detail & Related papers (2024-03-26T19:24:18Z)
- AlignBench: Benchmarking Chinese Alignment of Large Language Models [99.24597941555277]
We introduce AlignBench, a comprehensive benchmark for evaluating Chinese Large Language Models' alignment.
We design a human-in-the-loop data curation pipeline containing eight main categories, 683 queries rooted in real scenarios, and corresponding human-verified references.
For automatic evaluation, our benchmark employs a rule-calibrated, multi-dimensional LLM-as-Judge approach (Zheng et al., 2023) with Chain-of-Thought to generate explanations and final ratings.
arXiv Detail & Related papers (2023-11-30T17:41:30Z)
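A rule-calibrated LLM-as-Judge setup of the kind described above typically prompts a strong judge model with the query, a human-verified reference, and the model answer, and asks for a Chain-of-Thought explanation followed by per-dimension scores. The template below is an illustrative sketch of that pattern; it is not AlignBench's actual prompt, and the dimensions are assumed for illustration.

```python
# Illustrative sketch of an LLM-as-Judge prompt in the style described for
# AlignBench: the judge sees the query, a human-verified reference, and the
# model's answer, reasons step by step, and returns multi-dimensional scores.
# The wording and dimensions are assumptions, not the benchmark's real prompt.
JUDGE_PROMPT = """You are an impartial judge of Chinese LLM responses.

[Query]
{query}

[Reference answer (human verified)]
{reference}

[Model answer]
{answer}

First explain your reasoning step by step (Chain-of-Thought), then rate the
model answer on each dimension from 1 to 10:
- factual correctness
- instruction following
- clarity
Finally give an overall rating from 1 to 10."""


def build_judge_prompt(query: str, reference: str, answer: str) -> str:
    return JUDGE_PROMPT.format(query=query, reference=reference, answer=answer)
```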
- ChineseWebText: Large-scale High-quality Chinese Web Text Extracted with Effective Evaluation Model [40.23569361268597]
We propose EvalWeb, a new complete tool-chain to extract clean Chinese texts from noisy web data.
We release ChineseWebText, the largest and latest large-scale high-quality Chinese web text dataset, consisting of 1.42 TB of text in which each document is associated with a quality score.
arXiv Detail & Related papers (2023-11-02T11:13:51Z)
- Efficient Cross-Lingual Transfer for Chinese Stable Diffusion with Images as Pivots [80.32906566894171]
We propose IAP, a simple but effective method to transfer English Stable Diffusion into Chinese.
IAP efficiently aligns Chinese, English, and visual semantics in CLIP's embedding space.
Experimental results show that our method outperforms several strong Chinese diffusion models with only 5%-10% of the training data.
arXiv Detail & Related papers (2023-05-19T09:20:27Z)
- Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca [23.00353889531171]
We propose a method to augment LLaMA with capabilities for understanding and generating Chinese text.
We incorporate secondary pre-training using Chinese data and fine-tune the model with Chinese instruction datasets.
Results on the C-Eval dataset show performance competitive with models several times the size of ours.
arXiv Detail & Related papers (2023-04-17T11:39:53Z)
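The general recipe for adapting an English-centric LLaMA to Chinese, as summarized above, is to extend the tokenizer's vocabulary, resize the embedding matrix, and then continue pre-training on Chinese text before instruction fine-tuning. The snippet below sketches that recipe with the Hugging Face transformers API; the model name and token list are placeholders, not the paper's exact setup.

```python
# Sketch of the general recipe for adapting an English-centric LLM to Chinese:
# extend the tokenizer with Chinese tokens, resize embeddings, then continue
# pre-training on Chinese data. Model name and tokens are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Add Chinese tokens missing from the original vocabulary (illustrative list).
new_tokens = ["你好", "语言模型", "数据集"]
num_added = tokenizer.add_tokens(new_tokens)

# Resize the embedding matrix so the new tokens get trainable embeddings.
model.resize_token_embeddings(len(tokenizer))

# From here, continue pre-training on Chinese text (e.g. with the Trainer API),
# then fine-tune on Chinese instruction data.
print(f"Added {num_added} tokens; vocab size is now {len(tokenizer)}")
```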
- Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese [55.95225353842118]
We construct a large-scale dataset of image-text pairs in Chinese, where most data are retrieved from publicly available datasets.
We develop 5 Chinese CLIP models of multiple sizes, spanning from 77 to 958 million parameters.
Our experiments demonstrate that Chinese CLIP can achieve the state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN.
arXiv Detail & Related papers (2022-11-02T17:47:23Z)
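Contrastive vision-language pretraining of the CLIP family, as in the entry above, trains an image encoder and a text encoder so that matched image-text pairs score higher than all mismatched pairs in a batch, via a symmetric cross-entropy over the similarity matrix. The PyTorch-style sketch below shows that standard loss; the encoders producing the embeddings are assumed to be defined elsewhere.

```python
# Sketch of the symmetric contrastive (InfoNCE) loss used in CLIP-style
# pretraining: matched image-text pairs lie on the diagonal of the
# similarity matrix and serve as the targets for both directions.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize embeddings so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix, scaled by temperature.
    logits = image_emb @ text_emb.t() / temperature

    # The i-th image matches the i-th text in the batch.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image
    return (loss_i2t + loss_t2i) / 2
```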
- CSL: A Large-scale Chinese Scientific Literature Dataset [30.606855209042603]
We present CSL, a large-scale Chinese Scientific Literature dataset, which contains the titles, abstracts, keywords and academic fields of 396k papers.
To our knowledge, CSL is the first scientific document dataset in Chinese. Its semi-structured data also provides natural annotations from which many supervised NLP tasks can be constructed.
We present a benchmark to evaluate the performance of models across scientific domain tasks, i.e., summarization, keyword generation and text classification.
arXiv Detail & Related papers (2022-09-12T06:10:47Z)
- Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework [99.38817546900405]
This paper presents a large-scale Chinese cross-modal dataset for benchmarking different multi-modal pre-training methods.
We release a Large-Scale Chinese Cross-modal dataset named Wukong, containing 100 million Chinese image-text pairs from the web.
arXiv Detail & Related papers (2022-02-14T14:37:15Z)