ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information
- URL: http://arxiv.org/abs/2411.19668v1
- Date: Fri, 29 Nov 2024 12:48:49 GMT
- Title: ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information
- Authors: Wanyue Zhang, Ziyong Li, Wen Yang, Chunlin Leng, Yinan Bai, Qianlong Du, Chengqing Zong, Jiajun Zhang,
- Abstract summary: We propose a new tool-chain called MDFG-tool for constructing large-scale and high-quality Chinese datasets with multi-dimensional and fine-grained information.
We release the largest, high-quality and fine-grained Chinese text ChineseWebText2.0, which consists of 3.8TB and each text is associated with a quality score, domain labels, a toxicity label and a toxicity score.
- Score: 29.57708536491853
- License:
- Abstract: During the development of large language models (LLMs), pre-training data play a critical role in shaping LLMs' capabilities. In recent years several large-scale and high-quality pre-training datasets have been released to accelerate the research of LLMs, including ChineseWebText1.0, C4, Pile, WanJuan, MAPCC and others. However, as LLMs continue to evolve, focus has increasingly shifted to domain-specific capabilities and safety concerns, making those previous coarse-grained texts insufficient for meeting training requirements. Furthermore, fine-grained information, such as quality, domain and toxicity, is becoming increasingly important in building powerful and reliable LLMs for various scenarios. To address these challenges, in this paper we propose a new tool-chain called MDFG-tool for constructing large-scale and high-quality Chinese datasets with multi-dimensional and fine-grained information. First, we employ manually crafted rules to discard explicit noisy texts from raw contents. Second, the quality evaluation model, domain classifier, and toxicity evaluation model are well-designed to assess the remaining cleaned data respectively. Finally, we integrate these three types of fine-grained information for each text. With this approach, we release the largest, high-quality and fine-grained Chinese text ChineseWebText2.0, which consists of 3.8TB and each text is associated with a quality score, domain labels, a toxicity label and a toxicity score, facilitating the LLM researchers to select data based on various types of fine-grained information. The data, codes and the tool-chain are available on this website https://github.com/CASIA-LM/ChineseWebText-2.0
Related papers
- mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data [71.352883755806]
Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space.
However, the limited labeled multimodal data often hinders embedding performance.
Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck.
arXiv Detail & Related papers (2025-02-12T15:03:33Z) - OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training [5.372706159579268]
The OpenCSG Chinese Corpus is a series of high-quality datasets specifically designed for Chinese language training.
This corpus includes Fineweb-edu-chinese, Fineweb-edu-chinese-v2, Cosmopedia-chinese, and Smoltalk-chinese.
The OpenCSG Chinese Corpus is characterized by its high-quality text, diverse coverage across domains, and scalable, reproducible data curation processes.
arXiv Detail & Related papers (2025-01-14T15:22:47Z) - RedPajama: an Open Dataset for Training Large Language Models [80.74772646989423]
We identify three core data-related challenges that must be addressed to advance open-source language models.
These include (1) transparency in model development, including the data curation process, (2) access to large quantities of high-quality data, and (3) availability of artifacts and metadata for dataset curation and analysis.
We release RedPajama-V1, an open reproduction of the LLaMA training dataset, and RedPajama-V2, a massive web-only dataset consisting of raw, unfiltered text data together with quality signals and metadata.
arXiv Detail & Related papers (2024-11-19T09:35:28Z) - Integrating Planning into Single-Turn Long-Form Text Generation [66.08871753377055]
We propose to use planning to generate long form content.
Our main novelty lies in a single auxiliary task that does not require multiple rounds of prompting or planning.
Our experiments demonstrate on two datasets from different domains, that LLMs fine-tuned with the auxiliary task generate higher quality documents.
arXiv Detail & Related papers (2024-10-08T17:02:40Z) - Open Artificial Knowledge [3.187724439601991]
We introduce the Open Artificial Knowledge (OAK) dataset, a large-scale resource of over 500 million tokens.
The OAK dataset aims to foster the development of more capable and aligned language models.
arXiv Detail & Related papers (2024-07-19T15:01:24Z) - CT-Eval: Benchmarking Chinese Text-to-Table Performance in Large Language Models [36.82189550072201]
Existing text-to-table datasets are typically oriented English.
Large language models (LLMs) have shown great success as general task solvers in multi-lingual settings.
We propose a Chinese text-to-table dataset, CT-Eval, to benchmark LLMs on this task.
arXiv Detail & Related papers (2024-05-20T16:58:02Z) - LongWanjuan: Towards Systematic Measurement for Long Text Quality [102.46517202896521]
LongWanjuan is a dataset specifically tailored to enhance the training of language models for long-text tasks with over 160B tokens.
In LongWanjuan, we categorize long texts into holistic, aggregated, and chaotic types, enabling a detailed analysis of long-text quality.
We devise a data mixture recipe that strategically balances different types of long texts within LongWanjuan, leading to significant improvements in model performance on long-text tasks.
arXiv Detail & Related papers (2024-02-21T07:27:18Z) - Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as textscLlama-2 and textscMistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z) - ChineseWebText: Large-scale High-quality Chinese Web Text Extracted with
Effective Evaluation Model [40.23569361268597]
We propose a new complete tool-chain EvalWeb to extract Chinese clean texts from noisy web data.
We release the largest and latest large-scale high-quality Chinese web text ChineseWebText, which consists of 1.42 TB and each text is associated with a quality score.
arXiv Detail & Related papers (2023-11-02T11:13:51Z) - CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large
Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed.
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.