ChineseWebText: Large-scale High-quality Chinese Web Text Extracted with
Effective Evaluation Model
- URL: http://arxiv.org/abs/2311.01149v2
- Date: Fri, 10 Nov 2023 06:28:48 GMT
- Authors: Jianghao Chen, Pu Jian, Tengxiao Xi, Dongyi Yi, Qianlong Du, Chenglin
Ding, Guibo Zhu, Chengqing Zong, Jinqiao Wang, Jiajun Zhang
- Abstract summary: We propose a new complete tool-chain EvalWeb to extract Chinese clean texts from noisy web data.
We release the largest and latest large-scale high-quality Chinese web text dataset, ChineseWebText, which consists of 1.42 TB of data in which each text is associated with a quality score.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: During the development of large language models (LLMs), the scale and quality
of the pre-training data play a crucial role in shaping LLMs' capabilities. To
accelerate the research of LLMs, several large-scale datasets, such as C4 [1],
Pile [2], RefinedWeb [3] and WanJuan [4], have been released to the public.
However, most of the released corpora focus mainly on English, and there is
still a lack of a complete tool-chain for extracting clean texts from web data.
Furthermore, fine-grained information about the corpus, e.g., the quality of
each text, is missing. To address these challenges, we propose in this paper a new
complete tool-chain EvalWeb to extract Chinese clean texts from noisy web data.
First, similar to previous work, manually crafted rules are employed to discard
explicit noisy texts from the raw crawled web contents. Second, a well-designed
evaluation model is leveraged to assess the remaining relatively clean data,
and each text is assigned a specific quality score. Finally, we can easily
utilize an appropriate threshold to select the high-quality pre-training data
for Chinese. Using our proposed approach, we release the largest and latest
large-scale high-quality Chinese web text dataset, ChineseWebText, which
consists of 1.42 TB of data in which each text is associated with a quality
score, enabling LLM researchers to choose data according to the desired
quality thresholds. We also release a much cleaner 600 GB subset of Chinese
data with quality exceeding 90%.
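The three-stage pipeline described above (rule-based filtering, model-based quality scoring, threshold selection) could be sketched roughly as follows. The specific rules, the dummy scorer, and the numeric thresholds are illustrative assumptions, not the paper's actual implementation:

```python
import re

def rule_filter(text: str) -> bool:
    """Stage 1 (illustrative): discard explicitly noisy texts with
    hand-crafted rules. The cutoffs below are assumptions."""
    if len(text) < 50:                       # too short to be useful
        return False
    if re.search(r"(.)\1{20,}", text):       # long run of one repeated char
        return False
    symbols = sum(not c.isalnum() and not c.isspace() for c in text)
    return symbols / len(text) < 0.4         # mostly symbols -> noise

def select_high_quality(texts, score_fn, threshold=0.9):
    """Stages 2-3: score each surviving text with an evaluation model
    (score_fn is a stand-in) and keep texts above the quality threshold."""
    clean = [t for t in texts if rule_filter(t)]
    scored = [(t, score_fn(t)) for t in clean]
    return [(t, s) for t, s in scored if s >= threshold]

# Toy usage with a dummy scorer standing in for the evaluation model.
docs = ["!!!!" * 30, "x", "一段较长的中文网页正文示例。" * 10]
kept = select_high_quality(docs, score_fn=lambda t: 0.95)
```

Keeping the per-text score alongside the text, as in `kept`, is what lets downstream users re-filter at any threshold they choose rather than being locked to one cutoff.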
Related papers
- WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages [62.1053122134059]
The paper introduces the open-source dataset WanJuanSiLu, designed to provide high-quality training corpora for low-resource languages.
We have developed a systematic data processing framework tailored for low-resource languages.
arXiv Detail & Related papers (2025-01-24T14:06:29Z)
- OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training [5.372706159579268]
The OpenCSG Chinese Corpus is a series of high-quality datasets specifically designed for Chinese language training.
This corpus includes Fineweb-edu-chinese, Fineweb-edu-chinese-v2, Cosmopedia-chinese, and Smoltalk-chinese.
The OpenCSG Chinese Corpus is characterized by its high-quality text, diverse coverage across domains, and scalable, reproducible data curation processes.
arXiv Detail & Related papers (2025-01-14T15:22:47Z)
- ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information [29.57708536491853]
We propose a new tool-chain called MDFG-tool for constructing large-scale and high-quality Chinese datasets with multi-dimensional and fine-grained information.
We release the largest high-quality and fine-grained Chinese text dataset, ChineseWebText2.0, which consists of 3.8 TB of data in which each text is associated with a quality score, domain labels, a toxicity label, and a toxicity score.
arXiv Detail & Related papers (2024-11-29T12:48:49Z)
- CT-Eval: Benchmarking Chinese Text-to-Table Performance in Large Language Models [36.82189550072201]
Existing text-to-table datasets are typically oriented toward English.
Large language models (LLMs) have shown great success as general task solvers in multi-lingual settings.
We propose a Chinese text-to-table dataset, CT-Eval, to benchmark LLMs on this task.
arXiv Detail & Related papers (2024-05-20T16:58:02Z)
- WanJuan-CC: A Safe and High-Quality Open-sourced English Webtext Dataset [30.73307556909938]
This paper presents WanJuan-CC, a safe and high-quality open-sourced English webtext dataset derived from Common Crawl data.
A comprehensive process was designed to handle Common Crawl data, including extraction, rule filtering, fuzzy deduplication, content safety filtering, and data quality filtering.
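Of the stages listed above, fuzzy deduplication can be illustrated with a minimal character-shingle Jaccard sketch. This is a stand-in for the MinHash/LSH approaches production pipelines typically use; the shingle size and similarity threshold here are assumptions:

```python
def shingles(text: str, k: int = 5) -> set:
    """Character k-gram shingles; k=5 is an assumed choice."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def fuzzy_dedup(texts, threshold=0.8):
    """Greedy near-duplicate removal: keep a text only if its Jaccard
    similarity to every previously kept text stays below the threshold.
    Real pipelines use MinHash/LSH to avoid these O(n^2) comparisons."""
    kept, kept_shingles = [], []
    for t in texts:
        s = shingles(t)
        if all(len(s & ks) / len(s | ks) < threshold for ks in kept_shingles):
            kept.append(t)
            kept_shingles.append(s)
    return kept

# Toy usage: the second document is a near-duplicate of the first.
docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog!!",
    "an entirely unrelated piece of text",
]
deduped = fuzzy_dedup(docs)
```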
arXiv Detail & Related papers (2024-02-29T15:49:15Z)
- LongWanjuan: Towards Systematic Measurement for Long Text Quality [102.46517202896521]
LongWanjuan is a dataset specifically tailored to enhance the training of language models for long-text tasks with over 160B tokens.
In LongWanjuan, we categorize long texts into holistic, aggregated, and chaotic types, enabling a detailed analysis of long-text quality.
We devise a data mixture recipe that strategically balances different types of long texts within LongWanjuan, leading to significant improvements in model performance on long-text tasks.
arXiv Detail & Related papers (2024-02-21T07:27:18Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
- Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z)
- Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection [83.3580786484122]
We find that newspapers from larger schools, located in wealthier, educated, and urban ZIP codes are more likely to be classified as high quality.
We argue that privileging any corpus as high quality entails a language ideology.
arXiv Detail & Related papers (2022-01-25T17:20:04Z)
- Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets [21.375943264243144]
We manually audit the quality of 205 language-specific corpora released with five major public datasets.
We find that at least 15 corpora are completely erroneous, and a significant fraction contains less than 50% sentences of acceptable quality.
We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses.
arXiv Detail & Related papers (2021-03-22T17:30:33Z)
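The audit above combines human judgements with automatic analyses, and notes that many corpus problems are detectable even by non-speakers. One crude automatic check in that spirit (an assumed example, not the paper's actual method) is the fraction of letters belonging to the expected script:

```python
import unicodedata

def script_ratio(text: str, script_prefix: str = "CJK") -> float:
    """Fraction of letter characters whose Unicode character name starts
    with the given script prefix (e.g. 'CJK' for Chinese ideographs).
    A very low ratio in a supposedly Chinese corpus flags likely noise."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    hits = sum(unicodedata.name(c, "").startswith(script_prefix)
               for c in letters)
    return hits / len(letters)
```

For example, a "Chinese" document consisting entirely of Latin text would score 0.0 and could be flagged without anyone reading it.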
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences of its use.