ChineseWebText: Large-scale High-quality Chinese Web Text Extracted with
Effective Evaluation Model
- URL: http://arxiv.org/abs/2311.01149v2
- Date: Fri, 10 Nov 2023 06:28:48 GMT
- Title: ChineseWebText: Large-scale High-quality Chinese Web Text Extracted with
Effective Evaluation Model
- Authors: Jianghao Chen, Pu Jian, Tengxiao Xi, Dongyi Yi, Qianlong Du, Chenglin
Ding, Guibo Zhu, Chengqing Zong, Jinqiao Wang, Jiajun Zhang
- Abstract summary: We propose EvalWeb, a new complete tool-chain for extracting clean Chinese texts from noisy web data.
We release ChineseWebText, the largest and latest large-scale high-quality Chinese web text dataset, which consists of 1.42 TB of data with a quality score attached to each text.
- Score: 40.23569361268597
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: During the development of large language models (LLMs), the scale and quality
of the pre-training data play a crucial role in shaping LLMs' capabilities. To
accelerate the research of LLMs, several large-scale datasets, such as C4 [1],
Pile [2], RefinedWeb [3] and WanJuan [4], have been released to the public.
However, most of the released corpora focus mainly on English, and there is
still a lack of a complete tool-chain for extracting clean texts from web data.
Furthermore, fine-grained information of the corpus, e.g. the quality of each
text, is missing. To address these challenges, we propose in this paper a new
complete tool-chain EvalWeb to extract Chinese clean texts from noisy web data.
First, similar to previous work, manually crafted rules are employed to discard
explicit noisy texts from the raw crawled web contents. Second, a well-designed
evaluation model is leveraged to assess the remaining relatively clean data,
and each text is assigned a specific quality score. Finally, we can easily
utilize an appropriate threshold to select the high-quality pre-training data
for Chinese. Using our proposed approach, we release ChineseWebText, the largest
and latest large-scale high-quality Chinese web text dataset, which consists of
1.42 TB of data with a quality score attached to each text, so that LLM
researchers can select data according to their desired quality thresholds. We
also release a much cleaner subset of 600 GB of Chinese data with quality
exceeding 90%.
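The three stages described in the abstract (rule-based filtering, model-based scoring, threshold selection) can be sketched roughly as follows. This is a minimal illustration, not the EvalWeb implementation: the function names, the length and CJK-ratio rules, and the toy scoring function are all assumptions made for the example, and the 0.9 default merely mirrors the "quality exceeding 90%" subset mentioned above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class ScoredText:
    text: str
    score: float  # assumed to lie in [0, 1]; higher means cleaner


def passes_rules(text: str, min_chars: int = 10, min_cjk_ratio: float = 0.3) -> bool:
    """Hypothetical stand-in for the manually crafted rules (stage 1).

    Drops very short texts and texts with few Chinese characters.
    """
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    # Count characters in the CJK Unified Ideographs block.
    cjk = sum(1 for ch in stripped if "\u4e00" <= ch <= "\u9fff")
    return cjk / len(stripped) >= min_cjk_ratio


def toy_quality_score(text: str) -> float:
    """Placeholder for the evaluation model (stage 2).

    Uses the share of distinct characters, so repetitive spam scores low.
    A real pipeline would call a trained quality classifier here.
    """
    stripped = text.strip()
    return len(set(stripped)) / len(stripped) if stripped else 0.0


def select_high_quality(
    texts: Iterable[str],
    score_fn: Callable[[str], float] = toy_quality_score,
    threshold: float = 0.9,
) -> List[ScoredText]:
    """Stage 3: keep rule-passing texts whose score reaches the threshold."""
    scored = [ScoredText(t, score_fn(t)) for t in texts if passes_rules(t)]
    return [s for s in scored if s.score >= threshold]
```

Because every surviving text keeps its score, a user of such a pipeline could re-run only the last stage with a different `threshold` to trade corpus size against cleanliness.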
Related papers
- OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text [112.60163342249682]
We introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset.
Our dataset is 15 times larger in scale than comparable datasets while maintaining good data quality.
We hope this could provide a solid data foundation for future multimodal model research.
arXiv Detail & Related papers (2024-06-12T17:01:04Z)
- CT-Eval: Benchmarking Chinese Text-to-Table Performance in Large Language Models [36.82189550072201]
Existing text-to-table datasets are typically English-oriented.
Large language models (LLMs) have shown great success as general task solvers in multi-lingual settings.
We propose a Chinese text-to-table dataset, CT-Eval, to benchmark LLMs on this task.
arXiv Detail & Related papers (2024-05-20T16:58:02Z)
- WanJuan-CC: A Safe and High-Quality Open-sourced English Webtext Dataset [30.73307556909938]
This paper presents WanJuan-CC, a safe and high-quality open-sourced English webtext dataset derived from Common Crawl data.
A comprehensive process was designed to handle Common Crawl data, including extraction, rule filtering, fuzzy deduplication, content safety filtering, and data quality filtering.
arXiv Detail & Related papers (2024-02-29T15:49:15Z)
- LongWanjuan: Towards Systematic Measurement for Long Text Quality [102.46517202896521]
LongWanjuan is a dataset specifically tailored to enhance the training of language models for long-text tasks with over 160B tokens.
In LongWanjuan, we categorize long texts into holistic, aggregated, and chaotic types, enabling a detailed analysis of long-text quality.
We devise a data mixture recipe that strategically balances different types of long texts within LongWanjuan, leading to significant improvements in model performance on long-text tasks.
arXiv Detail & Related papers (2024-02-21T07:27:18Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
- Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z)
- Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection [83.3580786484122]
We find that newspapers from larger schools, located in wealthier, educated, and urban ZIP codes are more likely to be classified as high quality.
We argue that privileging any corpus as high quality entails a language ideology.
arXiv Detail & Related papers (2022-01-25T17:20:04Z)
- HintedBT: Augmenting Back-Translation with Quality and Transliteration Hints [7.452359972117693]
Back-translation of target monolingual corpora is a widely used data augmentation strategy for neural machine translation (NMT).
We introduce HintedBT -- a family of techniques which provides hints (through tags) to the encoder and decoder.
We show that using these hints, both separately and together, significantly improves translation quality.
arXiv Detail & Related papers (2021-09-09T17:43:20Z)
- Documenting the English Colossal Clean Crawled Corpus [28.008953329187648]
This work provides the first documentation for the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020), a dataset created by applying a set of filters to a single snapshot of Common Crawl.
We begin with a high-level summary of the data, including distributions of where the text came from and when it was written.
We then give more detailed analysis on salient parts of this data, including the most frequent sources of text.
arXiv Detail & Related papers (2021-04-18T07:42:52Z)
- Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets [21.375943264243144]
We manually audit the quality of 205 language-specific corpora released with five major public datasets.
We find that at least 15 corpora are completely erroneous, and a significant fraction contains less than 50% sentences of acceptable quality.
We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses.
arXiv Detail & Related papers (2021-03-22T17:30:33Z)