Related papers: WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages

URL: http://arxiv.org/abs/2501.14506v1
Date: Fri, 24 Jan 2025 14:06:29 GMT
Title: WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages
Authors: Jia Yu, Fei Yuan, Rui Min, Jing Yu, Pei Chu, Jiayang Li, Wei Li, Ruijie Zhang, Zhenxiang Li, Zhifei Ren, Dong Zheng, Wenjian Zhang, Yan Teng, Lingyu Meng, ZhenJiang Jin, Jiantao Qiu, ShaSha Wang, Zhongying Tu, Dahua Lin, Yu Wang, Yu Qiao, Yanfeng Wang, Conghui He
Abstract summary: The paper introduces the open-source dataset WanJuanSiLu, designed to provide high-quality training corpora for low-resource languages. We have developed a systematic data processing framework tailored for low-resource languages.
Score: 62.1053122134059
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper introduces the open-source dataset WanJuanSiLu, designed to provide high-quality training corpora for low-resource languages, thereby advancing the research and development of multilingual models. To achieve this, we have developed a systematic data processing framework tailored for low-resource languages. This framework encompasses key stages such as data extraction, corpus cleaning, content deduplication, security filtering, quality evaluation, and theme classification. Through the implementation of this framework, we have significantly improved both the quality and security of the dataset, while maintaining its linguistic diversity. As of now, data for all five languages have been fully open-sourced. The dataset can be accessed at https://opendatalab.com/applyMultilingualCorpus, and the GitHub repository is available at https://github.com/opendatalab/WanJuan3.0

Related papers

SenWiCh: Sense-Annotation of Low-Resource Languages for WiC using Hybrid Methods [1.2091341579150698]
We release datasets of sentences containing polysemous words across ten low-resource languages. To facilitate dataset creation, the paper presents a demonstrably beneficial semi-automatic annotation method. Results highlight the importance of targeted dataset creation and evaluation for effective polysemy disambiguation.
arXiv Detail & Related papers (2025-05-29T17:48:08Z)
Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models [52.22235443948351]
High-quality multilingual training data is essential for effectively pretraining large language models (LLMs). Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale. JQL distills LLMs' annotation capabilities into lightweight annotators based on pretrained multilingual embeddings.
arXiv Detail & Related papers (2025-05-28T11:06:54Z)
Matina: A Large-Scale 73B Token Persian Text Corpus [1.396406461086233]
Existing Persian datasets are typically small and lack content diversity, consisting mainly of weblogs and news articles. Matina corpus is a new Persian dataset of 72.9B tokens, carefully preprocessed and deduplicated to ensure high data quality.
arXiv Detail & Related papers (2025-02-13T11:22:19Z)
Open the Data! Chuvash Datasets [50.59120569845975]
We introduce four comprehensive datasets for the Chuvash language. These datasets include a monolingual dataset, a parallel dataset with Russian, a parallel dataset with English, and an audio dataset.
arXiv Detail & Related papers (2024-05-31T07:51:19Z)
Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets. We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
GlotLID: Language Identification for Low-Resource Languages [51.38634652914054]
GlotLID-M is an LID model that satisfies the desiderata of wide coverage, reliability and efficiency. It identifies 1665 languages, a large increase in coverage compared to prior work.
arXiv Detail & Related papers (2023-10-24T23:45:57Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
Learning Translation Quality Evaluation on Low Resource Languages from Large Language Models [4.168157981135698]
We show how knowledge can be distilled from Large Language Models (LLMs) to improve upon learned metrics without requiring human annotators. We show that the performance of a BLEURT-like model on lower resource languages can be improved in this way.
arXiv Detail & Related papers (2023-02-07T14:35:35Z)
Beyond Counting Datasets: A Survey of Multilingual Dataset Construction and Necessary Resources [38.814057529254846]
We examine the characteristics of 156 publicly available NLP datasets. We survey language-proficient NLP researchers and crowd workers per language. We identify strategies for collecting high-quality multilingual data on the Mechanical Turk platform.
arXiv Detail & Related papers (2022-11-28T18:54:33Z)
Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages [62.730361829175415]
MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge. It focuses on ad hoc retrieval across 18 different languages. Our goal is to spur research that will improve retrieval across a continuum of languages.
arXiv Detail & Related papers (2022-10-18T16:47:18Z)
From Masked Language Modeling to Translation: Non-English Auxiliary Tasks Improve Zero-shot Spoken Language Understanding [24.149299722716155]
We introduce xSID, a new benchmark for cross-lingual Slot and Intent Detection in 13 languages from 6 language families, including a very low-resource dialect. We propose a joint learning approach, with English SLU training data and non-English auxiliary tasks from raw text, syntax and translation for transfer. Our results show that jointly learning the main tasks with masked language modeling is effective for slots, while machine translation transfer works best for intent classification.
arXiv Detail & Related papers (2021-05-15T23:51:11Z)

This list is automatically generated from the titles and abstracts of the papers on this site.