GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages
- URL: http://arxiv.org/abs/2410.23825v1
- Date: Thu, 31 Oct 2024 11:14:12 GMT
- Title: GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages
- Authors: Amir Hossein Kargaran, François Yvon, Hinrich Schütze,
- Abstract summary: GlotCC is a clean, document-level, 2TB general domain corpus derived from CommonCrawl.
We make GlotCC and the system used to generate it available to the research community.
- Score: 53.56700754408902
- License:
- Abstract: The need for large text corpora has increased with the advent of pretrained language models and, in particular, the discovery of scaling laws for these models. Most available corpora have sufficient data only for languages with large dominant communities. However, there is no corpus available that (i) covers a wide range of minority languages; (ii) is generated by an open-source reproducible pipeline; and (iii) is rigorously cleaned from noise, making it trustworthy to use. We present GlotCC, a clean, document-level, 2TB general domain corpus derived from CommonCrawl, covering more than 1000 languages. We make GlotCC and the system used to generate it - including the pipeline, language identification model, and filters - available to the research community. Corpus v. 1.0 https://huggingface.co/datasets/cis-lmu/GlotCC-v1, Pipeline v. 3.0 https://github.com/cisnlp/GlotCC.
Related papers
- GlotLID: Language Identification for Low-Resource Languages [51.38634652914054]
GlotLID-M is an LID model that satisfies the desiderata of wide coverage, reliability and efficiency.
It identifies 1665 languages, a large increase in coverage compared to prior work.
arXiv Detail & Related papers (2023-10-24T23:45:57Z) - Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks on evaluation code generation models: MBXP and Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z) - esCorpius: A Massive Spanish Crawling Corpus [2.262838186547612]
esCorpius is a Spanish crawling corpus obtained from near 1 Pb of Common Crawl data.
It is the most extensive corpus in Spanish with this level of quality in the extraction, purification and deduplication of web textual content.
arXiv Detail & Related papers (2022-06-30T09:29:18Z) - mGPT: Few-Shot Learners Go Multilingual [1.4354798873010843]
This paper introduces two autoregressive GPT-like models with 1.3 billion and 13 billion parameters trained on 60 languages.
We reproduce the GPT-3 architecture using GPT-2 sources and the sparse attention mechanism.
The resulting models show performance on par with the recently released XGLM models by Facebook.
arXiv Detail & Related papers (2022-04-15T13:02:33Z) - A Systematic Evaluation of Large Language Models of Code [88.34057460577957]
Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions.
The current state-of-the-art code LMs are not publicly available, leaving many questions about their model and data design decisions.
Although Codex is not open-source, we find that existing open-source models do achieve close results in some programming languages.
We release a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, which was trained on 249GB of code across 12 programming languages on a single machine.
arXiv Detail & Related papers (2022-02-26T15:53:55Z) - Cross-Domain Deep Code Search with Meta Learning [14.618183588410194]
We propose CroCS, a novel approach for domain-specific code search.
CroCS employs a transfer learning framework where an initial program representation model is pre-trained on a large corpus of common programming languages.
arXiv Detail & Related papers (2022-01-01T09:00:48Z) - XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z) - CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under CC0 license and free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.