esCorpius: A Massive Spanish Crawling Corpus
- URL: http://arxiv.org/abs/2206.15147v2
- Date: Fri, 1 Jul 2022 08:22:32 GMT
- Title: esCorpius: A Massive Spanish Crawling Corpus
- Authors: Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi
Armengol-Estapé, David Griol, Zoraida Callejas
- Abstract summary: esCorpius is a Spanish crawling corpus obtained from nearly 1 PB of Common Crawl data.
It is the most extensive corpus in Spanish with this level of quality in the extraction, purification and deduplication of web textual content.
- Score: 2.262838186547612
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, transformer-based models have led to significant
advances in language modelling for natural language processing. However, they
require vast amounts of data to be (pre-)trained, and there is a lack of
corpora in languages other than English. Recently, several initiatives have
presented multilingual datasets obtained from automatic web crawling. However,
the Spanish portions of these datasets present important shortcomings, as they
are either too small in comparison with other languages or of low quality as a
result of sub-optimal cleaning and deduplication. In this paper, we introduce
esCorpius, a Spanish crawling corpus obtained from nearly 1 PB of Common Crawl
data. It is the most extensive corpus in Spanish with this level of quality in
the extraction, purification and deduplication of web textual content. Our
data curation process involves a novel, highly parallel cleaning pipeline and
encompasses a series of deduplication mechanisms that together ensure the
integrity of both document and paragraph boundaries. Additionally, we retain
both the source web page URL and the WARC shard origin URL in order to comply
with EU regulations. esCorpius has been released under a CC BY-NC-ND 4.0
license and is available on HuggingFace.
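The cleaning and deduplication pipeline itself is not reproduced in this
summary. As an illustration only, the following is a minimal sketch of
corpus-wide paragraph-level deduplication that keeps document and paragraph
boundaries intact, assuming a simple normalise-and-hash strategy; the function
names are hypothetical, and the authors' actual parallel pipeline is
considerably more elaborate.

```python
import hashlib
import re

def normalize(paragraph: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies
    # of a paragraph hash to the same key.
    return re.sub(r"\s+", " ", paragraph.strip().lower())

def dedup_documents(documents):
    # Drop repeated paragraphs corpus-wide while keeping each document's
    # surviving paragraphs in their original order, so both paragraph
    # and document boundaries stay intact.
    seen = set()
    for doc in documents:
        kept = []
        for paragraph in doc.split("\n\n"):
            key = hashlib.sha1(normalize(paragraph).encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                kept.append(paragraph)
        if kept:
            yield "\n\n".join(kept)

# Example: the second document loses its duplicated paragraph.
print(list(dedup_documents(["Hola.\n\nAdios.", "Adios.\n\nGracias."])))
# -> ['Hola.\n\nAdios.', 'Gracias.']
```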
Related papers
- Cross-lingual Back-Parsing: Utterance Synthesis from Meaning Representation for Zero-Resource Semantic Parsing [6.074150063191985]
Cross-Lingual Back-Parsing is a novel data augmentation methodology designed to enhance cross-lingual transfer for semantic parsing.
Our methodology effectively performs cross-lingual data augmentation in challenging zero-resource settings.
arXiv Detail & Related papers (2024-10-01T08:53:38Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data [2.7843134136364265]
This paper proposes an efficient pipeline for creating a large-scale, diverse, multilingual corpus of PDF files from all over the Internet using Common Crawl.
We also share the CCpdf corpus in the form of an index of PDF files along with a script for downloading them, which produces a collection useful for language model pretraining.
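As a rough illustration of the retrieval such an index enables, the following
is a minimal sketch of fetching one record from Common Crawl via the public
CDX index and an HTTP range request; the crawl label and helper name are
assumptions for the example, and the CCpdf pipeline adds PDF-specific
filtering on top.

```python
import gzip
import json
import requests

CRAWL = "CC-MAIN-2023-06"  # assumption: any crawl snapshot label works
INDEX = f"https://index.commoncrawl.org/{CRAWL}-index"

def fetch_warc_record(url: str) -> bytes:
    # Look up the URL in the CDX index, then range-request only the
    # gzipped WARC record that holds its payload.
    hit = requests.get(INDEX, params={"url": url, "output": "json"},
                       timeout=60)
    rec = json.loads(hit.text.splitlines()[0])  # first capture of the URL
    start = int(rec["offset"])
    end = start + int(rec["length"]) - 1
    resp = requests.get(f"https://data.commoncrawl.org/{rec['filename']}",
                        headers={"Range": f"bytes={start}-{end}"},
                        timeout=60)
    # Each record is an independent gzip member: WARC headers, then the
    # archived HTTP response (headers and body, e.g. the PDF bytes).
    return gzip.decompress(resp.content)
```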
arXiv Detail & Related papers (2023-04-28T16:12:18Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- A Simple Multi-Modality Transfer Learning Baseline for Sign Language Translation [54.29679610921429]
Existing sign language datasets contain only about 10K-20K pairs of sign videos, gloss annotations and texts.
Data is thus a bottleneck for training effective sign language translation models.
The proposed simple multi-modality transfer-learning baseline surpasses the previous state-of-the-art results on two sign language translation benchmarks.
arXiv Detail & Related papers (2022-03-08T18:59:56Z)
- Towards a Cleaner Document-Oriented Multilingual Crawled Corpus [2.1028463367241033]
This paper builds on the existing multilingual web corpus OSCAR and its pipeline Ungoliant, which extracts and classifies data from Common Crawl at the line level.
We propose a set of improvements and automatic annotations in order to produce a new document-oriented version of OSCAR that could prove more suitable to pre-train large generative language models.
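Line-level language classification of this kind is typically done with an
off-the-shelf identifier. As a hedged sketch only, the snippet below filters
Spanish lines with fastText's public lid.176.bin model; the threshold and
helper name are assumptions, and Ungoliant's real pipeline layers further
heuristics on top of the classifier.

```python
import fasttext  # pip install fasttext; lid.176.bin from fasttext.cc

model = fasttext.load_model("lid.176.bin")

def keep_spanish_lines(text: str, threshold: float = 0.8):
    # Classify each non-empty line and keep confident Spanish ones.
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        labels, probs = model.predict(line)
        if labels[0] == "__label__es" and probs[0] >= threshold:
            yield line
```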
arXiv Detail & Related papers (2022-01-17T22:12:59Z)
- Spanish Biomedical Crawled Corpus: A Large, Diverse Dataset for Spanish Biomedical Language Models [0.05277024349608833]
CoWeSe is the result of a massive crawl of 3,000 Spanish domains executed in 2020.
The corpus is openly available and already preprocessed.
CoWeSe is an important resource for biomedical and health NLP in Spanish.
arXiv Detail & Related papers (2021-09-16T07:22:28Z)
- Documenting the English Colossal Clean Crawled Corpus [28.008953329187648]
This work provides the first documentation for the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020), a dataset created by applying a set of filters to a single snapshot of Common Crawl.
We begin with a high-level summary of the data, including distributions of where the text came from and when it was written.
We then give more detailed analysis on salient parts of this data, including the most frequent sources of text.
arXiv Detail & Related papers (2021-04-18T07:42:52Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
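Read that way, the self-teaching loss amounts to matching the model's
predictions on translated target-language text against detached soft
pseudo-labels. The following is a minimal PyTorch sketch under that reading,
not FILTER's actual implementation:

```python
import torch
import torch.nn.functional as F

def self_teaching_kl_loss(student_logits: torch.Tensor,
                          soft_pseudo_labels: torch.Tensor) -> torch.Tensor:
    # student_logits:     (batch, classes) raw scores on translated text
    # soft_pseudo_labels: (batch, classes) auto-generated probabilities,
    # detached so no gradient flows through the teacher pass.
    log_probs = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(log_probs, soft_pseudo_labels.detach(),
                    reduction="batchmean")
```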
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)