Language ID in the Wild: Unexpected Challenges on the Path to a
Thousand-Language Web Text Corpus
- URL: http://arxiv.org/abs/2010.14571v2
- Date: Thu, 29 Oct 2020 15:18:35 GMT
- Title: Language ID in the Wild: Unexpected Challenges on the Path to a
Thousand-Language Web Text Corpus
- Authors: Isaac Caswell, Theresa Breiner, Daan van Esch, Ankur Bapna
- Abstract summary: We train LangID models on up to 1,629 languages with comparable quality on held-out test sets.
We find that human-judged LangID accuracy for web-crawl text corpora created using these models is only around 5% for many lower-resource languages.
We propose two classes of techniques to mitigate these errors: wordlist-based tunable-precision filters and transformer-based semi-supervised LangID models.
- Score: 15.807197703827818
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large text corpora are increasingly important for a wide variety of Natural
Language Processing (NLP) tasks, and automatic language identification (LangID)
is a core technology needed to collect such datasets in a multilingual context.
LangID is largely treated as solved in the literature, with models reported
that achieve over 90% average F1 on as many as 1,366 languages. We train LangID
models on up to 1,629 languages with comparable quality on held-out test sets,
but find that human-judged LangID accuracy for web-crawl text corpora created
using these models is only around 5% for many lower-resource languages,
suggesting a need for more robust evaluation. Further analysis revealed a
variety of error modes, arising from domain mismatch, class imbalance, language
similarity, and insufficiently expressive models. We propose two classes of
techniques to mitigate these errors: wordlist-based tunable-precision filters
(for which we release curated lists in about 500 languages) and
transformer-based semi-supervised LangID models, which increase median dataset
precision from 5.5% to 71.2%. These techniques enable us to create an initial
data set covering 100K or more relatively clean sentences in each of 500+
languages, paving the way towards a 1,000-language web text corpus.
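As a rough illustration of the first class of mitigation techniques named in the abstract, the sketch below shows what a wordlist-based tunable-precision filter could look like: a crawled sentence is kept for its predicted language only if the fraction of its tokens found in a curated wordlist for that language clears a threshold, and raising that threshold trades recall for precision. The file layout, function names, and threshold values are illustrative assumptions, not the authors' released code or wordlists.

```python
from typing import Dict, List, Set


def load_wordlist(path: str) -> Set[str]:
    """Load a curated wordlist (one lowercase token per line) for one language."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def in_wordlist_fraction(sentence: str, wordlist: Set[str]) -> float:
    """Fraction of whitespace-separated tokens that appear in the wordlist."""
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    return sum(token in wordlist for token in tokens) / len(tokens)


def filter_crawl(
    sentences: List[str],
    predicted_lang: str,
    wordlists: Dict[str, Set[str]],
    threshold: float = 0.5,  # tunable: higher values favor precision over recall
) -> List[str]:
    """Keep only sentences whose in-wordlist token fraction clears the threshold."""
    wordlist = wordlists[predicted_lang]
    return [s for s in sentences if in_wordlist_fraction(s, wordlist) >= threshold]


# Hypothetical usage for sentences a LangID model labeled as Bambara ("bm"):
# wordlists = {"bm": load_wordlist("wordlists/bm.txt")}
# clean = filter_crawl(crawled_sentences, "bm", wordlists, threshold=0.6)
```

Sweeping the threshold per language is what makes the precision tunable: for noisy low-resource crawls it can be raised until the retained sentences look acceptably clean, at the cost of discarding more data.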
Related papers
- A New Method for Cross-Lingual-based Semantic Role Labeling [5.992526851963307]
A deep learning algorithm is proposed to train semantic role labeling in English and Persian.
The results show significant improvements compared to Niksirt et al.'s model.
The development of cross-lingual methods for semantic role labeling holds promise.
arXiv Detail & Related papers (2024-08-28T16:06:12Z)
- GlossLM: Multilingual Pretraining for Low-Resource Interlinear Glossing [39.846419973203744]
We compile the largest existing corpus of interlinear glossed text (IGT) data from a variety of sources, covering over 450k examples across 1.8k languages.
We pretrain a large multilingual model on our corpus, outperforming SOTA models by up to 6.6%.
We will make our pretrained model and dataset available through Hugging Face, as well as provide access through a web interface for use in language documentation efforts.
arXiv Detail & Related papers (2024-03-11T03:21:15Z)
- TransliCo: A Contrastive Learning Framework to Address the Script Barrier in Multilingual Pretrained Language Models [50.40191599304911]
We propose TransliCo to fine-tune an mPLM by contrasting sentences in its training data and their transliterations in a unified script.
We show that the resulting model, Furina, outperforms the original Glot500-m on various zero-shot cross-lingual transfer tasks.
arXiv Detail & Related papers (2024-01-12T15:12:48Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- LIMIT: Language Identification, Misidentification, and Translation using
Hierarchical Models in 350+ Languages [27.675441924635294]
Current systems cannot accurately identify most of the world's 7000 languages.
We first compile a corpus, MCS-350, of 50K multilingual and parallel children's stories in 350+ languages.
We propose a novel misprediction-resolution hierarchical model, LIMIT, for language identification.
arXiv Detail & Related papers (2023-05-23T17:15:43Z)
- An Open Dataset and Model for Language Identification [84.15194457400253]
We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages (a short sketch of how macro-averaged F1 is computed appears after this list).
We make both the model and the dataset available to the research community.
arXiv Detail & Related papers (2023-05-23T08:43:42Z)
- OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource
Language Pair for Low-Resource Sentence Retrieval [91.76575626229824]
We present OneAligner, an alignment model specially designed for sentence retrieval tasks.
When trained with all language pairs of a large-scale parallel multilingual corpus (OPUS-100), this model achieves the state-of-the-art result.
We conclude through empirical results and analyses that the performance of the sentence alignment task depends mostly on the monolingual and parallel data size.
arXiv Detail & Related papers (2022-05-17T19:52:42Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language
Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- MTOP: A Comprehensive Multilingual Task-Oriented Semantic Parsing
Benchmark [31.91964553419665]
We present a new multilingual dataset, called MTOP, comprising 100k annotated utterances in 6 languages across 11 domains.
We achieve an average improvement of +6.3 points on Slot F1 for the two existing multilingual datasets, over the best results reported in their experiments.
We demonstrate strong zero-shot performance using pre-trained models combined with automatic translation and alignment, and a proposed distant supervision method to reduce the noise in slot label projection.
arXiv Detail & Related papers (2020-08-21T07:02:11Z)
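Several numbers above, such as the 90%+ average F1 quoted in the abstract and the 0.93 macro-average F1 in the language-identification entry, are macro-averages over languages. As a minimal sketch of what that means, the code below computes per-language F1 from per-language counts and then takes an unweighted mean, so every language counts equally regardless of how much data it has; the language codes and counts are made up for illustration.

```python
from typing import Dict, NamedTuple


class Counts(NamedTuple):
    tp: int  # sentences of this language correctly labeled as it
    fp: int  # sentences of other languages wrongly labeled as it
    fn: int  # sentences of this language labeled as something else


def f1(c: Counts) -> float:
    precision = c.tp / (c.tp + c.fp) if c.tp + c.fp else 0.0
    recall = c.tp / (c.tp + c.fn) if c.tp + c.fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


def macro_f1(per_language: Dict[str, Counts]) -> float:
    """Unweighted mean of per-language F1: every language is weighted equally."""
    return sum(f1(c) for c in per_language.values()) / len(per_language)


# Illustrative counts for three languages (not real evaluation numbers).
counts = {
    "en": Counts(tp=980, fp=40, fn=20),
    "sw": Counts(tp=450, fp=30, fn=50),
    "bm": Counts(tp=60, fp=200, fn=40),  # low-resource: precision suffers most
}
print(f"macro-average F1 = {macro_f1(counts):.3f}")
```

Because every language is weighted equally, strong performance on a few high-resource languages cannot mask weak per-language scores; even so, as the main paper argues, a high macro-F1 on held-out test sets can coexist with low human-judged precision on real web crawls.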
This list is automatically generated from the titles and abstracts of the papers on this site.