Language ID in the Wild: Unexpected Challenges on the Path to a
Thousand-Language Web Text Corpus
- URL: http://arxiv.org/abs/2010.14571v2
- Date: Thu, 29 Oct 2020 15:18:35 GMT
- Title: Language ID in the Wild: Unexpected Challenges on the Path to a
Thousand-Language Web Text Corpus
- Authors: Isaac Caswell, Theresa Breiner, Daan van Esch, Ankur Bapna
- Abstract summary: We train LangID models on up to 1,629 languages with comparable quality on held-out test sets.
We find that human-judged LangID accuracy for web-crawl text corpora created using these models is only around 5% for many lower-resource languages.
We propose two classes of techniques to mitigate these errors: wordlist-based tunable-precision filters and transformer-based semi-supervised LangID models.
- Score: 15.807197703827818
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large text corpora are increasingly important for a wide variety of Natural
Language Processing (NLP) tasks, and automatic language identification (LangID)
is a core technology needed to collect such datasets in a multilingual context.
LangID is largely treated as solved in the literature, with models reported
that achieve over 90% average F1 on as many as 1,366 languages. We train LangID
models on up to 1,629 languages with comparable quality on held-out test sets,
but find that human-judged LangID accuracy for web-crawl text corpora created
using these models is only around 5% for many lower-resource languages,
suggesting a need for more robust evaluation. Further analysis revealed a
variety of error modes, arising from domain mismatch, class imbalance, language
similarity, and insufficiently expressive models. We propose two classes of
techniques to mitigate these errors: wordlist-based tunable-precision filters
(for which we release curated lists in about 500 languages) and
transformer-based semi-supervised LangID models, which increase median dataset
precision from 5.5% to 71.2%. These techniques enable us to create an initial
data set covering 100K or more relatively clean sentences in each of 500+
languages, paving the way towards a 1,000-language web text corpus.
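The paper does not spell out the filter mechanics in this abstract, so the following is a minimal sketch of how a wordlist-based, tunable-precision filter of the kind described could look. The function names, the whitespace tokenization, and the threshold value are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a wordlist-based, tunable-precision LangID filter.
# Assumptions (not from the paper): one curated wordlist per language,
# whitespace tokenization, and a single threshold parameter that trades
# recall for precision. All names are illustrative.

from typing import Iterable, List, Set


def load_wordlist(path: str) -> Set[str]:
    """Load a curated wordlist (one lowercase token per line)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def in_language_fraction(sentence: str, wordlist: Set[str]) -> float:
    """Fraction of whitespace tokens that appear in the wordlist."""
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in wordlist) / len(tokens)


def filter_corpus(sentences: Iterable[str],
                  wordlist: Set[str],
                  threshold: float = 0.2) -> List[str]:
    """Keep sentences whose in-language token fraction meets the threshold.

    Raising the threshold is the precision knob: fewer out-of-language
    sentences survive, at the cost of discarding more genuine text.
    """
    return [s for s in sentences
            if in_language_fraction(s, wordlist) >= threshold]


if __name__ == "__main__":
    # Toy usage with an inline wordlist standing in for load_wordlist(path).
    wordlist = {"the", "cat", "sat", "on", "mat"}
    crawl = ["the cat sat on the mat", "lorem ipsum dolor sit amet"]
    print(filter_corpus(crawl, wordlist, threshold=0.5))
```

The single threshold is what makes the filter "tunable": sweeping it over held-out data lets one pick an operating point that favors precision for noisy, low-resource crawls.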
Related papers
- A New Method for Cross-Lingual-based Semantic Role Labeling [5.992526851963307]
A deep learning algorithm is proposed to train semantic role labeling models for English and Persian.
The results show significant improvements compared to Niksirt et al.'s model.
The development of cross-lingual methods for semantic role labeling holds promise.
arXiv Detail & Related papers (2024-08-28T16:06:12Z) - GlossLM: A Massively Multilingual Corpus and Pretrained Model for Interlinear Glossed Text [39.846419973203744]
We compile the largest existing corpus of interlinear glossed text (IGT) data from a variety of sources, covering over 450k examples across 1.8k languages.
We normalize much of our data to follow a standard set of labels across languages.
As many languages lack sufficient monolingual data, we pretrain a large multilingual model on our corpus.
We demonstrate the utility of this model by finetuning it on monolingual corpora, outperforming SOTA models by up to 6.6%.
arXiv Detail & Related papers (2024-03-11T03:21:15Z) - TransliCo: A Contrastive Learning Framework to Address the Script Barrier in Multilingual Pretrained Language Models [50.40191599304911]
We propose TransliCo to fine-tune an mPLM by contrasting sentences in its training data and their transliterations in a unified script.
We show that the resulting fine-tuned model, Furina, outperforms the original Glot500-m on various zero-shot cross-lingual transfer tasks.
arXiv Detail & Related papers (2024-01-12T15:12:48Z) - Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z) - LIMIT: Language Identification, Misidentification, and Translation using
Hierarchical Models in 350+ Languages [27.675441924635294]
Current systems cannot accurately identify most of the world's 7000 languages.
We first compile a corpus, MCS-350, of 50K multilingual and parallel children's stories in 350+ languages.
We propose a novel misprediction-resolution hierarchical model, LIMIT, for language identification.
arXiv Detail & Related papers (2023-05-23T17:15:43Z) - An Open Dataset and Model for Language Identification [84.15194457400253]
We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages (a sketch of how such metrics are computed appears after this list).
We make both the model and the dataset available to the research community.
arXiv Detail & Related papers (2023-05-23T08:43:42Z) - OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource
Language Pair for Low-Resource Sentence Retrieval [91.76575626229824]
We present OneAligner, an alignment model specially designed for sentence retrieval tasks.
When trained with all language pairs of a large-scale parallel multilingual corpus (OPUS-100), this model achieves the state-of-the-art result.
We conclude through empirical results and analyses that the performance of the sentence alignment task depends mostly on the monolingual and parallel data size.
arXiv Detail & Related papers (2022-05-17T19:52:42Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z) - Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language
Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z) - MTOP: A Comprehensive Multilingual Task-Oriented Semantic Parsing
Benchmark [31.91964553419665]
We present a new multilingual dataset, called MTOP, comprising 100k annotated utterances in 6 languages across 11 domains.
We achieve an average improvement of +6.3 points on Slot F1 for the two existing multilingual datasets, over the best results reported in their experiments.
We demonstrate strong zero-shot performance using pre-trained models combined with automatic translation and alignment, and a proposed distant supervision method to reduce the noise in slot label projection.
arXiv Detail & Related papers (2020-08-21T07:02:11Z)
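One entry above reports a macro-average F1 of 0.93 and a false positive rate of 0.033 across 201 languages, and the main paper reports per-language dataset precision. As a point of reference, here is a hedged sketch of how such multiclass LangID metrics are typically computed with scikit-learn; the language codes and labels are toy data and do not reproduce any number from the papers.

```python
# Sketch of macro-average F1 and per-language false positive rate for a
# multiclass LangID evaluation. Toy labels only; standard scikit-learn calls.

from sklearn.metrics import confusion_matrix, f1_score

# Toy gold labels and predictions over three language codes.
y_true = ["en", "fr", "sw", "en", "sw", "fr", "en", "sw"]
y_pred = ["en", "fr", "sw", "fr", "sw", "fr", "en", "en"]
labels = ["en", "fr", "sw"]

# Macro-average F1: unweighted mean of per-language F1 scores, so
# low-resource languages count as much as high-resource ones.
macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro")
print(f"macro F1 = {macro_f1:.3f}")

# Per-language false positive rate from the confusion matrix:
# FPR(lang) = FP / (FP + TN), i.e. how often other languages are
# mislabeled as this one, relative to all non-lang examples.
cm = confusion_matrix(y_true, y_pred, labels=labels)
total = cm.sum()
for i, lang in enumerate(labels):
    fp = cm[:, i].sum() - cm[i, i]
    tn = total - cm[i, :].sum() - cm[:, i].sum() + cm[i, i]
    print(f"{lang}: FPR = {fp / (fp + tn):.3f}")
```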
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all summaries) and is not responsible for any consequences of its use.