LIMIT: Language Identification, Misidentification, and Translation using
Hierarchical Models in 350+ Languages
- URL: http://arxiv.org/abs/2305.14263v2
- Date: Mon, 6 Nov 2023 16:29:21 GMT
- Title: LIMIT: Language Identification, Misidentification, and Translation using
Hierarchical Models in 350+ Languages
- Authors: Milind Agarwal, Md Mahfuz Ibn Alam, Antonios Anastasopoulos
- Abstract summary: Current systems cannot accurately identify most of the world's 7000 languages.
We first compile a corpus, MCS-350, of 50K multilingual and parallel children's stories in 350+ languages.
We propose a novel misprediction-resolution hierarchical model, LIMIT, for language identification.
- Score: 27.675441924635294
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Knowing the language of an input text/audio is a necessary first step for
using almost every NLP tool, such as taggers, parsers, or translation systems.
Language identification is a well-studied problem, sometimes even considered
solved; in reality, due to lack of data and computational challenges, current
systems cannot accurately identify most of the world's 7000 languages. To
tackle this bottleneck, we first compile a corpus, MCS-350, of 50K multilingual
and parallel children's stories in 350+ languages. MCS-350 can serve as a
benchmark for language identification of short texts and for 1400+ new
translation directions in low-resource Indian and African languages. Second, we
propose a novel misprediction-resolution hierarchical model, LIMIT, for
language identification that reduces error by 55% (from 0.71 to 0.32) on our
compiled children's stories dataset and by 40% (from 0.23 to 0.14) on the
FLORES-200 benchmark. Our method can expand language identification coverage
into low-resource languages by relying solely on systemic misprediction
patterns, bypassing the need to retrain large models from scratch.
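The abstract gives no code, but the misprediction-resolution idea lends itself to a short sketch. Below is a minimal illustration, assuming a generic off-the-shelf base LID classifier and small per-group resolvers; the confusion groups, language codes, and model objects are hypothetical, not taken from the paper.

```python
# Minimal sketch of misprediction-resolution hierarchical language ID.
# `base_lid` and the resolvers are hypothetical stand-ins for any trained
# classifiers; the confusion groups below are illustrative examples only.

# Confusion groups observed on held-out data: the base model systematically
# mislabels some low-resource languages as a related high-resource one.
CONFUSION_GROUPS = {
    "hin": ["hin", "bho", "mai", "awa"],  # e.g. Hindi absorbing Bhojpuri, Maithili, Awadhi
    "swa": ["swa", "kik", "luo"],         # e.g. Swahili absorbing other East African languages
}

def predict_language(text, base_lid, resolvers):
    """Two-stage prediction: base model first, then a per-group resolver."""
    coarse = base_lid.predict(text)
    if coarse in CONFUSION_GROUPS:
        # Route to a small classifier trained only on the confusable set;
        # the large base model itself is never retrained.
        return resolvers[coarse].predict(text)
    return coarse
```

The design point matches the abstract's claim: only the small resolvers need training, so coverage extends to new low-resource languages without retraining the base model from scratch.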
Related papers
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
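As a rough illustration of lexicon-only pretraining, the sketch below fine-tunes a multilingual encoder on word/label pairs alone, with no sentence-level sentiment data. The model name, lexicon entries, and single-example loop are assumptions for illustration, not the paper's actual setup.

```python
# Hedged sketch: a multilingual sentiment lexicon as the entire training signal.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LEXICON = [("good", 1), ("terrible", 0), ("mzuri", 1), ("mbaya", 0)]  # hypothetical entries

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=2)
opt = torch.optim.AdamW(model.parameters(), lr=2e-5)

for word, label in LEXICON:  # in practice: shuffled batches over a large lexicon
    batch = tok(word, return_tensors="pt")
    loss = model(**batch, labels=torch.tensor([label])).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
# After training, the same model scores full sentences in unseen languages zero-shot.
```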
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- Paramanu: A Family of Novel Efficient Generative Foundation Language Models for Indian Languages [3.9018931027384056]
We present "Paramanu", a family of novel language models (LM) for Indian languages.
It covers 10 languages (Assamese, Bangla, Hindi, Konkani, Maithili, Marathi, Odia, Sanskrit, Tamil, Telugu) across 5 scripts.
The models are pretrained on a single GPU with a context size of 1024 and range in size from 13.29 million (M) to 367.5 M parameters.
arXiv Detail & Related papers (2024-01-31T17:58:10Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
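The summary is terse, so the sketch below is only one plausible reading of instance-conditioned prompt retrieval, written in PyTorch. The pool size, dimensions, and soft-mixture rule are assumptions, not the actual XLM-P architecture.

```python
# Speculative sketch: retrieve a soft mixture of learned prompts per instance.
import torch
import torch.nn as nn

class PromptRetriever(nn.Module):
    def __init__(self, num_prompts=8, dim=768, prompt_len=4):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_prompts, dim))
        self.prompts = nn.Parameter(torch.randn(num_prompts, prompt_len, dim))

    def forward(self, instance_repr):  # (batch, dim), e.g. mean-pooled token embeddings
        weights = torch.softmax(instance_repr @ self.keys.T, dim=-1)  # (batch, num_prompts)
        # Soft mixture over the pool: a "soft cluster" of prompts per instance.
        mixed = torch.einsum("bn,nld->bld", weights, self.prompts)    # (batch, prompt_len, dim)
        return mixed  # prepended to the token embeddings before encoding
```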
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low- and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- Learning Natural Language Generation from Scratch [25.984828046001013]
This paper introduces TRUncated ReinForcement Learning for Language (TrufLL), an original approach to training conditional language models from scratch using only reinforcement learning (RL).
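The name suggests the core mechanism: the action space of the from-scratch policy is truncated at each step to tokens a frozen, task-agnostic language model considers plausible. Here is a hedged sketch of one such step; the function name and top-k rule are illustrative assumptions.

```python
# Sketch of truncated action-space sampling for RL-based generation.
import torch

def truncated_action_step(policy_logits, lm_logits, k=50):
    """Restrict the RL policy's next-token choice to the frozen LM's top-k tokens."""
    shortlist = torch.topk(lm_logits, k).indices           # plausible tokens per the frozen LM
    masked = torch.full_like(policy_logits, float("-inf"))
    masked[shortlist] = policy_logits[shortlist]           # keep policy scores only on the shortlist
    return torch.distributions.Categorical(logits=masked).sample()  # next token id
```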
arXiv Detail & Related papers (2021-09-20T08:46:51Z)
- Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus [15.807197703827818]
We train LangID models on up to 1,629 languages with comparable quality on held-out test sets.
We find that human-judged LangID accuracy for web-crawl text corpora created using these models is only around 5% for many lower-resource languages.
We propose two classes of techniques to mitigate these errors: wordlist-based tunable-precision filters and transformer-based semi-supervised LangID models.
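The wordlist-based filter is simple enough to sketch directly: keep a crawled line for language L only if enough of its tokens appear in a trusted wordlist for L, with the threshold trading recall for precision. The tokenization, threshold, and example wordlist below are assumptions, not the paper's implementation.

```python
# Hedged sketch of a wordlist-based tunable-precision filter for LangID output.
def wordlist_filter(sentence, wordlist, threshold=0.5):
    tokens = sentence.lower().split()
    if not tokens:
        return False
    hits = sum(token in wordlist for token in tokens)
    return hits / len(tokens) >= threshold  # raise the threshold for higher precision

# Example with a tiny hypothetical Swahili wordlist.
swahili_words = {"habari", "nzuri", "sana", "asante"}
print(wordlist_filter("habari nzuri sana", swahili_words, threshold=0.6))  # True
```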
arXiv Detail & Related papers (2020-10-27T19:29:17Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in a single language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
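The amalgamation step can be sketched as standard multi-teacher distillation; the uniform averaging and temperature below are assumptions, with the logits standing in for the paper's actual language-branch teachers and single student.

```python
# Hedged sketch: distill several language-branch teachers into one student.
import torch
import torch.nn.functional as F

def distill_step(student_logits, teacher_logits_per_branch, T=2.0):
    """Average KL divergence from each language-branch teacher to the student."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    losses = [
        F.kl_div(log_p_student, F.softmax(t / T, dim=-1), reduction="batchmean") * T * T
        for t in teacher_logits_per_branch
    ]
    return sum(losses) / len(losses)  # one student learns from all branches
```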
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short of translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.