Language Identification for Austronesian Languages
- URL: http://arxiv.org/abs/2206.04327v1
- Date: Thu, 9 Jun 2022 08:08:18 GMT
- Title: Language Identification for Austronesian Languages
- Authors: Jonathan Dunn and Wikke Nijhof
- Abstract summary: This paper provides language identification models for low- and under-resourced languages in the Pacific region.
We combine 29 Austronesian languages with 171 non-Austronesian languages to create an evaluation set.
Further experiments adapt these language identification models for code-switching detection, achieving high accuracy across all 29 languages.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper provides language identification models for low- and
under-resourced languages in the Pacific region with a focus on previously
unavailable Austronesian languages. Accurate language identification is an
important part of developing language resources. The approach taken in this
paper combines 29 Austronesian languages with 171 non-Austronesian languages to
create an evaluation set drawn from eight data sources. After evaluating six
approaches to language identification, we find that a classifier based on
skip-gram embeddings achieves significantly higher performance than alternative
methods. We then systematically increase the number of non-Austronesian
languages in the model up to a total of 800 languages to evaluate whether an
increased language inventory leads to less precise predictions for the
Austronesian languages of interest. This evaluation finds that there is only a
minimal impact on accuracy caused by increasing the inventory of
non-Austronesian languages. Further experiments adapt these language
identification models for code-switching detection, achieving high accuracy
across all 29 languages.
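The pipeline described in the abstract (character-level skip-gram embeddings feeding a classifier, later adapted to code-switching detection) can be sketched with off-the-shelf tools. The sketch below is a minimal approximation, not the authors' released code: it assumes fastText for the skip-gram embeddings and scikit-learn for the classifier, and the file name, toy training data, and window size are all hypothetical.
```python
# Minimal sketch of a skip-gram-embedding language-ID classifier.
# Assumptions (not from the paper): fastText embeddings, scikit-learn
# classifier; file names and toy data are hypothetical.
import fasttext
import numpy as np
from sklearn.linear_model import LogisticRegression

# 1) Skip-gram embeddings with character n-grams, trained on
#    unlabelled multilingual text (one sentence per line).
emb = fasttext.train_unsupervised(
    "multilingual_corpus.txt", model="skipgram",
    dim=100, minn=2, maxn=4,
)

def featurize(texts):
    # Subword-aware sentence vectors as fixed-size features.
    return np.array([emb.get_sentence_vector(t) for t in texts])

# 2) Linear classifier over the embeddings; real training data would
#    come from the 200-language evaluation set.
train_texts = ["Kei te pēhea koe?", "How are you?"]  # toy examples
train_labels = ["mri", "eng"]
clf = LogisticRegression(max_iter=1000).fit(featurize(train_texts), train_labels)

# 3) A crude adaptation to code-switching detection: slide a window
#    over the tokens and flag points where the predicted label changes.
def code_switch_points(tokens, window=3):
    spans = [" ".join(tokens[i:i + window]) for i in range(len(tokens) - window + 1)]
    labels = clf.predict(featurize(spans))
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]
```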
Related papers
- The Role of Language Imbalance in Cross-lingual Generalisation: Insights from Cloned Language Experiments [57.273662221547056]
In this study, we investigate a novel and unintuitive driver of cross-lingual generalisation: language imbalance.
We observe that the existence of a predominant language during training boosts the performance of less frequent languages.
As we extend the analysis to real languages, we find that infrequent languages still benefit from frequent ones, but whether language imbalance causes cross-lingual generalisation in that setting remains inconclusive.
arXiv Detail & Related papers (2024-04-11T17:58:05Z)
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
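One way to read the lexicon-pretraining idea above: use a multilingual sentiment lexicon to weakly label unannotated sentences, then pretrain on those labels instead of sentence-level annotations. The snippet below is an illustrative guess at such a labeller; the tiny lexicon and the skip rule are invented, not taken from the paper.
```python
# Hypothetical lexicon-based weak labelling for sentiment pretraining.
# The lexicon entries and tie-breaking rule are invented for illustration.
LEXICON = {
    "good": 1, "love": 1, "bueno": 1, "magandá": 1,
    "bad": -1, "hate": -1, "malo": -1, "masamâ": -1,
}

def weak_label(sentence):
    """Return 'pos', 'neg', or None (skip) based on lexicon hits."""
    score = sum(LEXICON.get(tok.lower(), 0) for tok in sentence.split())
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return None  # no lexical signal: leave the sentence out of pretraining

corpus = ["El clima es bueno", "I hate waiting", "The sky is blue"]
labelled = [(s, weak_label(s)) for s in corpus if weak_label(s)]
```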
- Detecting Languages Unintelligible to Multilingual Models through Local Structure Probes [15.870989191524094]
We develop a general approach that requires only unlabelled text to detect which languages are not well understood by a cross-lingual model.
Our approach is derived from the hypothesis that if a model's understanding is insensitive to perturbations to text in a language, it is likely to have a limited understanding of that language.
arXiv Detail & Related papers (2022-11-09T16:45:16Z)
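The probe above rests on a simple check: if a model's representation barely moves when the text is locally perturbed, the model likely is not really processing that language. Below is one plausible instantiation, assuming XLM-R sentence embeddings and adjacent-character swaps; the paper's actual probes may differ.
```python
# One possible local-structure probe (assumptions: XLM-R embeddings,
# random adjacent-character swaps; the paper's probes may differ).
import random
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

def embed(text):
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"))
    return out.last_hidden_state.mean(dim=1).squeeze(0)  # mean-pooled

def swap_chars(text, n_swaps=5, seed=0):
    rng = random.Random(seed)
    chars = list(text)
    for _ in range(n_swaps):
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def sensitivity(text):
    # Low values = the model barely reacts to the perturbation,
    # suggesting it has limited understanding of this language.
    orig, pert = embed(text), embed(swap_chars(text))
    return 1 - torch.cosine_similarity(orig, pert, dim=0).item()
```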
- Transfer Language Selection for Zero-Shot Cross-Lingual Abusive Language Detection [2.2998722397348335]
Instead of preparing a dataset for every language, we demonstrate the effectiveness of cross-lingual transfer learning for zero-shot abusive language detection.
Our datasets cover seven languages from three language families.
arXiv Detail & Related papers (2022-06-02T09:53:15Z)
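The zero-shot recipe above amounts to: train on labelled data in one language with a multilingual encoder, then evaluate directly on another. A frozen-encoder sketch of that idea follows; the paper likely fine-tunes the full model, and the toy data here is invented.
```python
# Frozen-encoder sketch of zero-shot cross-lingual transfer
# (an approximation; fine-tuning the full encoder is more common).
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
enc = AutoModel.from_pretrained("xlm-roberta-base")

def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)     # ignore padding
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Train only on the source language (hypothetical toy data)...
src_texts = ["you are an idiot", "have a nice day"]
src_labels = [1, 0]  # 1 = abusive
clf = LogisticRegression().fit(embed(src_texts), src_labels)

# ...then predict zero-shot on an unseen target language.
tgt_texts = ["olet idiootti"]  # Finnish, never seen during training
print(clf.predict(embed(tgt_texts)))
```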
- Automatic Spoken Language Identification using a Time-Delay Neural Network [0.0]
A language identification system was built to distinguish between Arabic, Spanish, French, and Turkish.
A pre-existing multilingual dataset was used to train a series of acoustic models.
The system was provided with a custom multilingual language model and a specialized pronunciation lexicon.
arXiv Detail & Related papers (2022-05-19T13:47:48Z)
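A time-delay neural network is essentially a stack of dilated 1-D convolutions over acoustic frames. The PyTorch sketch below shows that structure as a four-way language classifier; the layer sizes are illustrative guesses, not the paper's configuration (which also involved a multilingual language model and a pronunciation lexicon).
```python
# Illustrative TDNN for 4-way spoken language ID (Arabic, Spanish,
# French, Turkish). Layer sizes are invented, not the paper's config.
import torch
import torch.nn as nn
import torchaudio

class TDNN(nn.Module):
    def __init__(self, n_mfcc=40, n_langs=4):
        super().__init__()
        self.mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=n_mfcc)
        # A TDNN is a stack of dilated 1-D convolutions over frames.
        self.frames = nn.Sequential(
            nn.Conv1d(n_mfcc, 256, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, dilation=3), nn.ReLU(),
        )
        self.out = nn.Linear(256, n_langs)

    def forward(self, waveform):             # (batch, samples)
        feats = self.mfcc(waveform)          # (batch, n_mfcc, frames)
        hidden = self.frames(feats)          # (batch, 256, frames')
        return self.out(hidden.mean(dim=2))  # average pooling over time

model = TDNN()
logits = model(torch.randn(2, 16000))        # two 1-second dummy clips
```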
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
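The grouping step above reduces to clustering one vector per language. A minimal sketch, assuming per-language embeddings have already been extracted (the random vectors below are stand-ins, not real model representations):
```python
# Sketch of clustering languages into "representation sprachbunds".
# Assumes one embedding per language; random vectors stand in for
# real representations from a multilingual pre-trained model.
import numpy as np
from sklearn.cluster import KMeans

languages = ["mri", "smo", "fij", "eng", "deu", "cmn"]
rng = np.random.default_rng(0)
lang_vectors = rng.normal(size=(len(languages), 768))  # placeholder

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(lang_vectors)
for lang, group in zip(languages, kmeans.labels_):
    print(lang, "-> sprachbund", group)
```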
- Inducing Language-Agnostic Multilingual Representations [61.97381112847459]
Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world.
We examine three approaches for this: (i) re-aligning the vector spaces of target languages to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering.
arXiv Detail & Related papers (2020-08-20T17:58:56Z)
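Approach (ii) above is a per-language standardisation of the embedding space. A sketch of that operation in NumPy (the array shapes and toy data are assumptions):
```python
# Per-language mean/variance removal (approach ii above), sketched
# with NumPy; shapes are assumptions, not the paper's exact setup.
import numpy as np

def remove_language_stats(embeddings, lang_ids):
    """Standardise each language's embeddings to zero mean, unit variance."""
    out = np.empty_like(embeddings)
    for lang in set(lang_ids):
        idx = [i for i, l in enumerate(lang_ids) if l == lang]
        block = embeddings[idx]
        out[idx] = (block - block.mean(axis=0)) / (block.std(axis=0) + 1e-8)
    return out

embs = np.random.randn(6, 768)
langs = ["en", "en", "en", "mi", "mi", "mi"]
aligned = remove_language_stats(embs, langs)
```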
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
- Rnn-transducer with language bias for end-to-end Mandarin-English code-switching speech recognition [58.105818353866354]
We propose an improved recurrent neural network transducer (RNN-T) model with language bias to alleviate the code-switching problem.
We use the language identities to bias the model to predict the CS points.
This encourages the model to learn language identity information directly from the transcription, so no additional LID model is needed.
arXiv Detail & Related papers (2020-02-19T12:01:33Z)
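The bias mechanism above works by exposing language identities in the transcription itself. A hypothetical preprocessing step in that spirit tags each transcript token with a language marker derived from its script; the tag format and heuristic below are invented for illustration, not the paper's scheme.
```python
# Hypothetical transcript tagging for language-biased RNN-T training.
# Tag names and the script heuristic are invented for illustration.
def tag_language_identity(tokens):
    tagged, prev = [], None
    for tok in tokens:
        lang = "<zh>" if any("\u4e00" <= ch <= "\u9fff" for ch in tok) else "<en>"
        if lang != prev:            # marks a potential code-switch point
            tagged.append(lang)
            prev = lang
        tagged.append(tok)
    return tagged

print(tag_language_identity(["我", "想", "买", "iPhone", "十三"]))
# ['<zh>', '我', '想', '买', '<en>', 'iPhone', '<zh>', '十三']
```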
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.