Evaluating Self-Supervised Speech Representations for Indigenous American Languages
- URL: http://arxiv.org/abs/2310.03639v2
- Date: Sun, 8 Oct 2023 23:28:50 GMT
- Title: Evaluating Self-Supervised Speech Representations for Indigenous American Languages
- Authors: Chih-Chen Chen, William Chen, Rodolfo Zevallos, John E. Ortega
- Abstract summary: We present an ASR corpus for Quechua, an indigenous South American language.
We benchmark the efficacy of large SSL models on Quechua, along with 6 other indigenous languages such as Guarani and Bribri, on low-resource ASR.
Our results show surprisingly strong performance by state-of-the-art SSL models, showing the potential generalizability of large-scale models to real-world data.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The application of self-supervision to speech representation learning has
garnered significant interest in recent years, due to its scalability to large
amounts of unlabeled data. However, much progress, both in terms of
pre-training and downstream evaluation, has remained concentrated in
monolingual models that only consider English. Few models consider other
languages, and even fewer consider indigenous ones. In our submission to the
New Language Track of the ASRU 2023 ML-SUPERB Challenge, we present an ASR
corpus for Quechua, an indigenous South American language. We benchmark the
efficacy of large SSL models on Quechua, along with 6 other indigenous
languages such as Guarani and Bribri, on low-resource ASR. Our results show
surprisingly strong performance by state-of-the-art SSL models, showing the
potential generalizability of large-scale models to real-world data.
Related papers
- Seamless Language Expansion: Enhancing Multilingual Mastery in Self-Supervised Models [60.09618700199927]
We propose adaptation methods that integrate LoRA into existing SSL models to extend them to new languages.
We also develop preservation strategies, including data combination and re-clustering, to retain performance on existing languages.
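The low-rank adaptation idea described above can be sketched as a frozen weight matrix augmented with a trainable low-rank update. The following is a minimal illustration; the shapes, rank, and scaling are common LoRA conventions and illustrative assumptions, not details taken from the paper:

```python
import numpy as np

# Minimal sketch of a LoRA-style adapter: the frozen pretrained weight W is
# augmented with a low-rank update B @ A, so only A and B are trained when
# extending the model to a new language.
class LoRALinear:
    def __init__(self, d_in, d_out, rank=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen pretrained weight
        self.A = rng.standard_normal((rank, d_in)) * 0.01   # trainable down-projection
        self.B = np.zeros((d_out, rank))                    # trainable up-projection (init 0)
        self.scale = alpha / rank

    def __call__(self, x):
        # Base output plus scaled low-rank correction; because B is zero at
        # initialization, the adapted layer initially matches the pretrained one.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(d_in=16, d_out=16)
x = np.ones((2, 16))
assert np.allclose(layer(x), x @ layer.W.T)  # identical to the base layer at init
```

Because the update is zero at initialization, the pretrained model's behavior on existing languages is preserved exactly until adapter training begins, which is one reason LoRA pairs naturally with the preservation strategies mentioned above.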
arXiv Detail & Related papers (2024-06-20T08:13:30Z) - SeaLLMs -- Large Language Models for Southeast Asia [76.50157503379086]
We introduce SeaLLMs, an innovative series of language models that specifically focuses on Southeast Asian (SEA) languages.
SeaLLMs are built upon the Llama-2 model and further advanced through continued pre-training with an extended vocabulary, specialized instruction and alignment tuning.
Our comprehensive evaluation demonstrates that SeaLLM-13b models exhibit superior performance across a wide spectrum of linguistic tasks and assistant-style instruction-following capabilities.
arXiv Detail & Related papers (2023-12-01T17:17:56Z) - ML-SUPERB: Multilingual Speech Universal PERformance Benchmark [73.65853301350042]
Speech processing Universal PERformance Benchmark (SUPERB) is a leaderboard to benchmark the performance of Self-Supervised Learning (SSL) models on various speech processing tasks.
This paper presents multilingual SUPERB, covering 143 languages (ranging from high-resource to endangered), and considering both automatic speech recognition and language identification.
Similar to the SUPERB benchmark, we find speech SSL models can significantly improve performance compared to FBANK features.
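The FBANK baseline that SSL models are compared against above is a classical log-mel filterbank feature. A minimal extraction sketch follows; the frame, FFT, and mel-band settings are common defaults and assumptions here, not the benchmark's exact configuration:

```python
import numpy as np

def log_mel_fbank(wave, sr=16000, n_fft=400, hop=160, n_mels=23):
    """Minimal log-mel filterbank (FBANK) feature extraction."""
    # Windowed short-time power spectrum
    frames = [wave[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(wave) - n_fft + 1, hop)]
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Triangular mel filterbank spanning 0 .. sr/2
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return np.log(spec @ fb.T + 1e-10)

# One second of noise at 16 kHz yields a (frames, n_mels) feature matrix.
feats = log_mel_fbank(np.random.default_rng(0).standard_normal(16000))
```

Features like these are fixed and language-agnostic, which is why learned SSL representations can outperform them so consistently across the 143 languages in the benchmark.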
arXiv Detail & Related papers (2023-05-18T00:01:27Z) - Exploration of Language Dependency for Japanese Self-Supervised Speech Representation Models [18.22157315310462]
Self-supervised learning (SSL) has been dramatically successful not only in monolingual but also in cross-lingual settings.
In this paper, we investigate how effective a cross-lingual model is in comparison with a monolingual model.
We examine how much unlabeled data collected in Japanese is needed to achieve performance comparable to a cross-lingual model pre-trained with tens of thousands of hours of English and/or multilingual data.
arXiv Detail & Related papers (2023-05-09T06:28:10Z) - Lessons learned from the evaluation of Spanish Language Models [27.653133576469276]
We present a head-to-head comparison of language models for Spanish and report the results.
We argue that more research is needed to understand the factors underlying them.
The recent activity in the development of language technology for Spanish is to be welcomed, but our results show that building language models remains an open, resource-heavy problem.
arXiv Detail & Related papers (2022-12-16T10:33:38Z) - Language Contamination Explains the Cross-lingual Capabilities of English Pretrained Models [79.38278330678965]
We find that common English pretraining corpora contain significant amounts of non-English text.
This leads to hundreds of millions of foreign language tokens in large-scale datasets.
We then demonstrate that even these small percentages of non-English data facilitate cross-lingual transfer for models trained on them.
arXiv Detail & Related papers (2022-04-17T23:56:54Z) - Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios? [0.0]
We focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi.
We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank achieves performance close to that of the same architecture pre-trained on large multilingual and monolingual corpora.
arXiv Detail & Related papers (2021-10-26T14:59:16Z) - Language Models are Few-shot Multilingual Learners [66.11011385895195]
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages.
We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
arXiv Detail & Related papers (2021-09-16T03:08:22Z) - On the Multilingual Capabilities of Very Large-Scale English Language Models [0.0]
Generative Pre-trained Transformers (GPTs) have recently been scaled to unprecedented sizes in the history of machine learning.
In this work, we investigate the multilingual skills of GPT-3, focusing on one language that barely appears in the pre-training corpus, Catalan.
We find that the model shows outstanding performance, particularly in generative tasks, with predictable limitations mostly in language understanding tasks, but still with remarkable results given the zero-shot scenario.
arXiv Detail & Related papers (2021-08-30T16:18:50Z) - Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
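The contrastive task over masked latent representations mentioned in the XLSR entry can be sketched as an InfoNCE-style loss: the context vector at a masked position must identify the true latent among random distractors. This is a simplified single-position illustration, not wav2vec 2.0's full objective (it omits quantization, the diversity loss, and batching):

```python
import numpy as np

def info_nce_loss(context, target, distractors, temperature=0.1):
    """Contrastive loss at one masked position: the context vector must
    pick out the true latent (index 0) among distractor latents."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    candidates = [target] + list(distractors)
    logits = np.array([cos(context, c) / temperature for c in candidates])
    # Negative log-softmax probability assigned to the true latent
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]

rng = np.random.default_rng(0)
context = rng.standard_normal(32)
distractors = [rng.standard_normal(32) for _ in range(10)]
# When the context matches the true latent, the loss is much smaller than
# when the true latent is indistinguishable from the distractors.
loss_easy = info_nce_loss(context, context, distractors)
loss_hard = info_nce_loss(context, rng.standard_normal(32), distractors)
```

Minimizing this loss pushes the context network to encode whatever distinguishes the true latent from distractors, which is the mechanism that lets a single model learn shared representations from raw multilingual speech.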
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of these generated summaries (or of any other information) and is not responsible for any consequences of their use.