Evaluating Self-Supervised Speech Representations for Indigenous American Languages
- URL: http://arxiv.org/abs/2310.03639v2
- Date: Sun, 8 Oct 2023 23:28:50 GMT
- Title: Evaluating Self-Supervised Speech Representations for Indigenous American Languages
- Authors: Chih-Chen Chen, William Chen, Rodolfo Zevallos, John E. Ortega
- Abstract summary: We present an ASR corpus for Quechua, an indigenous South American language.
We benchmark the efficacy of large SSL models on Quechua, along with 6 other indigenous languages such as Guarani and Bribri, on low-resource ASR.
Our results show surprisingly strong performance by state-of-the-art SSL models, showing the potential generalizability of large-scale models to real-world data.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The application of self-supervision to speech representation learning has
garnered significant interest in recent years, due to its scalability to large
amounts of unlabeled data. However, much progress, both in terms of
pre-training and downstream evaluation, has remained concentrated in
monolingual models that only consider English. Few models consider other
languages, and even fewer consider indigenous ones. In our submission to the
New Language Track of the ASRU 2023 ML-SUPERB Challenge, we present an ASR
corpus for Quechua, an indigenous South American language. We benchmark the
efficacy of large SSL models on Quechua, along with 6 other indigenous
languages such as Guarani and Bribri, on low-resource ASR. Our results show
surprisingly strong performance by state-of-the-art SSL models, showing the
potential generalizability of large-scale models to real-world data.
Related papers
- Seamless Language Expansion: Enhancing Multilingual Mastery in Self-Supervised Models [60.09618700199927]
We propose adaptation methods that integrate LoRA into existing SSL models to extend them to new languages.
We also develop preservation strategies, including data combination and re-clustering, to retain performance on existing languages.
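The low-rank adaptation idea described above can be sketched as a frozen weight matrix augmented with a trainable low-rank update. The following is a minimal illustration; the shapes, rank, and scaling are common LoRA conventions and illustrative assumptions, not details taken from the paper:

```python
import numpy as np

# Minimal sketch of a LoRA-style adapter: the frozen pretrained weight W is
# augmented with a low-rank update B @ A, so only A and B are trained when
# extending the model to a new language.
class LoRALinear:
    def __init__(self, d_in, d_out, rank=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen pretrained weight
        self.A = rng.standard_normal((rank, d_in)) * 0.01   # trainable down-projection
        self.B = np.zeros((d_out, rank))                    # trainable up-projection (init 0)
        self.scale = alpha / rank

    def __call__(self, x):
        # Base output plus scaled low-rank correction; because B is zero at
        # initialization, the adapted layer initially matches the pretrained one.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(d_in=16, d_out=16)
x = np.ones((2, 16))
assert np.allclose(layer(x), x @ layer.W.T)  # identical to the base layer at init
```

Because the update is zero at initialization, the pretrained model's behavior on existing languages is preserved exactly until adapter training begins, which is one reason LoRA pairs naturally with the preservation strategies mentioned above.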
arXiv Detail & Related papers (2024-06-20T08:13:30Z) - SeaLLMs -- Large Language Models for Southeast Asia [76.50157503379086]
We introduce SeaLLMs, an innovative series of language models that specifically focuses on Southeast Asian (SEA) languages.
SeaLLMs are built upon the Llama-2 model and further advanced through continued pre-training with an extended vocabulary, specialized instruction and alignment tuning.
Our comprehensive evaluation demonstrates that SeaLLM-13b models exhibit superior performance across a wide spectrum of linguistic tasks and assistant-style instruction-following capabilities.
arXiv Detail & Related papers (2023-12-01T17:17:56Z) - ML-SUPERB: Multilingual Speech Universal PERformance Benchmark [73.65853301350042]
Speech processing Universal PERformance Benchmark (SUPERB) is a leaderboard to benchmark the performance of Self-Supervised Learning (SSL) models on various speech processing tasks.
This paper presents multilingual SUPERB, covering 143 languages (ranging from high-resource to endangered), and considering both automatic speech recognition and language identification.
Similar to the SUPERB benchmark, we find speech SSL models can significantly improve performance compared to FBANK features.
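The FBANK baseline that SSL models are compared against above is a classical log-mel filterbank feature. A minimal extraction sketch follows; the frame, FFT, and mel-band settings are common defaults and assumptions here, not the benchmark's exact configuration:

```python
import numpy as np

def log_mel_fbank(wave, sr=16000, n_fft=400, hop=160, n_mels=23):
    """Minimal log-mel filterbank (FBANK) feature extraction."""
    # Windowed short-time power spectrum
    frames = [wave[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(wave) - n_fft + 1, hop)]
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Triangular mel filterbank spanning 0 .. sr/2
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return np.log(spec @ fb.T + 1e-10)

# One second of noise at 16 kHz yields a (frames, n_mels) feature matrix.
feats = log_mel_fbank(np.random.default_rng(0).standard_normal(16000))
```

Features like these are fixed and language-agnostic, which is why learned SSL representations can outperform them so consistently across the 143 languages in the benchmark.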
arXiv Detail & Related papers (2023-05-18T00:01:27Z) - Exploration of Language Dependency for Japanese Self-Supervised Speech Representation Models [18.22157315310462]
Self-supervised learning (SSL) has been dramatically successful not only in monolingual but also in cross-lingual settings.
In this paper, we investigate how effective a cross-lingual model is in comparison with a monolingual model.
We examine how much unlabeled data collected in Japanese is needed to achieve performance comparable to a cross-lingual model pre-trained with tens of thousands of hours of English and/or multilingual data.
arXiv Detail & Related papers (2023-05-09T06:28:10Z) - Lessons learned from the evaluation of Spanish Language Models [27.653133576469276]
We present a head-to-head comparison of language models for Spanish and report the results.
We argue that more research is needed to understand the factors underlying them.
The recent activity in the development of language technology for Spanish is to be welcomed, but our results show that building language models remains an open, resource-heavy problem.
arXiv Detail & Related papers (2022-12-16T10:33:38Z) - Language Contamination Explains the Cross-lingual Capabilities of English Pretrained Models [79.38278330678965]
We find that common English pretraining corpora contain significant amounts of non-English text.
This leads to hundreds of millions of foreign language tokens in large-scale datasets.
We then demonstrate that even these small percentages of non-English data facilitate cross-lingual transfer for models trained on them.
arXiv Detail & Related papers (2022-04-17T23:56:54Z) - Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios? [0.0]
We focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi.
We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank achieves performance close to that of the same architecture pre-trained on large multilingual and monolingual corpora.
arXiv Detail & Related papers (2021-10-26T14:59:16Z) - Language Models are Few-shot Multilingual Learners [66.11011385895195]
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages.
We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
arXiv Detail & Related papers (2021-09-16T03:08:22Z) - On the Multilingual Capabilities of Very Large-Scale English Language Models [0.0]
Generative Pre-trained Transformers (GPTs) have recently been scaled to unprecedented sizes in the history of machine learning.
In this work, we investigate the multilingual skills of GPT-3, focusing on one language that barely appears in the pre-training corpus, Catalan.
We find that the model shows outstanding performance, particularly in generative tasks, with predictable limitations mostly in language understanding tasks, but still with remarkable results given the zero-shot scenario.
arXiv Detail & Related papers (2021-08-30T16:18:50Z) - Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
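The contrastive task over masked latent representations mentioned in the XLSR entry can be sketched as an InfoNCE-style loss: the context vector at a masked position must identify the true latent among random distractors. This is a simplified single-position illustration, not wav2vec 2.0's full objective (it omits quantization, the diversity loss, and batching):

```python
import numpy as np

def info_nce_loss(context, target, distractors, temperature=0.1):
    """Contrastive loss at one masked position: the context vector must
    pick out the true latent (index 0) among distractor latents."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    candidates = [target] + list(distractors)
    logits = np.array([cos(context, c) / temperature for c in candidates])
    # Negative log-softmax probability assigned to the true latent
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]

rng = np.random.default_rng(0)
context = rng.standard_normal(32)
distractors = [rng.standard_normal(32) for _ in range(10)]
# When the context matches the true latent, the loss is much smaller than
# when the true latent is indistinguishable from the distractors.
loss_easy = info_nce_loss(context, context, distractors)
loss_hard = info_nce_loss(context, rng.standard_normal(32), distractors)
```

Minimizing this loss pushes the context network to encode whatever distinguishes the true latent from distractors, which is the mechanism that lets a single model learn shared representations from raw multilingual speech.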
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of these generated summaries (or of any other information) and is not responsible for any consequences of their use.