Related papers: Google Crowdsourced Speech Corpora and Related Open-Source Resources for Low-Resource Languages and Dialects: An Overview

URL: http://arxiv.org/abs/2010.06778v1
Date: Wed, 14 Oct 2020 02:24:04 GMT
Title: Google Crowdsourced Speech Corpora and Related Open-Source Resources for Low-Resource Languages and Dialects: An Overview
Authors: Alena Butryna and Shan-Hui Cathy Chu and Isin Demirsahin and Alexander Gutkin and Linne Ha and Fei He and Martin Jansche and Cibu Johny and Anna Katanova and Oddur Kjartansson and Chenfang Li and Tatiana Merkulova and Yin May Oo and Knot Pipatsrisawat and Clara Rivera and Supheakmungkol Sarin and Pasindu de Silva and Keshan Sodimana and Richard Sproat and Theeraphol Wattanavekin and Jaka Aris Eko Wibawa
Abstract summary: We have released 38 datasets for building text-to-speech and automatic speech recognition applications. The paper describes the methodology used for developing such corpora and presents some of our findings that could benefit under-represented language communities.
Score: 43.92114369646489
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: This paper presents an overview of a program designed to address the growing need for developing freely available speech resources for under-represented languages. At present we have released 38 datasets for building text-to-speech and automatic speech recognition applications for languages and dialects of South and Southeast Asia, Africa, Europe and South America. The paper describes the methodology used for developing such corpora and presents some of our findings that could benefit under-represented language communities.

Related papers

Building low-resource African language corpora: A case study of Kidawida, Kalenjin and Dholuo [0.815557531820863]
This paper presents a case study on the development of linguistic corpora for three under-resourced Kenyan languages, Kidaw'ida, Kalenjin, and Dholuo. Our project employed a selective crowd-sourcing methodology to collect text and speech data from native speakers of these languages. We made these resources freely accessible via open-research platforms, namely Zenodo for the parallel text corpora and Mozilla Common Voice for the speech datasets.
arXiv Detail & Related papers (2025-01-19T10:17:21Z)
A Survey on Spoken Italian Datasets and Corpora [0.3222802562733787]
This survey provides a comprehensive analysis of 66 spoken Italian datasets. The datasets are categorized by speech type, source and context, and demographic and linguistic features. Challenges related to dataset scarcity, representativeness, and accessibility are discussed.
arXiv Detail & Related papers (2025-01-11T14:33:57Z)
LIMBA: An Open-Source Framework for the Preservation and Valorization of Low-Resource Languages using Generative Models [62.47865866398233]
This white paper proposes a framework to generate linguistic tools for low-resource languages. By addressing the data scarcity that hinders intelligent applications for such languages, we contribute to promoting linguistic diversity.
arXiv Detail & Related papers (2024-11-20T16:59:41Z)
Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection [49.27067541740956]
Speech Emotion Recognition (SER) is a crucial component in developing general-purpose AI agents capable of natural human-computer interaction. Building robust multilingual SER systems remains challenging due to the scarcity of labeled data in languages other than English and Chinese. We propose an approach to enhance SER performance in languages with few SER resources by leveraging data from high-resource languages.
arXiv Detail & Related papers (2024-09-17T08:36:45Z)
Conversations in Galician: a Large Language Model for an Underrepresented Language [2.433983268807517]
This paper introduces two novel resources designed to enhance Natural Language Processing (NLP) for the Galician language. We present a Galician adaptation of the Alpaca dataset, comprising 52,000 instructions and demonstrations. As a demonstration of the dataset utility, we fine-tuned LLaMA-7B to comprehend and respond in Galician, a language not originally supported by the model.
arXiv Detail & Related papers (2023-11-07T08:52:28Z)
Contextualising Levels of Language Resourcedness affecting Digital Processing of Text [0.5620321106679633]
We argue that the dichotomous typology of low-resource languages (LRL) and high-resource languages (HRL) for all languages is problematic. The characterization is based on a typology of contextual features for each category, rather than on counting tools.
arXiv Detail & Related papers (2023-09-29T07:48:24Z)
Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge [57.38948190611797]
This paper proposes a novel lip reading framework, especially for low-resource languages. Because low-resource languages lack enough video-text paired data to train a model, developing lip reading models for them is regarded as challenging.
arXiv Detail & Related papers (2023-08-18T05:19:03Z)
Building African Voices [125.92214914982753]
This paper focuses on speech synthesis for low-resourced African languages. We create a set of general-purpose instructions on building speech synthesis systems with minimum technological resources. We release the speech data, code, and trained voices for 12 African languages to support researchers and developers.
arXiv Detail & Related papers (2022-07-01T23:28:16Z)
Resources for Turkish Natural Language Processing: A critical survey [0.0]
We review a broad range of resources, focusing on the ones that are publicly available. We present a set of recommendations, and identify gaps in the data available for conducting research and building applications in Turkish Linguistics and Natural Language Processing.
arXiv Detail & Related papers (2022-04-11T12:23:07Z)
Toward More Meaningful Resources for Lower-resourced Languages [2.3513645401551333]
We examine the contents of the names stored in Wikidata for a few lower-resourced languages. We discuss quality issues present in WikiAnn and evaluate whether it is a useful supplement to hand-annotated data. We conclude with recommended guidelines for resource development.
arXiv Detail & Related papers (2022-02-24T18:39:57Z)
Cross-lingual Transfer for Speech Processing using Acoustic Language Similarity [81.51206991542242]
Cross-lingual transfer offers a compelling way to help bridge the digital divide. Current cross-lingual algorithms have shown success in text-based tasks and speech-related tasks over some low-resource languages. We propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages.
arXiv Detail & Related papers (2021-11-02T01:55:17Z)

This list is automatically generated from the titles and abstracts of the papers on this site.