Google Crowdsourced Speech Corpora and Related Open-Source Resources for
Low-Resource Languages and Dialects: An Overview
- URL: http://arxiv.org/abs/2010.06778v1
- Date: Wed, 14 Oct 2020 02:24:04 GMT
- Authors: Alena Butryna and Shan-Hui Cathy Chu and Isin Demirsahin and Alexander
Gutkin and Linne Ha and Fei He and Martin Jansche and Cibu Johny and Anna
Katanova and Oddur Kjartansson and Chenfang Li and Tatiana Merkulova and Yin
May Oo and Knot Pipatsrisawat and Clara Rivera and Supheakmungkol Sarin and
Pasindu de Silva and Keshan Sodimana and Richard Sproat and Theeraphol
Wattanavekin and Jaka Aris Eko Wibawa
- Abstract summary: We have released 38 datasets for building text-to-speech and automatic speech recognition applications.
The paper describes the methodology used for developing such corpora and presents some of our findings that could benefit under-represented language communities.
- Score: 43.92114369646489
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper presents an overview of a program designed to address the growing
need for developing freely available speech resources for under-represented
languages. At present we have released 38 datasets for building text-to-speech
and automatic speech recognition applications for languages and dialects of
South and Southeast Asia, Africa, Europe and South America. The paper describes
the methodology used for developing such corpora and presents some of our
findings that could benefit under-represented language communities.
Related papers
- Conversations in Galician: a Large Language Model for an
Underrepresented Language [2.433983268807517]
This paper introduces two novel resources designed to enhance Natural Language Processing (NLP) for the Galician language.
We present a Galician adaptation of the Alpaca dataset, comprising 52,000 instructions and demonstrations.
As a demonstration of the dataset utility, we fine-tuned LLaMA-7B to comprehend and respond in Galician, a language not originally supported by the model.
arXiv Detail & Related papers (2023-11-07T08:52:28Z)
- Contextualising Levels of Language Resourcedness affecting Digital
Processing of Text [0.5620321106679633]
We argue that the dichotomous typology of low-resource languages (LRL) and high-resource languages (HRL) is problematic when applied to all languages.
The characterization is based on the typology of contextual features for each category, rather than counting tools.
arXiv Detail & Related papers (2023-09-29T07:48:24Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Lip Reading for Low-resource Languages by Learning and Combining General
Speech Knowledge and Language-specific Knowledge [57.38948190611797]
This paper proposes a novel lip reading framework, especially for low-resource languages.
Because low-resource languages lack enough video-text paired data to train a model, developing lip reading models for them is considered challenging.
arXiv Detail & Related papers (2023-08-18T05:19:03Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Building African Voices [125.92214914982753]
This paper focuses on speech synthesis for low-resourced African languages.
We create a set of general-purpose instructions on building speech synthesis systems with minimum technological resources.
We release the speech data, code, and trained voices for 12 African languages to support researchers and developers.
arXiv Detail & Related papers (2022-07-01T23:28:16Z)
- Resources for Turkish Natural Language Processing: A critical survey [0.0]
We review a broad range of resources, focusing on the ones that are publicly available.
We present a set of recommendations, and identify gaps in the data available for conducting research and building applications in Turkish Linguistics and Natural Language Processing.
arXiv Detail & Related papers (2022-04-11T12:23:07Z)
- Toward More Meaningful Resources for Lower-resourced Languages [2.3513645401551333]
We examine the contents of the names stored in Wikidata for a few lower-resourced languages.
We discuss quality issues present in WikiAnn and evaluate whether it is a useful supplement to hand annotated data.
We conclude with recommended guidelines for resource development.
arXiv Detail & Related papers (2022-02-24T18:39:57Z)
- Cross-lingual Transfer for Speech Processing using Acoustic Language
Similarity [81.51206991542242]
Cross-lingual transfer offers a compelling way to help bridge this digital divide.
Current cross-lingual algorithms have shown success in text-based tasks and speech-related tasks over some low-resource languages.
We propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages.
arXiv Detail & Related papers (2021-11-02T01:55:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.