Learnings from Technological Interventions in a Low Resource Language: A
Case-Study on Gondi
- URL: http://arxiv.org/abs/2004.10270v2
- Date: Wed, 27 Jan 2021 03:44:44 GMT
- Authors: Devansh Mehta, Sebastin Santy, Ramaravind Kommiya Mothilal, Brij Mohan
Lal Srivastava, Alok Sharma, Anurag Shukla, Vishnu Prasad, Venkanna U, Amit
Sharma, Kalika Bali
- Abstract summary: Gondi is a low-resource vulnerable language spoken by around 2.3 million tribal people in south and central India.
At the end of these interventions, we collected a little less than 12,000 translated words and/or sentences.
The larger goal of the project is collecting enough data in Gondi to build and deploy viable language technologies.
- Score: 13.9876704685177
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The primary obstacle to developing technologies for low-resource languages is
the lack of usable data. In this paper, we report the adoption and deployment
of 4 technology-driven methods of data collection for Gondi, a low-resource
vulnerable language spoken by around 2.3 million tribal people in south and
central India. In the process of data collection, we also help in its revival
by expanding access to information in Gondi through the creation of linguistic
resources that can be used by the community, such as a dictionary, children's
stories, an app with Gondi content from multiple sources and an Interactive
Voice Response (IVR) based mass awareness platform. At the end of these
interventions, we collected a little less than 12,000 translated words and/or
sentences and identified more than 650 community members whose help can be
solicited for future translation efforts. The larger goal of the project is
collecting enough data in Gondi to build and deploy viable language
technologies like machine translation and speech to text systems that can help
take the language onto the internet.
Related papers
- Aya Dataset: An Open-Access Collection for Multilingual Instruction
Tuning [49.79783940841352]
Existing datasets are almost all in the English language.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z)
- Content-Localization based Neural Machine Translation for Informal
Dialectal Arabic: Spanish/French to Levantine/Gulf Arabic [5.2957928879391]
We propose a framework that localizes contents of high-resource languages to a low-resource language/dialects by utilizing AI power.
We are the first to provide a parallel translation dataset between informal Spanish and French and informal Arabic dialects, in both directions.
arXiv Detail & Related papers (2023-12-12T01:42:41Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Learnings from Technological Interventions in a Low Resource Language:
Enhancing Information Access in Gondi [10.096480120676878]
We create a corpus of more than 60,000 translations from Hindi to Gondi.
Gondi is a low-resource vulnerable language spoken by around 2.3 million tribal people in south and central India.
arXiv Detail & Related papers (2022-11-29T13:03:37Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- Building African Voices [125.92214914982753]
This paper focuses on speech synthesis for low-resourced African languages.
We create a set of general-purpose instructions on building speech synthesis systems with minimum technological resources.
We release the speech data, code, and trained voices for 12 African languages to support researchers and developers.
arXiv Detail & Related papers (2022-07-01T23:28:16Z)
- NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local
Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia.
Most languages in Indonesia are categorized as endangered and some are even extinct.
We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z)
- Not always about you: Prioritizing community needs when developing
endangered language technology [5.670857685983896]
We discuss the unique technological, cultural, practical, and ethical challenges that researchers and indigenous speech community members face.
We report the perspectives of language teachers, Master Speakers and elders from indigenous communities, as well as the point of view of academics.
arXiv Detail & Related papers (2022-04-12T05:59:39Z)
- Towards Building ASR Systems for the Next Billion Users [15.867823754118422]
We make contributions towards building ASR systems for low resource languages from the Indian subcontinent.
First, we curate 17,000 hours of raw speech data for 40 Indian languages.
Using this raw speech data we pretrain several variants of wav2vec style models for 40 Indian languages.
arXiv Detail & Related papers (2021-11-06T19:34:33Z)
- Cross-lingual Transfer for Speech Processing using Acoustic Language
Similarity [81.51206991542242]
Cross-lingual transfer offers a compelling way to help bridge this digital divide.
Current cross-lingual algorithms have shown success in text-based tasks and speech-related tasks over some low-resource languages.
We propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages.
arXiv Detail & Related papers (2021-11-02T01:55:17Z)
- The first large scale collection of diverse Hausa language datasets [0.0]
Hausa is considered a well-studied and documented language among the sub-Saharan African languages.
It is estimated that over 100 million people speak the language.
We provide an expansive collection of curated datasets consisting of both formal and informal forms of the language.
arXiv Detail & Related papers (2021-02-13T19:34:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.