Learnings from Technological Interventions in a Low Resource Language:
Enhancing Information Access in Gondi
- URL: http://arxiv.org/abs/2211.16172v1
- Date: Tue, 29 Nov 2022 13:03:37 GMT
- Title: Learnings from Technological Interventions in a Low Resource Language:
Enhancing Information Access in Gondi
- Authors: Devansh Mehta, Harshita Diddee, Ananya Saxena, Anurag Shukla, Sebastin
Santy, Ramaravind Kommiya Mothilal, Brij Mohan Lal Srivastava, Alok Sharma,
Vishnu Prasad, Venkanna U, Kalika Bali
- Abstract summary: We create a corpus of more than 60,000 translations from Hindi to Gondi.
Gondi is a low-resource vulnerable language spoken by around 2.3 million tribal people in south and central India.
- Score: 10.096480120676878
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The primary obstacle to developing technologies for low-resource languages is
the lack of representative, usable data. In this paper, we report the
deployment of technology-driven data collection methods for creating a corpus
of more than 60,000 translations from Hindi to Gondi, a low-resource vulnerable
language spoken by around 2.3 million tribal people in south and central India.
During this process, we help expand information access in Gondi along two
dimensions: (a) the creation of linguistic resources that can be used
by the community, such as a dictionary, children's stories, Gondi translations
from multiple sources, and an Interactive Voice Response (IVR) based mass
awareness platform; and (b) enabling its use in the digital domain by developing a
Hindi-Gondi machine translation model, which is compressed by nearly 4 times to
enable its deployment on low-resource edge devices and in areas of little
to no internet connectivity. We also present preliminary evaluations of
utilizing the developed machine translation model to provide assistance to
volunteers who are involved in collecting more data for the target language.
Through these interventions, we not only created a refined and evaluated corpus
of 26,240 Hindi-Gondi translations that was used for building the translation
model but also engaged nearly 850 community members who can help take Gondi
onto the internet.
Related papers
- Open the Data! Chuvash Datasets [50.59120569845975]
We introduce four comprehensive datasets for the Chuvash language.
These datasets include a monolingual dataset, a parallel dataset with Russian, a parallel dataset with English, and an audio dataset.
arXiv Detail & Related papers (2024-05-31T07:51:19Z)
- A Tulu Resource for Machine Translation [3.038642416291856]
We present the first parallel dataset for English-Tulu translation.
Tulu is spoken by approximately 2.5 million individuals in southwestern India.
Our English-Tulu system, trained without using parallel English-Tulu data, outperforms Google Translate by 19 BLEU points.
arXiv Detail & Related papers (2024-03-28T04:30:07Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages [62.730361829175415]
MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge.
It focuses on ad hoc retrieval across 18 different languages.
Our goal is to spur research that will improve retrieval across a continuum of languages.
arXiv Detail & Related papers (2022-10-18T16:47:18Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- Towards Building ASR Systems for the Next Billion Users [15.867823754118422]
We make contributions towards building ASR systems for low resource languages from the Indian subcontinent.
First, we curate 17,000 hours of raw speech data for 40 Indian languages.
Using this raw speech data we pretrain several variants of wav2vec style models for 40 Indian languages.
arXiv Detail & Related papers (2021-11-06T19:34:33Z)
- Cross-lingual Transfer for Speech Processing using Acoustic Language Similarity [81.51206991542242]
Cross-lingual transfer offers a compelling way to help bridge this digital divide.
Current cross-lingual algorithms have shown success in text-based tasks and speech-related tasks over some low-resource languages.
We propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages.
arXiv Detail & Related papers (2021-11-02T01:55:17Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource language to languages with low resources.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- The first large scale collection of diverse Hausa language datasets [0.0]
Hausa is considered a well-studied and documented language among the sub-Saharan African languages.
It is estimated that over 100 million people speak the language.
We provide an expansive collection of curated datasets consisting of both formal and informal forms of the language.
arXiv Detail & Related papers (2021-02-13T19:34:20Z)
- Learnings from Technological Interventions in a Low Resource Language: A Case-Study on Gondi [13.9876704685177]
Gondi is a low-resource vulnerable language spoken by around 2.3 million tribal people in south and central India.
At the end of these interventions, we collected a little less than 12,000 translated words and/or sentences.
The larger goal of the project is collecting enough data in Gondi to build and deploy viable language technologies.
arXiv Detail & Related papers (2020-04-21T20:03:57Z)