Language Resources and Technologies for Non-Scheduled and Endangered
Indian Languages
- URL: http://arxiv.org/abs/2204.02822v1
- Date: Wed, 6 Apr 2022 13:33:24 GMT
- Authors: Ritesh Kumar, Bornini Lahiri
- Abstract summary: Survey of language resources and technologies available for non-scheduled and endangered languages of India.
Barring some of the 22 languages included in the 8th Schedule of the Indian Constitution, there is hardly any substantial resource or technology available for the rest of the languages.
- Score: 0.9137554315375919
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the present paper, we survey the language resources and
technologies available for the non-scheduled and endangered languages of India.
While there have been different estimates from different sources about the
number of languages in India, it could be assumed that there are more than
1,000 languages currently being spoken in India. However, barring some of the 22
languages included in the 8th Schedule of the Indian Constitution (called the
scheduled languages), there is hardly any substantial resource or technology
available for the rest of the languages. Nonetheless, there have been some
individual attempts at developing resources and technologies for the different
languages across the country. Of late, some financial support has also become
available for the endangered languages. In this paper, we give a summary of the
resources and technologies for those Indian languages which are not included in
the 8th schedule of the Indian Constitution and/or which are endangered.
Related papers
- Indigenous Languages Spoken in Argentina: A Survey of NLP and Speech Resources [45.07333085270152]
Argentina has a large yet little-known Indigenous linguistic diversity, encompassing at least 40 different languages.
We present a systematization of the Indigenous languages spoken in Argentina, classifying them into seven language families.
For each one, we present an estimation of the national Indigenous population size, based on the most recent Argentinian census.
arXiv Detail & Related papers (2025-01-17T03:47:19Z)
- Survey of Pseudonymization, Abstractive Summarization & Spell Checker for Hindi and Marathi [0.0]
The paper aims to build a platform that offers text anonymization, abstractive text summarization and spell checking in English, Hindi and Marathi.
These tools are intended for enterprise and consumer clients who predominantly use Indian regional languages.
arXiv Detail & Related papers (2024-12-24T04:51:32Z)
- A Review of the Marathi Natural Language Processing [0.0]
This paper presents a broad overview of the evolution of NLP research in Indic languages.
It focuses on Marathi and state-of-the-art resources and tools available to the research community.
arXiv Detail & Related papers (2024-12-20T00:56:13Z)
- LIMBA: An Open-Source Framework for the Preservation and Valorization of Low-Resource Languages using Generative Models [62.47865866398233]
This white paper proposes a framework to generate linguistic tools for low-resource languages.
By addressing the data scarcity that hinders intelligent applications for such languages, we contribute to promoting linguistic diversity.
arXiv Detail & Related papers (2024-11-20T16:59:41Z)
- BhasaAnuvaad: A Speech Translation Dataset for 13 Indian Languages [27.273651323572786]
We evaluate the performance of widely-used Automatic Speech Translation systems on Indian languages.
There is a striking absence of systems capable of accurately translating colloquial and informal language.
We introduce BhasaAnuvaad, the largest publicly available dataset for AST involving 13 out of 22 scheduled Indian languages and English.
arXiv Detail & Related papers (2024-11-07T13:33:34Z)
- IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages [17.862027695142825]
INDICVOICES is a dataset of natural and spontaneous speech from 16237 speakers covering 145 Indian districts and 22 languages.
1639 hours have already been transcribed, with a median of 73 hours per language.
All the data, tools, guidelines, models and other materials developed as a part of this work will be made publicly available.
arXiv Detail & Related papers (2024-03-04T10:42:08Z)
- SPRING-INX: A Multilingual Indian Language Speech Corpus by SPRING Lab, IIT Madras [1.4699314771635081]
Building speech based applications for the Indian population is a difficult problem owing to limited data and the number of languages and accents to accommodate.
We are open sourcing SPRING-INX data which has about 2000 hours of legally sourced and manually transcribed speech data for ASR system building in Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi and Tamil.
arXiv Detail & Related papers (2023-10-23T07:50:10Z)
- GlobalBench: A Benchmark for Global Progress in Natural Language Processing [114.24519009839142]
GlobalBench aims to track progress on all NLP datasets in all languages.
It tracks estimated per-speaker utility and equity of technology across all languages.
Currently, GlobalBench covers 966 datasets in 190 languages, and has 1,128 system submissions spanning 62 languages.
arXiv Detail & Related papers (2023-05-24T04:36:32Z)
- Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages [62.730361829175415]
MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge.
It focuses on ad hoc retrieval across 18 different languages.
Our goal is to spur research that will improve retrieval across a continuum of languages.
arXiv Detail & Related papers (2022-10-18T16:47:18Z)
- NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia.
Most languages in Indonesia are categorized as endangered and some are even extinct.
We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z)
- Factorization of Fact-Checks for Low Resource Indian Languages [44.94080515860928]
We introduce FactDRIL: the first large scale multilingual Fact-checking dataset for Regional Indian languages.
Our dataset consists of 9,058 samples in English and 5,155 samples in Hindi; the remaining 8,222 samples are distributed across various regional languages.
We expect this dataset will be a valuable resource and serve as a starting point to fight proliferation of fake news in low resource languages.
arXiv Detail & Related papers (2021-02-23T16:47:41Z)
- A Multilingual Parallel Corpora Collection Effort for Indian Languages [43.62422999765863]
We present sentence aligned parallel corpora across 10 Indian languages - Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, Punjabi, and English.
The corpora are compiled from online sources which have content shared across languages.
arXiv Detail & Related papers (2020-07-15T14:00:18Z)
- A Summary of the First Workshop on Language Technology for Language Documentation and Revitalization [70.14668193220528]
In August 2019, a workshop was held at Carnegie Mellon University to attempt to bring together language community members, documentary linguists, and technologists.
This paper reports the results of the workshop, including issues discussed, and various conceived and implemented technologies for nine languages.
arXiv Detail & Related papers (2020-04-27T22:55:55Z)