IndicIRSuite: Multilingual Dataset and Neural Information Models for
Indian Languages
- URL: http://arxiv.org/abs/2312.09508v1
- Date: Fri, 15 Dec 2023 03:19:53 GMT
- Title: IndicIRSuite: Multilingual Dataset and Neural Information Models for
Indian Languages
- Authors: Saiful Haq, Ashutosh Sharma, Pushpak Bhattacharyya
- Abstract summary: In this paper, we introduce Neural Information Retrieval resources for 11 widely spoken Indian languages.
These resources include (a) INDIC-MARCO, a multilingual version of the MSMARCO dataset in 11 Indian Languages created using Machine Translation, and (b) Indic-ColBERT, a collection of 11 distinct Monolingual Neural Information Retrieval models.
IndicIRSuite is the first attempt at building large-scale Neural Information Retrieval resources for a large number of Indian languages.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce Neural Information Retrieval resources for 11
widely spoken Indian Languages (Assamese, Bengali, Gujarati, Hindi, Kannada,
Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu) from two major Indian
language families (Indo-Aryan and Dravidian). These resources include (a)
INDIC-MARCO, a multilingual version of the MSMARCO dataset in 11 Indian
Languages created using Machine Translation, and (b) Indic-ColBERT, a
collection of 11 distinct Monolingual Neural Information Retrieval models, each
trained on one of the 11 languages in the INDIC-MARCO dataset. To the best of
our knowledge, IndicIRSuite is the first attempt at building large-scale Neural
Information Retrieval resources for a large number of Indian languages, and we
hope that it will help accelerate research in Neural IR for Indian Languages.
Experiments demonstrate that Indic-ColBERT achieves 47.47% improvement in the
MRR@10 score averaged over the INDIC-MARCO baselines for all 11 Indian
languages except Oriya, 12.26% improvement in the NDCG@10 score averaged over
the MIRACL Bengali and Hindi Language baselines, and 20% improvement in the
MRR@100 Score over the Mr.Tydi Bengali Language baseline. IndicIRSuite is
available at https://github.com/saifulhaq95/IndicIRSuite
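The abstract reports gains in MRR@10, NDCG@10, and MRR@100. As a reading aid, here is a minimal sketch of how such rank metrics are commonly computed. It is not taken from the IndicIRSuite code; the function names `mrr_at_k` and `ndcg_at_k` are hypothetical, relevance labels are assumed to arrive already ordered by the system's ranking, and the NDCG variant shown uses linear gain with the ideal ranking built only from the labels passed in.

```python
import math
from typing import Sequence


def mrr_at_k(ranked_relevance: Sequence[Sequence[float]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant hit within the top k.

    ranked_relevance holds one list per query: relevance labels in the
    order the system ranked the passages (assumption: label > 0 means relevant).
    """
    total = 0.0
    for rels in ranked_relevance:
        for rank, rel in enumerate(rels[:k], start=1):
            if rel > 0:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_relevance)


def ndcg_at_k(rels: Sequence[float], k: int = 10) -> float:
    """NDCG@k for a single query, linear-gain variant.

    Assumption: the ideal ordering is derived from the supplied labels only,
    so relevant passages missing from the list are not accounted for.
    """
    def dcg(values: Sequence[float]) -> float:
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(v / math.log2(i + 2) for i, v in enumerate(values))

    ideal = dcg(sorted(rels, reverse=True)[:k])
    return dcg(rels[:k]) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Three toy queries: first relevant hit at rank 2, at rank 1, and none.
    print(mrr_at_k([[0, 1, 0], [1, 0, 0], [0, 0, 0]], k=10))
    print(ndcg_at_k([0, 1], k=10))
```

Published evaluations typically rely on an established tool rather than hand-written metrics, so numbers from this sketch are not guaranteed to match the paper's figures exactly.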
Related papers
- BhasaAnuvaad: A Speech Translation Dataset for 13 Indian Languages [27.273651323572786]
We evaluate the performance of widely-used Automatic Speech Translation systems on Indian languages.
There is a striking absence of systems capable of accurately translating colloquial and informal language.
We introduce BhasaAnuvaad, the largest publicly available dataset for AST involving 13 out of 22 scheduled Indian languages and English.
arXiv Detail & Related papers (2024-11-07T13:33:34Z)
- Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z)
- Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages [6.7638050195383075]
We analyze the challenges and propose techniques that can be tailored for Multilingual Named Entity Recognition for Indian languages.
We present a human-annotated named entity corpus of 40K sentences for 4 Indian languages from two of the major Indian language families.
We achieve comparable performance on completely unseen benchmark datasets for Indian languages, which affirms the usability of our model.
arXiv Detail & Related papers (2024-05-08T05:54:54Z)
- SPRING-INX: A Multilingual Indian Language Speech Corpus by SPRING Lab, IIT Madras [1.4699314771635081]
Building speech-based applications for the Indian population is a difficult problem owing to limited data and the number of languages and accents to accommodate.
We are open-sourcing the SPRING-INX data, which has about 2000 hours of legally sourced and manually transcribed speech data for ASR system building in Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi and Tamil.
arXiv Detail & Related papers (2023-10-23T07:50:10Z)
- IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages [37.758476568195256]
India has a rich linguistic landscape with languages from 4 major language families spoken by over a billion people.
22 of these languages are listed in the Constitution of India (referred to as scheduled languages).
arXiv Detail & Related papers (2023-05-25T17:57:43Z)
- Summarizing Indian Languages using Multilingual Transformers based Models [13.062351454646912]
We study how these multilingual models perform on datasets that have Indian languages as source and target text.
We experiment with the IndicBART and mT5 models and report ROUGE-1, ROUGE-2, ROUGE-3 and ROUGE-4 scores as performance metrics.
arXiv Detail & Related papers (2023-03-29T13:05:17Z)
- NusaCrowd: Open Source Initiative for Indonesian NLP Resources [104.5381571820792]
NusaCrowd is a collaborative initiative to collect and unify existing resources for Indonesian languages.
Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
arXiv Detail & Related papers (2022-12-19T17:28:22Z)
- Challenge Dataset of Cognates and False Friend Pairs from Indian Languages [54.6340870873525]
Cognates are present in multiple variants of the same text across different languages.
In this paper, we describe the creation of two cognate datasets for twelve Indian languages.
arXiv Detail & Related papers (2021-12-17T14:23:43Z)
- Harnessing Cross-lingual Features to Improve Cognate Detection for Low-resource Languages [50.82410844837726]
We demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages.
We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages.
We observe an improvement of up to 18 percentage points in F-score for cognate detection.
arXiv Detail & Related papers (2021-12-16T11:17:58Z)
- A Multilingual Parallel Corpora Collection Effort for Indian Languages [43.62422999765863]
We present sentence aligned parallel corpora across 10 Indian languages - Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, Punjabi, and English.
The corpora are compiled from online sources which have content shared across languages.
arXiv Detail & Related papers (2020-07-15T14:00:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.