AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for
Indic Languages
- URL: http://arxiv.org/abs/2005.00085v1
- Date: Thu, 30 Apr 2020 20:21:02 GMT
- Title: AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for
Indic Languages
- Authors: Anoop Kunchukuttan, Divyanshu Kakwani, Satish Golla, Gokul N.C., Avik
Bhattacharyya, Mitesh M. Khapra, Pratyush Kumar
- Abstract summary: We present the IndicNLP corpus, a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages.
We share pre-trained word embeddings trained on these corpora.
We show that the IndicNLP embeddings significantly outperform publicly available pre-trained embeddings on multiple evaluation tasks.
- Score: 15.425783311152117
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present the IndicNLP corpus, a large-scale, general-domain corpus
containing 2.7 billion words for 10 Indian languages from two language
families. We share pre-trained word embeddings trained on these corpora. We
create news article category classification datasets for 9 languages to
evaluate the embeddings. We show that the IndicNLP embeddings significantly
outperform publicly available pre-trained embeddings on multiple evaluation
tasks. We hope that the availability of the corpus will accelerate Indic NLP
research. The resources are available at
https://github.com/ai4bharat-indicnlp/indicnlp_corpus.
Related papers
- Low-Resource Named Entity Recognition with Cross-Lingual, Character-Level Neural Conditional Random Fields [68.17213992395041]
Low-resource named entity recognition is still an open problem in NLP.
We present a transfer learning scheme, whereby we train character-level neural CRFs to predict named entities for both high-resource and low-resource languages jointly.
arXiv Detail & Related papers (2024-04-14T23:44:49Z)
- DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages [49.38663048447942]
We propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties.
This allows for a comprehensive evaluation of NLP system performance on different language varieties.
We provide substantial evidence of performance disparities between standard and non-standard language varieties.
arXiv Detail & Related papers (2024-03-16T20:18:36Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Harnessing Cross-lingual Features to Improve Cognate Detection for
Low-resource Languages [50.82410844837726]
We demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages.
We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages.
We observe an improvement of up to 18 percentage points in F-score for cognate detection.
arXiv Detail & Related papers (2021-12-16T11:17:58Z)
- IndicBART: A Pre-trained Model for Natural Language Generation of Indic
Languages [24.638109544527104]
IndicBART is a multilingual, sequence-to-sequence pre-trained model focusing on 11 Indic languages and English.
We evaluate IndicBART on two NLG tasks: Neural Machine Translation (NMT) and extreme summarization.
arXiv Detail & Related papers (2021-09-07T07:08:33Z)
- ParCourE: A Parallel Corpus Explorer for a Massively Multilingual Corpus [2.7036498789349244]
Researching typological properties of languages is fundamental for progress in multilingual NLP.
We provide ParCourE, an online tool that allows users to browse a word-aligned parallel corpus covering 1334 languages.
arXiv Detail & Related papers (2021-07-14T12:16:21Z)
- Samanantar: The Largest Publicly Available Parallel Corpora Collection
for 11 Indic Languages [4.3857077920223295]
Samanantar is the largest publicly available parallel corpora collection for Indic languages.
The collection contains a total of 49.7 million sentence pairs between English and 11 Indic languages.
arXiv Detail & Related papers (2021-04-12T16:18:20Z)
- Monolingual and Parallel Corpora for Kangri Low Resource Language [0.0]
This paper presents a dataset for Kangri (ISO 639-3: xnr), a low-resource, endangered Himachali language listed by the United Nations Educational, Scientific and Cultural Organization (UNESCO).
The corpus contains 1,81,552 Monolingual and 27,362 Hindi-Kangri Parallel corpora.
arXiv Detail & Related papers (2021-03-22T05:52:51Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Anchor-based Bilingual Word Embeddings for Low-Resource Languages [76.48625630211943]
Good quality monolingual word embeddings (MWEs) can be built for languages which have large amounts of unlabeled text.
MWEs can be aligned to bilingual spaces using only a few thousand word translation pairs.
This paper proposes a new approach for building BWEs in which the vector space of the high resource source language is used as a starting point.
arXiv Detail & Related papers (2020-10-23T19:17:00Z)
- A Multilingual Parallel Corpora Collection Effort for Indian Languages [43.62422999765863]
We present sentence aligned parallel corpora across 10 Indian languages - Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, Punjabi, and English.
The corpora are compiled from online sources which have content shared across languages.
arXiv Detail & Related papers (2020-07-15T14:00:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.