Naamapadam: A Large-Scale Named Entity Annotated Data for Indic
Languages
- URL: http://arxiv.org/abs/2212.10168v2
- Date: Sun, 28 May 2023 06:26:45 GMT
- Title: Naamapadam: A Large-Scale Named Entity Annotated Data for Indic
Languages
- Authors: Arnav Mhaske, Harshit Kedia, Sumanth Doddapaneni, Mitesh M. Khapra,
Pratyush Kumar, Rudra Murthy V, Anoop Kunchukuttan
- Abstract summary: The dataset contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories.
The training dataset has been automatically created from the Samanantar parallel corpus.
We release IndicNER, a multilingual IndicBERT model fine-tuned on the Naamapadam training set.
- Score: 15.214673043019399
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Naamapadam, the largest publicly available Named Entity
Recognition (NER) dataset for the 11 major Indian languages from two language
families. The dataset contains more than 400k sentences annotated with a total
of at least 100k entities from three standard entity categories (Person,
Location, and Organization) for 9 out of the 11 languages. The training
dataset has been automatically created from the Samanantar parallel corpus by
projecting automatically tagged entities from an English sentence to the
corresponding Indian language translation. We also create manually annotated
test sets for 9 languages. We demonstrate the utility of the obtained dataset on
the Naamapadam-test dataset. We also release IndicNER, a multilingual IndicBERT
model fine-tuned on the Naamapadam training set. IndicNER achieves an F1 score of
more than 80 for 7 out of 9 test languages. The dataset and models are
available under open-source licences at
https://ai4bharat.iitm.ac.in/naamapadam.
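The projection step described in the abstract (carrying automatically tagged English entities over to the aligned Indian-language translation) can be illustrated with a minimal sketch. The abstract does not specify how the projection is carried out, so everything below is an assumption made for illustration: word alignments are taken as given `(english_index, target_index)` pairs, each entity is mapped to the contiguous target range covered by its aligned tokens, and the names `bio_spans` and `project_tags` are hypothetical.

```python
from typing import List, Tuple


def bio_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, label) spans from a BIO tag sequence."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


def project_tags(
    en_tags: List[str],
    alignment: List[Tuple[int, int]],
    tgt_len: int,
) -> List[str]:
    """Project English BIO tags onto target tokens through word alignments.

    Assumption (not from the paper): an entity is projected onto the smallest
    contiguous target range covering all of its aligned tokens, and a
    projection that would overlap an earlier one is skipped.
    """
    tgt_tags = ["O"] * tgt_len
    for start, end, label in bio_spans(en_tags):
        tgt_idx = sorted({t for s, t in alignment if start <= s < end})
        if not tgt_idx:
            continue  # entity has no aligned target token
        lo, hi = tgt_idx[0], tgt_idx[-1]
        if any(tag != "O" for tag in tgt_tags[lo:hi + 1]):
            continue  # avoid overwriting an already projected entity
        tgt_tags[lo] = "B-" + label
        for j in range(lo + 1, hi + 1):
            tgt_tags[j] = "I-" + label
    return tgt_tags


# English: "Sachin lives in Mumbai" -> a five-token translation (toy example).
en_tags = ["B-PER", "O", "O", "B-LOC"]
alignment = [(0, 0), (1, 3), (2, 2), (3, 1)]  # hypothetical alignment pairs
projected = project_tags(en_tags, alignment, tgt_len=5)
print(projected)
```

In practice the alignments would come from an automatic word aligner and are noisy, so a real pipeline would likely need filtering beyond the overlap check shown here.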
Related papers
- Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z)
- Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages [6.7638050195383075]
We analyze the challenges and propose techniques that can be tailored for Multilingual Named Entity Recognition for Indian languages.
We present human-annotated named entity corpora of 40K sentences for 4 Indian languages from two of the major Indian language families.
We achieve comparable performance on completely unseen benchmark datasets for Indian languages, which affirms the usability of our model.
arXiv Detail & Related papers (2024-05-08T05:54:54Z)
- Aya Dataset: An Open-Access Collection for Multilingual Instruction
Tuning [49.79783940841352]
Existing datasets are almost all in the English language.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z)
- MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity
Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development.
We create the largest human-annotated NER dataset for 20 African languages.
We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z)
- AsNER -- Annotated Dataset and Baseline for Assamese Named Entity
recognition [7.252817150901275]
The proposed NER dataset is likely to be a significant resource for deep neural based Assamese language processing.
We benchmark the dataset by training NER models and evaluating using state-of-the-art architectures for supervised named entity recognition.
The best baseline achieves an F1-score of 80.69% when using MuRIL as a word embedding method.
arXiv Detail & Related papers (2022-07-07T16:45:55Z)
- HiNER: A Large Hindi Named Entity Recognition Dataset [29.300418937509317]
This paper releases a standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 11 tags.
The statistics of tag-set in our dataset show a healthy per-tag distribution, especially for prominent classes like Person, Location and Organisation.
Our dataset helps achieve a weighted F1 score of 88.78 with all the tags and 92.22 when we collapse the tag-set, as discussed in the paper.
arXiv Detail & Related papers (2022-04-28T19:14:21Z)
- IndicNLG Suite: Multilingual Datasets for Diverse NLG Tasks in Indic
Languages [23.157951796614466]
In this paper, we present the IndicNLG suite, a collection of datasets for benchmarking Natural Language Generation for 11 Indic languages.
We focus on five diverse tasks, namely, biography generation using Wikipedia infoboxes (WikiBio), news headline generation, sentence summarization, question generation and paraphrase generation.
arXiv Detail & Related papers (2022-03-10T15:53:58Z)
- Challenge Dataset of Cognates and False Friend Pairs from Indian
Languages [54.6340870873525]
Cognates are present in multiple variants of the same text across different languages.
In this paper, we describe the creation of two cognate datasets for twelve Indian languages.
arXiv Detail & Related papers (2021-12-17T14:23:43Z)
- CL-NERIL: A Cross-Lingual Model for NER in Indian Languages [0.5926203312586108]
This paper proposes an end-to-end framework for NER for Indian languages.
We exploit parallel corpora of English and Indian languages and an English NER dataset.
We present manually annotated test sets for three Indian languages: Hindi, Bengali, and Gujarati.
arXiv Detail & Related papers (2021-11-23T12:09:15Z)
- MobIE: A German Dataset for Named Entity Recognition, Entity Linking and
Relation Extraction in the Mobility Domain [76.21775236904185]
The dataset consists of 3,232 social media texts and traffic reports with 91K tokens, and contains 20.5K annotated entities.
A subset of the dataset is human-annotated with seven mobility-related, n-ary relation types.
To the best of our knowledge, this is the first German-language dataset that combines annotations for NER, EL and RE.
arXiv Detail & Related papers (2021-08-16T08:21:50Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified, with over 11,000 speakers and over 60 accents.
CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.