A Multi-way Parallel Named Entity Annotated Corpus for English, Tamil and Sinhala
- URL: http://arxiv.org/abs/2412.02056v2
- Date: Tue, 14 Jan 2025 21:02:56 GMT
- Title: A Multi-way Parallel Named Entity Annotated Corpus for English, Tamil and Sinhala
- Authors: Surangika Ranathunga, Asanka Ranasinghe, Janaka Shamal, Ayodya Dandeniya, Rashmi Galappaththi, Malithi Samaraweera
- Abstract summary: This paper presents a parallel English-Tamil-Sinhala corpus annotated with Named Entities (NEs).
Using pre-trained multilingual Language Models (mLMs), we establish new benchmark Named Entity Recognition (NER) results on this dataset for Sinhala and Tamil.
- Score: 0.8675380166590487
- Abstract: This paper presents a multi-way parallel English-Tamil-Sinhala corpus annotated with Named Entities (NEs), where Sinhala and Tamil are low-resource languages. Using pre-trained multilingual Language Models (mLMs), we establish new benchmark Named Entity Recognition (NER) results on this dataset for Sinhala and Tamil. We also carry out a detailed investigation on the NER capabilities of different types of mLMs. Finally, we demonstrate the utility of our NER system on a low-resource Neural Machine Translation (NMT) task. Our dataset is publicly released: https://github.com/suralk/multiNER.
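The abstract does not spell out the fine-tuning recipe, but establishing NER benchmarks with pre-trained mLMs typically means token classification over sub-words. Below is a minimal sketch of such a setup with Hugging Face transformers, assuming an XLM-R backbone; the label set, file name, and data layout are illustrative assumptions, not details taken from the paper.

```python
from datasets import load_dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]  # assumed tag set

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(LABELS))

def tokenize_and_align(example):
    # Tokenize pre-split words and copy each word-level tag to its first
    # sub-word; continuation sub-words get -100 so the loss ignores them.
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    labels, prev = [], None
    for word_id in enc.word_ids():
        labels.append(-100 if word_id is None or word_id == prev
                      else example["ner_tags"][word_id])
        prev = word_id
    enc["labels"] = labels
    return enc

# Hypothetical file name and layout: one JSON record per sentence with
# "tokens" (a list of words) and "ner_tags" (a list of label indices).
dataset = load_dataset("json", data_files={"train": "si_ner_train.json"})
dataset = dataset.map(tokenize_and_align)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="xlmr-sinhala-ner", num_train_epochs=3),
    train_dataset=dataset["train"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
).train()
```

The same recipe would apply unchanged to the Tamil split by pointing data_files at the corresponding file.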
Related papers
- "I've Heard of You!": Generate Spoken Named Entity Recognition Data for Unseen Entities [59.22329574700317]
Spoken named entity recognition (NER) aims to identify named entities from speech.
New named entities appear every day; however, annotating spoken NER data for them is costly.
We propose a method for generating Spoken NER data based on a named entity dictionary (NED) to reduce costs.
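As a rough, text-side illustration of NED-driven generation (the method's specifics are not given above, and all names and spans below are invented), one can splice dictionary entities of the matching tag into annotated carrier sentences and recompute the spans; a TTS system would then synthesise the audio side.

```python
import random

# Toy named entity dictionary (NED); a real one is far larger.
NED = {"PER": ["Ada Lovelace", "Alan Turing"], "LOC": ["Colombo", "Jaffna"]}

def generate(template, spans, rng=random):
    """Replace each (start, end, tag) span in `template` with a dictionary
    entity of the same tag, returning the new text and adjusted spans."""
    pieces, new_spans, cursor, offset = [], [], 0, 0
    for start, end, tag in sorted(spans):
        pieces.append(template[cursor:start])
        repl = rng.choice(NED[tag])
        new_start = start + offset
        new_spans.append((new_start, new_start + len(repl), tag))
        offset += len(repl) - (end - start)  # track length changes
        pieces.append(repl)
        cursor = end
    pieces.append(template[cursor:])
    return "".join(pieces), new_spans

text, spans = generate("I met Marie Curie in Paris.",
                       [(6, 17, "PER"), (21, 26, "LOC")])
print(text, spans)  # text side only; TTS would produce the spoken side
```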
arXiv Detail & Related papers (2024-12-26T07:43:18Z) - LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation [67.24113079928668]
We present LexMatcher, a method for data curation driven by the coverage of senses found in bilingual dictionaries.
Our approach outperforms the established baselines on the WMT2022 test sets.
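The summary leaves the selection criterion implicit; one plausible toy reading of dictionary-coverage-driven curation is a greedy filter that keeps a parallel pair only when it covers a (source word, translation) sense not yet seen. The dictionary and sentence pairs below are invented.

```python
# Toy bilingual dictionary: each (source word, translation) pair is one sense.
SENSES = {("bank", "Ufer"), ("bank", "Bank"), ("spring", "Feder")}

def select(pairs):
    """Greedily keep parallel pairs that cover previously unseen senses."""
    kept, covered = [], set()
    for src, tgt in pairs:
        new = {(w, t) for (w, t) in SENSES
               if w in src.split() and t in tgt.split() and (w, t) not in covered}
        if new:
            kept.append((src, tgt))
            covered |= new
    return kept

data = [("the bank of the river", "das Ufer des Flusses"),
        ("a bank loan", "ein Kredit der Bank")]
print(select(data))  # both pairs kept: each covers a distinct sense of "bank"
```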
arXiv Detail & Related papers (2024-06-03T15:30:36Z) - Enhancing Low Resource NER Using Assisting Language And Transfer Learning [0.7340017786387767]
We use base BERT, ALBERT, and RoBERTa to train supervised NER models.
We show that models trained using multiple languages perform better than those trained on a single language.
arXiv Detail & Related papers (2023-06-10T16:31:04Z) - MULTI3NLU++: A Multilingual, Multi-Intent, Multi-Domain Dataset for Natural Language Understanding in Task-Oriented Dialogue [115.32009638844059]
We extend the English-only NLU++ dataset to include manual translations into a range of high-, medium-, and low-resource languages.
Because of its multi-intent property, MULTI3NLU++ represents complex and natural user goals.
We use MULTI3NLU++ to benchmark state-of-the-art multilingual models for the Natural Language Understanding tasks of intent detection and slot labelling.
arXiv Detail & Related papers (2022-12-20T17:34:25Z) - MANER: Mask Augmented Named Entity Recognition for Extreme Low-Resource Languages [27.812329651072343]
We introduce Mask Augmented Named Entity Recognition (MANER) for low-resource languages.
MANER re-purposes the <mask> token for NER prediction. Specifically, we prepend the <mask> token to every word in a sentence for which we would like to predict the named entity tag.
Experiments show that, for 100 languages with as few as 100 training examples each, MANER improves on state-of-the-art methods by up to 48% and by 12% on average in F1 score.
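To make the mask-prepending idea concrete, here is a speculative sketch (not the authors' released code): insert the model's <mask> token before the target word and classify the hidden state at the mask position. The seven-tag set is assumed, and the linear head shown is untrained; in practice both it and the encoder would be fine-tuned.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")
tag_head = torch.nn.Linear(encoder.config.hidden_size, 7)  # 7 assumed NER tags

def tag_word(words, i):
    # Insert <mask> immediately before word i, e.g. "She met <mask> Barack ...".
    marked = words[:i] + [tok.mask_token] + words[i:]
    enc = tok(" ".join(marked), return_tensors="pt")
    mask_pos = (enc.input_ids[0] == tok.mask_token_id).nonzero()[0, 0]
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state[0, mask_pos]
    return tag_head(hidden).argmax().item()  # index into the assumed tag set

print(tag_word(["She", "met", "Barack", "Obama", "today"], 2))
```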
arXiv Detail & Related papers (2022-12-19T18:49:50Z) - AsNER -- Annotated Dataset and Baseline for Assamese Named Entity Recognition [7.252817150901275]
The proposed NER dataset is likely to be a significant resource for deep neural network-based processing of Assamese.
We benchmark the dataset by training NER models and evaluating using state-of-the-art architectures for supervised named entity recognition.
The strongest baseline reaches 80.69% F1 when using MuRIL as the word embedding method.
arXiv Detail & Related papers (2022-07-07T16:45:55Z) - L3Cube-MahaNER: A Marathi Named Entity Recognition Dataset and BERT models [0.7874708385247353]
We focus on Marathi, an Indian language spoken prominently in the state of Maharashtra.
We present L3Cube-MahaNER, the first major gold standard named entity recognition dataset in Marathi.
Finally, we benchmark the dataset on CNN-, LSTM-, and Transformer-based models such as mBERT, XLM-RoBERTa, IndicBERT, and MahaBERT.
arXiv Detail & Related papers (2022-04-12T18:32:15Z) - Mono vs Multilingual BERT: A Case Study in Hindi and Marathi Named Entity Recognition [0.7874708385247353]
We consider NER for low-resource Indian languages like Hindi and Marathi.
We consider different variants of BERT, namely base BERT, RoBERTa, and ALBERT, and benchmark them on publicly available Hindi and Marathi NER datasets.
We show that the monolingual MahaRoBERTa model performs the best for Marathi NER whereas the multilingual XLM-RoBERTa performs the best for Hindi NER.
arXiv Detail & Related papers (2022-03-24T07:50:41Z) - Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource languages to low-resource ones.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
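The abstract names only the ingredients; the hypothetical sketch below shows a plain teacher-student distillation round over unlabeled target-language batches, with a fixed confidence threshold standing in for the paper's reinforcement-learned instance selector.

```python
import torch

def distill(teacher, student, unlabeled_batches, optimizer, threshold=0.9):
    # One round: the teacher pseudo-labels unlabeled target-language data,
    # and the student trains on confident predictions only. A learned
    # selection policy (the paper's RL part) would replace the threshold.
    teacher.eval()
    for batch in unlabeled_batches:
        with torch.no_grad():
            probs = torch.softmax(teacher(batch), dim=-1)
        conf, pseudo = probs.max(dim=-1)
        keep = conf > threshold
        if keep.any():
            loss = torch.nn.functional.cross_entropy(
                student(batch)[keep], pseudo[keep])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student  # in the iterative variant, this becomes the next teacher

# Toy stand-ins: linear "taggers" over 5 features and 3 tags, one random batch.
teacher, student = torch.nn.Linear(5, 3), torch.nn.Linear(5, 3)
distill(teacher, student, [torch.randn(8, 5)],
        torch.optim.SGD(student.parameters(), lr=0.1), threshold=0.3)
```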
arXiv Detail & Related papers (2021-06-01T05:46:22Z) - Multilingual Autoregressive Entity Linking [49.35994386221958]
mGENRE is a sequence-to-sequence system for the Multilingual Entity Linking problem.
For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token.
We show the efficacy of our approach through extensive evaluation including experiments on three popular MEL benchmarks.
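To illustrate the token-by-token formulation, here is a toy, model-free sketch of trie-constrained decoding in the spirit of mGENRE: at each step, only tokens that keep the output a prefix of some valid entity name are allowed. The dummy scorer stands in for a trained seq2seq model.

```python
def build_trie(names):
    # Map each entity name into a nested dict, one node per word token.
    trie = {}
    for name in names:
        node = trie
        for tok in name.split():
            node = node.setdefault(tok, {})
        node["<eos>"] = {}  # marks a complete entity name
    return trie

def decode(trie, score):
    # Greedy decoding restricted to continuations present in the trie.
    out, node = [], trie
    while True:
        allowed = list(node)
        tok = max(allowed, key=lambda t: score(out, t))
        if tok == "<eos>":
            return " ".join(out)
        out.append(tok)
        node = node[tok]

trie = build_trie(["Sri Lanka", "Sri Jayawardenepura Kotte"])
print(decode(trie, lambda prefix, tok: -len(tok)))  # dummy scorer -> "Sri Lanka"
```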
arXiv Detail & Related papers (2021-03-23T13:25:55Z) - Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
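As a loose PyTorch sketch of the translate-and-reconstruct idea (layer sizes, parameter sharing, and the training loop are all assumptions), a single LSTM encoder feeds two decoders, one translating and one reconstructing the source, and its hidden states serve as the contextualised embeddings.

```python
import torch
import torch.nn as nn

class TranslateReconstruct(nn.Module):
    """One LSTM encoder whose final state seeds two decoders: one translates,
    one reconstructs the source. The encoder's per-token hidden states are
    taken as contextualised cross-lingual word embeddings."""
    def __init__(self, src_vocab=1000, tgt_vocab=1000, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.translate = nn.LSTM(dim, dim, batch_first=True)
        self.reconstruct = nn.LSTM(dim, dim, batch_first=True)
        self.to_tgt = nn.Linear(dim, tgt_vocab)
        self.to_src = nn.Linear(dim, src_vocab)

    def forward(self, src, tgt_shifted, src_shifted):
        states, last = self.encoder(self.src_emb(src))
        trans, _ = self.translate(self.tgt_emb(tgt_shifted), last)
        recon, _ = self.reconstruct(self.src_emb(src_shifted), last)
        return self.to_tgt(trans), self.to_src(recon), states

model = TranslateReconstruct()
src = torch.randint(0, 1000, (2, 7))         # toy batch of source token ids
logits_t, logits_r, embeddings = model(src, src, src)
print(embeddings.shape)  # (2, 7, 256): per-token contextual embeddings
```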