Multilingual Coreference Resolution in Low-resource South Asian Languages
- URL: http://arxiv.org/abs/2402.13571v2
- Date: Sat, 23 Mar 2024 08:22:58 GMT
- Title: Multilingual Coreference Resolution in Low-resource South Asian Languages
- Authors: Ritwik Mishra, Pooja Desur, Rajiv Ratn Shah, Ponnurangam Kumaraguru
- Abstract summary: We introduce a Translated dataset for Multilingual Coreference Resolution (TransMuCoRes) in 31 South Asian languages.
Nearly all of the predicted translations successfully pass a sanity check, and 75% of English references align with their predicted translations.
This study is the first to evaluate an end-to-end coreference resolution model on a Hindi golden set.
- Score: 36.31301773167754
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Coreference resolution involves identifying text spans within a discourse that refer to the same real-world entity. While this task has been extensively explored in English, publicly accessible resources and models for coreference resolution in South Asian languages remain scarce. We introduce a Translated dataset for Multilingual Coreference Resolution (TransMuCoRes) in 31 South Asian languages, built using off-the-shelf tools for translation and word alignment. Nearly all of the predicted translations successfully pass a sanity check, and 75% of English references align with their predicted translations. Using multilingual encoders, two off-the-shelf coreference resolution models were trained on a concatenation of TransMuCoRes and a manually annotated Hindi coreference resolution dataset. The best-performing model achieved LEA F1 and CoNLL F1 scores of 64 and 68, respectively, on our test split of the Hindi golden set. This study is the first to evaluate an end-to-end coreference resolution model on a Hindi golden set. Furthermore, this work underscores the limitations of current coreference evaluation metrics when applied to datasets with split antecedents, advocating for the development of more suitable evaluation metrics.
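The abstract describes building TransMuCoRes by translating English coreference data and running off-the-shelf word alignment, which implies projecting annotated mention spans onto the translations. The exact projection step is not spelled out here, so the following is a minimal sketch under that assumption; the span representation, alignment format, and policy for unaligned mentions are illustrative choices, not the authors' exact method.

```python
# Illustrative sketch: projecting English coreference mention spans onto a
# translated sentence through word alignments. Assumes a TransMuCoRes-style
# pipeline (translate, word-align, project); the paper's procedure may differ.

def project_mention(mention, alignment):
    """Map an English token span (start, end), end-exclusive, to the target side.

    `alignment` is a list of (src_idx, tgt_idx) pairs from any word aligner.
    Returns the smallest contiguous target span covering the aligned tokens,
    or None if no source token in the span is aligned.
    """
    start, end = mention
    tgt_positions = [t for s, t in alignment if start <= s < end]
    if not tgt_positions:
        return None  # unaligned mention: dropped (one possible policy)
    return (min(tgt_positions), max(tgt_positions) + 1)


def project_clusters(clusters, alignment):
    """Project every mention of every cluster; keep clusters with 2+ mentions."""
    projected = []
    for cluster in clusters:
        spans = [p for m in cluster if (p := project_mention(m, alignment)) is not None]
        if len(spans) > 1:
            projected.append(spans)
    return projected


if __name__ == "__main__":
    # Toy example: English "John saw his dog ." aligned to a 5-token translation.
    alignment = [(0, 0), (1, 2), (2, 3), (3, 4), (4, 5)]
    clusters = [[(0, 1), (2, 3)]]                  # one cluster: {John, his}
    print(project_clusters(clusters, alignment))   # -> [[(0, 1), (3, 4)]]
```

A sanity check like the one reported in the abstract could then be as simple as verifying that every projected span stays within the bounds of the translated sentence and that no cluster collapses to a single mention.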
Related papers
- Exploring Multiple Strategies to Improve Multilingual Coreference Resolution in CorefUD [0.0]
This paper presents our end-to-end neural coreference resolution system.
We first establish strong baseline models, including monolingual and cross-lingual variations.
We propose several extensions to enhance performance across diverse linguistic contexts.
arXiv Detail & Related papers (2024-08-29T20:27:05Z)
- MahaSQuAD: Bridging Linguistic Divides in Marathi Question-Answering [0.4194295877935868]
This research aims to address the absence of efficient QnA datasets in low-resource languages.
We introduce MahaSQuAD, the first-ever full SQuAD dataset for the Indic language Marathi, consisting of 118,516 training, 11,873 validation, and 11,803 test samples.
arXiv Detail & Related papers (2024-04-20T12:16:35Z)
- Translation Errors Significantly Impact Low-Resource Languages in Cross-Lingual Learning [26.49647954587193]
In this work, we find that translation inconsistencies do exist and that they disproportionately impact low-resource languages in XNLI.
To identify such inconsistencies, we propose measuring the gap in performance between zero-shot evaluations on the human-translated and machine-translated target text (a minimal sketch of this measurement appears after this list).
We also corroborate that translation errors exist for two target languages, namely Hindi and Urdu, by doing a manual reannotation of human-translated test instances.
arXiv Detail & Related papers (2024-02-03T08:22:51Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low- and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short of translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
- Self-Attention with Cross-Lingual Position Representation [112.05807284056337]
Position encoding (PE) is used to preserve word order information for natural language processing tasks, generating fixed position indices for input sequences.
Due to word order divergences across languages, modeling cross-lingual positional relationships might help self-attention networks (SANs) tackle this problem.
We augment SANs with cross-lingual position representations to model the bilingually aware latent structure of the input sentence.
arXiv Detail & Related papers (2020-04-28T05:23:43Z)
- Practical Comparable Data Collection for Low-Resource Languages via Images [126.64069379167975]
We propose a method of curating high-quality comparable training data for low-resource languages with monolingual annotators.
Our method uses a carefully selected set of images as a pivot between the source and target languages, collecting captions for these images in both languages independently.
Human evaluations on the English-Hindi comparable corpora created with our method show that 81.1% of the pairs are acceptable translations, and only 2.47% of the pairs are not translations at all.
arXiv Detail & Related papers (2020-04-24T19:30:38Z)
- Investigating Language Impact in Bilingual Approaches for Computational Language Documentation [28.838960956506018]
This paper investigates how the choice of translation language affects subsequent documentation work.
We create 56 bilingual pairs that we apply to the task of low-resource unsupervised word segmentation and alignment.
Our results suggest that incorporating clues into the neural models' input representation increases their translation and alignment quality.
arXiv Detail & Related papers (2020-03-30T10:30:34Z)
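As referenced in the Translation Errors entry above, the proposed diagnostic is a performance gap between zero-shot evaluations on human-translated and machine-translated versions of the same test text. Below is a minimal sketch of that comparison; the label encoding and prediction lists are hypothetical stand-ins for real XNLI-style evaluation outputs, not the paper's code.

```python
# Minimal sketch of the human-vs-machine-translation gap measurement described
# in "Translation Errors Significantly Impact Low-Resource Languages in
# Cross-Lingual Learning". Assumes zero-shot predictions from the SAME model on
# two parallel test sets; the concrete data below is hypothetical.

def accuracy(gold_labels, predicted_labels):
    """Fraction of examples where the prediction matches the gold label."""
    assert len(gold_labels) == len(predicted_labels)
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)


def translation_gap(gold, preds_on_human_mt, preds_on_machine_mt):
    """Gap: accuracy on human-translated minus accuracy on machine-translated text."""
    return accuracy(gold, preds_on_human_mt) - accuracy(gold, preds_on_machine_mt)


if __name__ == "__main__":
    # Toy XNLI-style labels: 0 = entailment, 1 = neutral, 2 = contradiction.
    gold = [0, 1, 2, 0, 1, 2]
    preds_on_human_translations = [0, 1, 2, 0, 1, 1]    # 5/6 correct
    preds_on_machine_translations = [0, 1, 1, 0, 2, 1]  # 3/6 correct
    print(translation_gap(gold, preds_on_human_translations,
                          preds_on_machine_translations))  # ~0.333
```

A large positive gap for a given target language is then read as evidence that machine-translation errors in the test data, rather than the model alone, account for part of the performance drop.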