How Good is Zero-Shot MT Evaluation for Low Resource Indian Languages?
- URL: http://arxiv.org/abs/2406.03893v1
- Date: Thu, 6 Jun 2024 09:28:08 GMT
- Title: How Good is Zero-Shot MT Evaluation for Low Resource Indian Languages?
- Authors: Anushka Singh, Ananya B. Sai, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan, Mitesh M Khapra
- Abstract summary: We study zero-shot evaluation for low-resource Indian languages, namely Assamese, Kannada, Maithili, and Punjabi.
We observe that even for learned metrics, which are known to exhibit zero-shot performance, the Kendall Tau and Pearson correlations with human annotations are only as high as 0.32 and 0.45.
- Score: 35.368257850926184
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While machine translation evaluation has been studied primarily for high-resource languages, there has been recent interest in evaluation for low-resource languages due to the increasing availability of data and models. In this paper, we study a zero-shot evaluation setting for low-resource Indian languages, namely Assamese, Kannada, Maithili, and Punjabi. We collect sufficient Multi-Dimensional Quality Metrics (MQM) and Direct Assessment (DA) annotations to create test sets and meta-evaluate a plethora of automatic evaluation metrics. We observe that even for learned metrics, which are known to exhibit zero-shot performance, the Kendall Tau and Pearson correlations with human annotations are only as high as 0.32 and 0.45. Synthetic data approaches show mixed results and overall do not help close the gap by much for these languages. This indicates that there is still a long way to go for low-resource evaluation.
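The meta-evaluation described in the abstract correlates automatic metric scores with human annotations. As a rough illustration only, the sketch below computes a Pearson correlation and a simple Kendall Tau (the tau-a variant, which ignores tie corrections) in plain Python. The score lists are made-up toy values, the function names are hypothetical, and the paper's exact correlation variant and data are not given in this summary.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # Tied pairs (product == 0) count toward neither total.
    total_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total_pairs


# Toy, made-up scores for five translated segments (not data from the paper).
human_scores = [0.9, 0.4, 0.7, 0.2, 0.6]
metric_scores = [0.8, 0.5, 0.6, 0.4, 0.7]

print("Pearson:", pearson(human_scores, metric_scores))
print("Kendall tau-a:", kendall_tau_a(human_scores, metric_scores))
```

A metric whose ranking of segments matches the human ranking would score close to 1 on both measures; the reported ceilings of 0.32 and 0.45 sit well below that.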
Related papers
- On the Evaluation Practices in Multilingual NLP: Can Machine Translation Offer an Alternative to Human Translations? [19.346078451375693]
We present an analysis of existing evaluation frameworks in NLP.
We propose several directions for more robust and reliable evaluation practices.
We show that simpler baselines can achieve relatively strong performance without having benefited from large-scale multilingual pretraining.
arXiv Detail & Related papers (2024-06-20T12:46:12Z)
- An Empirical Study on the Robustness of Massively Multilingual Neural Machine Translation [40.08063412966712]
Massively multilingual neural machine translation (MMNMT) has been proven to enhance the translation quality of low-resource languages.
We create a robustness evaluation benchmark dataset for Indonesian-Chinese translation.
This dataset is automatically translated into Chinese using four NLLB-200 models of different sizes.
arXiv Detail & Related papers (2024-05-13T12:01:54Z)
- Suvach -- Generated Hindi QA benchmark [0.0]
This paper proposes a new benchmark specifically designed for evaluating Hindi EQA models.
This method leverages large language models (LLMs) to generate a high-quality dataset in an extractive setting.
arXiv Detail & Related papers (2024-04-30T04:19:17Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- A Novel Self-training Approach for Low-resource Speech Recognition [15.612232220719653]
We propose a self-training approach for automatic speech recognition (ASR) for low-resource settings.
Our approach significantly improves word error rate, achieving a relative improvement of 14.94%.
Our proposed approach reports the best results on the Common Voice Punjabi dataset.
arXiv Detail & Related papers (2023-08-10T01:02:45Z)
- Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood.
We find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)
- IndicMT Eval: A Dataset to Meta-Evaluate Machine Translation metrics for Indian Languages [25.654787264483183]
We create an MQM dataset consisting of 7000 fine-grained annotations, spanning 5 Indian languages and 7 MT systems.
Our results show that pre-trained metrics, such as COMET, have the highest correlations with annotator scores.
We find that the metrics do not adequately capture fluency-based errors in Indian languages.
arXiv Detail & Related papers (2022-12-20T11:37:22Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to a low-resource, real-world challenge: de-identification of code-mixed (Spanish-Catalan) clinical notes in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z)
- AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages [75.08199398141744]
We present AmericasNLI, an extension of XNLI (Conneau et al.), to 10 indigenous languages of the Americas.
We conduct experiments with XLM-R, testing multiple zero-shot and translation-based approaches.
We find that XLM-R's zero-shot performance is poor for all 10 languages, with an average performance of 38.62%.
arXiv Detail & Related papers (2021-04-18T05:32:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.