IndicMT Eval: A Dataset to Meta-Evaluate Machine Translation metrics for
Indian Languages
- URL: http://arxiv.org/abs/2212.10180v2
- Date: Mon, 3 Jul 2023 14:26:38 GMT
- Title: IndicMT Eval: A Dataset to Meta-Evaluate Machine Translation metrics for
Indian Languages
- Authors: Ananya B. Sai, Vignesh Nagarajan, Tanay Dixit, Raj Dabre, Anoop
Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra
- Abstract summary: We create an MQM dataset consisting of 7000 fine-grained annotations, spanning 5 Indian languages and 7 MT systems.
Our results show that pre-trained metrics, such as COMET, have the highest correlations with annotator scores.
We find that the metrics do not adequately capture fluency-based errors in Indian languages.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid growth of machine translation (MT) systems has necessitated
comprehensive studies to meta-evaluate evaluation metrics being used, which
enables a better selection of metrics that best reflect MT quality.
Unfortunately, most of the research focuses on high-resource languages, mainly
English, the observations for which may not always apply to other languages.
Indian languages, having over a billion speakers, are linguistically different
from English, and to date, there has not been a systematic study of evaluating
MT systems from English into Indian languages. In this paper, we fill this gap
by creating an MQM dataset consisting of 7000 fine-grained annotations,
spanning 5 Indian languages and 7 MT systems, and use it to establish
correlations between annotator scores and scores obtained using existing
automatic metrics. Our results show that pre-trained metrics, such as COMET,
have the highest correlations with annotator scores. Additionally, we find that
the metrics do not adequately capture fluency-based errors in Indian languages,
and there is a need to develop metrics focused on Indian languages. We hope
that our dataset and analysis will help promote further research in this area.
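To make the meta-evaluation procedure concrete, the sketch below converts per-segment MQM error annotations into quality scores and correlates them with an automatic metric's scores. This is a minimal illustration, not the paper's exact pipeline: the severity weights follow common MQM practice and may differ from those used in IndicMT Eval, and the metric scores are toy values standing in for COMET, chrF, or similar outputs.

```python
from scipy.stats import kendalltau, pearsonr

# Assumed MQM severity weights (common practice; the paper's exact
# weighting scheme may differ).
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 10.0}

def mqm_score(errors):
    """Turn one segment's (category, severity) annotations into a
    penalty-based quality score; higher means better."""
    return -sum(SEVERITY_WEIGHTS[severity] for _, severity in errors)

# Toy data: MQM annotations for three segments, plus the scores an
# automatic metric assigned to the same segments.
annotations = [
    [],                                       # error-free segment
    [("fluency/grammar", "minor")],
    [("accuracy/mistranslation", "major")],
]
human_scores = [mqm_score(errs) for errs in annotations]
metric_scores = [0.92, 0.85, 0.40]

# Segment-level agreement between annotators and the metric.
r, _ = pearsonr(human_scores, metric_scores)
tau, _ = kendalltau(human_scores, metric_scores)
print(f"Pearson r = {r:.3f}, Kendall tau = {tau:.3f}")
```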
Related papers
- Evaluating Automatic Metrics with Incremental Machine Translation Systems [55.78547133890403]
We introduce a dataset comprising commercial machine translations, gathered weekly over six years across 12 translation directions.
We assume commercial systems improve over time, which enables us to evaluate machine translation (MT) metrics based on their preference for more recent translations.
Our study confirms several previous findings in MT metrics research and demonstrates the dataset's value as a testbed for metric evaluation.
arXiv Detail & Related papers (2024-07-03T17:04:17Z)
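A minimal sketch of the preference-based evaluation described in the entry above: assuming the newer commercial translation of a source is better, a metric is scored by how often it prefers the newer output. The function and data here are hypothetical, not the paper's code.

```python
def preference_accuracy(pairs, metric):
    """pairs: (older_translation, newer_translation, reference) triples.
    Returns the fraction of pairs where the metric scores the newer
    (assumed better) translation higher than the older one."""
    wins = sum(metric(new, ref) > metric(old, ref) for old, new, ref in pairs)
    return wins / len(pairs)

# Stand-in metric: unigram overlap with the reference.
def overlap(hyp, ref):
    h, r = set(hyp.split()), set(ref.split())
    return len(h & r) / max(len(r), 1)

pairs = [("the cat sat", "the cat sat on the mat", "the cat sat on the mat")]
print(preference_accuracy(pairs, overlap))  # 1.0
```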
- Suvach -- Generated Hindi QA benchmark [0.0]
This paper proposes a new benchmark specifically designed for evaluating Hindi extractive question answering (EQA) models.
This method leverages large language models (LLMs) to generate a high-quality dataset in an extractive setting.
arXiv Detail & Related papers (2024-04-30T04:19:17Z)
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 types of translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
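Contrastive challenge sets of this kind are typically scored by checking whether a metric ranks the correct translation above a minimally perturbed incorrect one. A sketch of that scoring loop (not the ACES codebase; names are illustrative):

```python
def contrastive_accuracy(examples, metric):
    """examples: (good_translation, incorrect_translation, reference)
    triples, where the incorrect translation contains a known accuracy
    error. Returns how often the metric detects the error by scoring
    the good translation higher."""
    detected = sum(
        metric(good, ref) > metric(bad, ref) for good, bad, ref in examples
    )
    return detected / len(examples)
```

Per-phenomenon accuracies can then be aggregated to see which error types a given metric systematically misses.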
- An approach for mistranslation removal from popular dataset for Indic MT Task [5.4755933832880865]
We propose an algorithm to remove mistranslations from the training corpus and evaluate its performance and efficiency.
Two Indic languages (ILs), namely, Hindi (HIN) and Odia (ODI) are chosen for the experiment.
The quality of the translations in the experiment is evaluated using standard metrics such as BLEU, METEOR, and RIBES.
arXiv Detail & Related papers (2024-01-12T06:37:19Z)
- SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation [52.186343500576214]
We introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation.
SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality.
We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE and mFACE.
arXiv Detail & Related papers (2023-05-22T16:25:07Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
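Segment-level metric scores of the kind correlated in that study can be computed with standard libraries; for instance, chrF per segment via sacrebleu (a sketch assuming sacrebleu is installed; COMET and BERTScore ship as separate packages):

```python
from sacrebleu import sentence_chrf

hypotheses = ["the cat sat on a mat", "he go to school"]
references = ["the cat sat on the mat", "he goes to school"]

# One chrF score per segment, as needed for segment-level analyses.
segment_scores = [
    sentence_chrf(hyp, [ref]).score for hyp, ref in zip(hypotheses, references)
]
print(segment_scores)
```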
- Improving Multilingual Neural Machine Translation System for Indic Languages [0.0]
We propose a multilingual neural machine translation (MNMT) system to address the issues related to low-resource language translation.
A state-of-the-art transformer architecture is used to realize the proposed model.
Experiments on a substantial amount of data show that the proposed system outperforms conventional models.
arXiv Detail & Related papers (2022-09-27T09:51:56Z)
- Building Machine Translation Systems for the Next Thousand Languages [102.24310122155073]
We describe results in three research domains: building clean, web-mined datasets for 1500+ languages, developing practical MT models for under-served languages, and studying the limitations of evaluation metrics for these languages.
We hope that our work provides useful insights to practitioners working towards building MT systems for currently understudied languages, and highlights research directions that can complement the weaknesses of massively multilingual models in data-sparse settings.
arXiv Detail & Related papers (2022-05-09T00:24:13Z)
- A Data Bootstrapping Recipe for Low Resource Multilingual Relation Classification [38.83366564843953]
IndoRE is a dataset with 21K entity- and relation-tagged gold sentences in three Indian languages, plus English.
We start with a multilingual BERT (mBERT) based system that captures entity span positions and type information.
We study the accuracy-efficiency tradeoff between expensive gold instances and translated and aligned 'silver' instances.
arXiv Detail & Related papers (2021-10-18T18:40:46Z)
- Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale [52.663117551150954]
A few popular metrics remain the de facto standard for evaluating tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community for more careful consideration of how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z)