IndicMT Eval: A Dataset to Meta-Evaluate Machine Translation metrics for
Indian Languages
- URL: http://arxiv.org/abs/2212.10180v2
- Date: Mon, 3 Jul 2023 14:26:38 GMT
- Title: IndicMT Eval: A Dataset to Meta-Evaluate Machine Translation metrics for
Indian Languages
- Authors: Ananya B. Sai, Vignesh Nagarajan, Tanay Dixit, Raj Dabre, Anoop
Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra
- Abstract summary: We create an MQM dataset consisting of 7000 fine-grained annotations, spanning 5 Indian languages and 7 MT systems.
Our results show that pre-trained metrics, such as COMET, have the highest correlations with annotator scores.
We find that the metrics do not adequately capture fluency-based errors in Indian languages.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid growth of machine translation (MT) systems has necessitated
comprehensive studies to meta-evaluate evaluation metrics being used, which
enables a better selection of metrics that best reflect MT quality.
Unfortunately, most of the research focuses on high-resource languages, mainly
English, the observations for which may not always apply to other languages.
Indian languages, having over a billion speakers, are linguistically different
from English, and to date, there has not been a systematic study of evaluating
MT systems from English into Indian languages. In this paper, we fill this gap
by creating an MQM dataset consisting of 7000 fine-grained annotations,
spanning 5 Indian languages and 7 MT systems, and use it to establish
correlations between annotator scores and scores obtained using existing
automatic metrics. Our results show that pre-trained metrics, such as COMET,
have the highest correlations with annotator scores. Additionally, we find that
the metrics do not adequately capture fluency-based errors in Indian languages,
and there is a need to develop metrics focused on Indian languages. We hope
that our dataset and analysis will help promote further research in this area.
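To make the meta-evaluation procedure concrete, the sketch below converts per-segment MQM error annotations into quality scores and correlates them with an automatic metric's scores. This is a minimal illustration, not the paper's exact pipeline: the severity weights follow common MQM practice and may differ from those used in IndicMT Eval, and the metric scores are toy values standing in for COMET, chrF, or similar outputs.

```python
from scipy.stats import kendalltau, pearsonr

# Assumed MQM severity weights (common practice; the paper's exact
# weighting scheme may differ).
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 10.0}

def mqm_score(errors):
    """Turn one segment's (category, severity) annotations into a
    penalty-based quality score; higher means better."""
    return -sum(SEVERITY_WEIGHTS[severity] for _, severity in errors)

# Toy data: MQM annotations for three segments, plus the scores an
# automatic metric assigned to the same segments.
annotations = [
    [],                                       # error-free segment
    [("fluency/grammar", "minor")],
    [("accuracy/mistranslation", "major")],
]
human_scores = [mqm_score(errs) for errs in annotations]
metric_scores = [0.92, 0.85, 0.40]

# Segment-level agreement between annotators and the metric.
r, _ = pearsonr(human_scores, metric_scores)
tau, _ = kendalltau(human_scores, metric_scores)
print(f"Pearson r = {r:.3f}, Kendall tau = {tau:.3f}")
```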
Related papers
- Evaluating Automatic Metrics with Incremental Machine Translation Systems [55.78547133890403]
We introduce a dataset comprising commercial machine translations, gathered weekly over six years across 12 translation directions.
We assume commercial systems improve over time, which enables us to evaluate machine translation (MT) metrics based on their preference for more recent translations.
Our study confirms several previous findings in MT metrics research and demonstrates the dataset's value as a testbed for metric evaluation.
arXiv Detail & Related papers (2024-07-03T17:04:17Z)
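A minimal sketch of the preference-based evaluation described in the entry above: assuming the newer commercial translation of a source is better, a metric is scored by how often it prefers the newer output. The function and data here are hypothetical, not the paper's code.

```python
def preference_accuracy(pairs, metric):
    """pairs: (older_translation, newer_translation, reference) triples.
    Returns the fraction of pairs where the metric scores the newer
    (assumed better) translation higher than the older one."""
    wins = sum(metric(new, ref) > metric(old, ref) for old, new, ref in pairs)
    return wins / len(pairs)

# Stand-in metric: unigram overlap with the reference.
def overlap(hyp, ref):
    h, r = set(hyp.split()), set(ref.split())
    return len(h & r) / max(len(r), 1)

pairs = [("the cat sat", "the cat sat on the mat", "the cat sat on the mat")]
print(preference_accuracy(pairs, overlap))  # 1.0
```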
- Suvach -- Generated Hindi QA benchmark [0.0]
This paper proposes a new benchmark specifically designed for evaluating Hindi extractive question answering (EQA) models.
This method leverages large language models (LLMs) to generate a high-quality dataset in an extractive setting.
arXiv Detail & Related papers (2024-04-30T04:19:17Z)
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 types of translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
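Contrastive challenge sets of this kind are typically scored by checking whether a metric ranks the correct translation above a minimally perturbed incorrect one. A sketch of that scoring loop (not the ACES codebase; names are illustrative):

```python
def contrastive_accuracy(examples, metric):
    """examples: (good_translation, incorrect_translation, reference)
    triples, where the incorrect translation contains a known accuracy
    error. Returns how often the metric detects the error by scoring
    the good translation higher."""
    detected = sum(
        metric(good, ref) > metric(bad, ref) for good, bad, ref in examples
    )
    return detected / len(examples)
```

Per-phenomenon accuracies can then be aggregated to see which error types a given metric systematically misses.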
- An approach for mistranslation removal from popular dataset for Indic MT Task [5.4755933832880865]
We propose an algorithm to remove mistranslations from the training corpus and evaluate its performance and efficiency.
Two Indic languages (ILs), namely, Hindi (HIN) and Odia (ODI) are chosen for the experiment.
The quality of the translations in the experiment is evaluated using standard metrics such as BLEU, METEOR, and RIBES.
arXiv Detail & Related papers (2024-01-12T06:37:19Z)
- SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation [52.186343500576214]
We introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation.
SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality.
We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE and mFACE.
arXiv Detail & Related papers (2023-05-22T16:25:07Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
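Segment-level metric scores of the kind correlated in that study can be computed with standard libraries; for instance, chrF per segment via sacrebleu (a sketch assuming sacrebleu is installed; COMET and BERTScore ship as separate packages):

```python
from sacrebleu import sentence_chrf

hypotheses = ["the cat sat on a mat", "he go to school"]
references = ["the cat sat on the mat", "he goes to school"]

# One chrF score per segment, as needed for segment-level analyses.
segment_scores = [
    sentence_chrf(hyp, [ref]).score for hyp, ref in zip(hypotheses, references)
]
print(segment_scores)
```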
- Improving Multilingual Neural Machine Translation System for Indic Languages [0.0]
We propose a multilingual neural machine translation (MNMT) system to address the issues related to low-resource language translation.
A state-of-the-art transformer architecture is used to realize the proposed model.
Experiments on a substantial amount of data show that the proposed system outperforms conventional models.
arXiv Detail & Related papers (2022-09-27T09:51:56Z)
- Building Machine Translation Systems for the Next Thousand Languages [102.24310122155073]
We describe results in three research domains: building clean, web-mined datasets for 1500+ languages, developing practical MT models for under-served languages, and studying the limitations of evaluation metrics for these languages.
We hope that our work provides useful insights to practitioners working towards building MT systems for currently understudied languages, and highlights research directions that can complement the weaknesses of massively multilingual models in data-sparse settings.
arXiv Detail & Related papers (2022-05-09T00:24:13Z)
- A Data Bootstrapping Recipe for Low Resource Multilingual Relation Classification [38.83366564843953]
IndoRE is a dataset with 21K entity- and relation-tagged gold sentences in three Indian languages, plus English.
We start with a multilingual BERT (mBERT) based system that captures entity span positions and type information.
We study the accuracy-efficiency tradeoff between expensive gold instances and translated and aligned 'silver' instances.
arXiv Detail & Related papers (2021-10-18T18:40:46Z)
- Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale [52.663117551150954]
A few popular metrics remain the de facto standard for evaluating tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community for more careful consideration of how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z)