Is Context Helpful for Chat Translation Evaluation?
- URL: http://arxiv.org/abs/2403.08314v1
- Date: Wed, 13 Mar 2024 07:49:50 GMT
- Title: Is Context Helpful for Chat Translation Evaluation?
- Authors: Sweta Agrawal, Amin Farajian, Patrick Fernandes, Ricardo Rei, André F.T. Martins
- Abstract summary: We conduct a meta-evaluation of existing sentence-level automatic metrics to assess the quality of machine-translated chats.
We find that reference-free metrics lag behind reference-based ones, especially when evaluating translation quality in out-of-English settings.
We propose a new evaluation metric, Context-MQM, that utilizes bilingual context with a large language model.
- Score: 23.440392979857247
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the recent success of automatic metrics for assessing translation
quality, their application in evaluating the quality of machine-translated
chats has been limited. Unlike more structured texts like news, chat
conversations are often unstructured, short, and heavily reliant on contextual
information. This poses questions about the reliability of existing
sentence-level metrics in this domain as well as the role of context in
assessing the translation quality. Motivated by this, we conduct a
meta-evaluation of existing sentence-level automatic metrics, primarily
designed for structured domains such as news, to assess the quality of
machine-translated chats. We find that reference-free metrics lag behind
reference-based ones, especially when evaluating translation quality in
out-of-English settings. We then investigate how incorporating conversational
contextual information in these metrics affects their performance. Our findings
show that augmenting neural learned metrics with contextual information helps
improve correlation with human judgments in the reference-free scenario and
when evaluating translations in out-of-English settings. Finally, we propose a
new evaluation metric, Context-MQM, that utilizes bilingual context with a
large language model (LLM) and further validate that adding context helps even
for LLM-based evaluation metrics.
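The meta-evaluation described in the abstract rests on comparing metric scores with human judgments. The following is a minimal sketch of that idea, not the authors' code: it assumes Pearson correlation as the agreement statistic (the abstract does not name one), and the metric names and all score values are hypothetical toy numbers chosen only to make the example run.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def meta_evaluate(human_scores, metric_outputs):
    """Map each metric name to its correlation with the human scores."""
    return {
        name: pearson(scores, human_scores)
        for name, scores in metric_outputs.items()
    }


# Hypothetical toy data: one human score and two metric scores per segment.
human = [0.9, 0.2, 0.6, 0.4]
metrics = {
    "metric_no_context": [0.7, 0.5, 0.6, 0.6],    # assumed name, made-up values
    "metric_with_context": [0.8, 0.3, 0.6, 0.5],  # assumed name, made-up values
}
results = meta_evaluate(human, metrics)
for name, corr in results.items():
    print(f"{name}: r = {corr:.3f}")
```

Under this setup, a metric "benefits from context" if its context-augmented variant correlates more strongly with the human scores than its sentence-level variant on the same segments. A rank-based statistic could be swapped in for `pearson` without changing `meta_evaluate`.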
Related papers
- Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics [46.71836180414362]
We introduce an interpretable evaluation framework for Machine Translation (MT) metrics.
Within this framework, we evaluate metrics in two scenarios that serve as proxies for the data filtering and translation re-ranking use cases.
We also raise concerns regarding the reliability of manually curated data following the Direct Assessments+Scalar Quality Metrics (DA+SQM) guidelines.
arXiv Detail & Related papers (2024-10-07T16:42:10Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- Towards Explainable Evaluation Metrics for Natural Language Generation [36.594817754285984]
We identify key properties and propose key goals of explainable machine translation evaluation metrics.
We conduct our own novel experiments, which find that current adversarial NLP techniques are unsuitable for automatically identifying limitations of high-quality black-box evaluation metrics.
arXiv Detail & Related papers (2022-03-21T17:05:54Z)
- BLEU, METEOR, BERTScore: Evaluation of Metrics Performance in Assessing Critical Translation Errors in Sentiment-oriented Text [1.4213973379473654]
Machine Translation (MT) of online content is commonly used to process posts written in several languages.
In this paper, we assess the ability of automatic quality metrics to detect critical machine translation errors.
We conclude that there is a need for fine-tuning of automatic metrics to make them more robust in detecting sentiment critical errors.
arXiv Detail & Related papers (2021-09-29T07:51:17Z)
- Measuring and Increasing Context Usage in Context-Aware Machine Translation [64.5726087590283]
We introduce a new metric, conditional cross-mutual information, to quantify the usage of context by machine translation models.
We then introduce a new, simple training method, context-aware word dropout, to increase the usage of context by context-aware models.
arXiv Detail & Related papers (2021-05-07T19:55:35Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary [65.37544133256499]
We propose a metric to evaluate the content quality of a summary using question-answering (QA).
We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
arXiv Detail & Related papers (2020-10-01T15:33:09Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
- Can Your Context-Aware MT System Pass the DiP Benchmark Tests?: Evaluation Benchmarks for Discourse Phenomena in Machine Translation [7.993547048820065]
We introduce first-of-their-kind MT benchmark datasets that aim to track and hail improvements across four main discourse phenomena.
Surprisingly, we find that existing context-aware models do not improve discourse-related translations consistently across languages and phenomena.
arXiv Detail & Related papers (2020-04-30T07:15:36Z)