An Analysis on Automated Metrics for Evaluating Japanese-English Chat Translation
- URL: http://arxiv.org/abs/2412.18190v1
- Date: Tue, 24 Dec 2024 05:54:40 GMT
- Title: An Analysis on Automated Metrics for Evaluating Japanese-English Chat Translation
- Authors: Andre Rusli, Makoto Shishido
- Abstract summary: We show that, for ranking NMT models in chat translation, all metrics appear consistent in deciding which model outperforms the others.
For correlation with human judgment, however, neural-based metrics outperform traditional ones, with COMET achieving the highest correlation with the human-annotated scores on chat translation.
- Abstract: This paper analyses how traditional baseline metrics, such as BLEU and TER, and neural-based methods, such as BERTScore and COMET, score the performance of several NMT models on chat translation, and how these metrics compare against human-annotated scores. The results show that for ranking NMT models in chat translation, all metrics seem consistent in deciding which model outperforms the others. This implies that traditional baseline metrics, which are faster and simpler to use, can still be helpful. When it comes to correlation with human judgment, however, neural-based metrics outperform traditional ones, with COMET achieving the highest correlation with the human-annotated scores on chat translation. However, we show that even the best metric struggles when scoring English translations of Japanese sentences containing anaphoric zero pronouns.
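As a rough illustration of the evaluation setup described above (not the authors' exact pipeline), the sketch below scores a handful of invented Japanese-to-English chat translations with BLEU, TER, and chrF via sacrebleu; neural metrics such as BERTScore and COMET would be computed analogously with their respective packages.

```python
# Minimal sketch of corpus-level scoring with traditional metrics, assuming
# sacrebleu is installed (pip install sacrebleu). The example sentences are
# invented for illustration and are not from the paper's chat corpus.
from sacrebleu.metrics import BLEU, CHRF, TER

# English hypotheses produced by an NMT system for Japanese chat turns,
# paired with one human reference each (both hypothetical).
hypotheses = [
    "Did you finish the report?",
    "I will send it tonight.",
]
references = [
    "Have you finished the report?",
    "I'll send it over tonight.",
]

for metric in (BLEU(), TER(), CHRF()):
    # corpus_score expects a list of hypotheses and a list of reference streams.
    print(metric.corpus_score(hypotheses, [references]))
```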
Related papers
- Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics [46.71836180414362]
We introduce an interpretable evaluation framework for Machine Translation (MT) metrics.
Within this framework, we evaluate metrics in two scenarios that serve as proxies for the data filtering and translation re-ranking use cases.
We also raise concerns regarding the reliability of manually curated data following the Direct Assessments+Scalar Quality Metrics (DA+SQM) guidelines.
arXiv Detail & Related papers (2024-10-07T16:42:10Z)
- Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
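Matching-based metrics such as BLEU and chrF can already be fed several reference streams at once, which is the kind of multi-reference setup this proposal argues for. A minimal sketch with sacrebleu, on invented sentences:

```python
# Sketch of multi-reference scoring with sacrebleu (assumed installed).
# Each reference stream must align one-to-one with the hypotheses; the
# sentences here are invented for illustration.
from sacrebleu.metrics import BLEU, CHRF

hypotheses = ["The meeting was moved to Friday."]
refs_stream_1 = ["The meeting has been moved to Friday."]
refs_stream_2 = ["They rescheduled the meeting for Friday."]

bleu, chrf = BLEU(), CHRF()
print(bleu.corpus_score(hypotheses, [refs_stream_1, refs_stream_2]))
print(chrf.corpus_score(hypotheses, [refs_stream_1, refs_stream_2]))
```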
arXiv Detail & Related papers (2023-08-06T14:49:26Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- The Inside Story: Towards Better Understanding of Machine Translation Neural Evaluation Metrics [8.432864879027724]
We develop and compare several neural explainability methods and demonstrate their effectiveness for interpreting state-of-the-art fine-tuned neural metrics.
Our study reveals that these metrics leverage token-level information that can be directly attributed to translation errors.
arXiv Detail & Related papers (2023-05-19T16:42:17Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
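The segment-level analyses in this line of work boil down to correlating per-segment metric scores with an external signal (human judgments or downstream task outcomes). A minimal sketch with scipy, on invented numbers:

```python
# Toy segment-level correlation analysis (scipy assumed installed).
# metric_scores would come from an automatic metric, gold_scores from
# human annotation or a downstream task outcome; values are invented.
from scipy.stats import kendalltau, pearsonr

metric_scores = [0.82, 0.45, 0.91, 0.33, 0.67]
gold_scores   = [0.80, 0.60, 0.95, 0.20, 0.50]

r, _ = pearsonr(metric_scores, gold_scores)
tau, _ = kendalltau(metric_scores, gold_scores)
print(f"Pearson r = {r:.3f}, Kendall tau = {tau:.3f}")
```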
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis [79.18261352971284]
We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation.
We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings.
SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.
arXiv Detail & Related papers (2022-10-10T22:30:26Z)
- SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that our proposed metric, with a model-based matching function, outperforms all competing metrics in system-level correlation.
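A rough sketch of the sentence-level soft-matching idea (not the SMART implementation): split candidate and reference into sentences, score every sentence pair with a matching function, and aggregate greedy precision and recall into an F-measure. The string-overlap matcher and helper names below are stand-ins for the model-based matching function the paper uses.

```python
# Toy sentence-level soft matching in the spirit of SMART, not its actual
# implementation. difflib's ratio() stands in for a learned matching function.
from difflib import SequenceMatcher

def sent_split(text: str) -> list[str]:
    # Naive sentence splitter, for illustration only.
    return [s.strip() for s in text.split(".") if s.strip()]

def match(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def soft_f1(candidate: str, reference: str) -> float:
    cand, ref = sent_split(candidate), sent_split(reference)
    # Greedy soft precision/recall: each sentence takes its best match.
    precision = sum(max(match(c, r) for r in ref) for c in cand) / len(cand)
    recall = sum(max(match(r, c) for c in cand) for r in ref) / len(ref)
    return 2 * precision * recall / (precision + recall)

print(soft_f1("He arrived late. The meeting went well.",
              "The meeting went well. He was late to arrive."))
```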
arXiv Detail & Related papers (2022-08-01T17:58:05Z) - NMTScore: A Multilingual Analysis of Translation-based Text Similarity
Measures [42.46681912294797]
We analyze translation-based similarity measures in the common framework of multilingual NMT.
Compared to baselines such as sentence embeddings, translation-based measures prove competitive in paraphrase identification.
The measures also show a relatively high correlation with human judgments.
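One way to get a feel for translation-based similarity (an illustrative variant, not the exact measures defined in the paper): score how plausibly a multilingual NMT model "translates" one sentence into the other via forced decoding, and symmetrize. The model choice, length normalization, and helper names below are assumptions made for the sketch.

```python
# Illustrative translation-based similarity via forced decoding with a
# multilingual NMT model (torch and transformers assumed installed).
# Not the paper's exact formulation; model and normalization are assumptions.
import torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

MODEL = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(MODEL)
model = M2M100ForConditionalGeneration.from_pretrained(MODEL).eval()

def forced_score(src: str, tgt: str, src_lang: str = "en", tgt_lang: str = "en") -> float:
    """Negative mean token cross-entropy of tgt given src (higher is better)."""
    tokenizer.src_lang, tokenizer.tgt_lang = src_lang, tgt_lang
    enc = tokenizer(src, text_target=tgt, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)  # loss is the mean cross-entropy over label tokens
    return -out.loss.item()

def translation_similarity(a: str, b: str) -> float:
    # Symmetrize: each sentence is scored as a "translation" of the other,
    # which doubles as a paraphrase similarity when both share one language.
    return 0.5 * (forced_score(a, b) + forced_score(b, a))

print(translation_similarity("The cat sat on the mat.",
                             "A cat was sitting on the mat."))
```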
arXiv Detail & Related papers (2022-04-28T17:57:17Z)