Embarrassingly Easy Document-Level MT Metrics: How to Convert Any
Pretrained Metric Into a Document-Level Metric
- URL: http://arxiv.org/abs/2209.13654v1
- Date: Tue, 27 Sep 2022 19:42:22 GMT
- Title: Embarrassingly Easy Document-Level MT Metrics: How to Convert Any
Pretrained Metric Into a Document-Level Metric
- Authors: Giorgos Vernikos, Brian Thompson, Prashant Mathur, Marcello Federico
- Abstract summary: We present a method for extending pretrained metrics to incorporate context at the document level.
We show that the extended metrics outperform their sentence-level counterparts in about 85% of the tested conditions.
Our experimental results support our initial hypothesis and show that a simple extension of the metrics permits them to take advantage of context.
- Score: 15.646714712131148
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We hypothesize that existing sentence-level machine translation (MT) metrics
become less effective when the human reference contains ambiguities. To verify
this hypothesis, we present a very simple method for extending pretrained
metrics to incorporate context at the document level. We apply our method to
three popular metrics, BERTScore, Prism, and COMET, and to the reference-free
metric COMET-QE. We evaluate the extended metrics on the WMT 2021 metrics
shared task using the provided MQM annotations. Our results show that the
extended metrics outperform their sentence-level counterparts in about 85% of
the tested conditions, when excluding results on low-quality human references.
Additionally, we show that our document-level extension of COMET-QE
dramatically improves its accuracy on discourse phenomena tasks, outperforming
a dedicated baseline by up to 6.1%. Our experimental results support our
initial hypothesis and show that a simple extension of the metrics permits them
to take advantage of context to resolve ambiguities in the reference.
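The abstract does not spell out the mechanics of the extension, but the core idea of handing a pretrained sentence-level metric the surrounding document context can be sketched as follows. This is a minimal illustration, assuming context is simply prepended to each hypothesis and reference before scoring; the function doc_level_scores, the context_size parameter, and the toy score_fn are hypothetical stand-ins, not the paper's exact implementation.

```python
# Minimal sketch of a document-level extension of a sentence-level MT metric:
# prepend the preceding sentences of the document to both hypothesis and
# reference before calling an existing metric. The context window size and the
# way context is fed to the metric are assumptions for illustration only.
from typing import Callable, List


def doc_level_scores(
    hypotheses: List[str],
    references: List[str],
    score_fn: Callable[[str, str], float],  # any sentence-level metric: (hyp, ref) -> score
    context_size: int = 2,                  # hypothetical: number of preceding sentences to prepend
    sep: str = " ",                         # hypothetical separator between context and current sentence
) -> List[float]:
    """Score each sentence with its preceding document context prepended."""
    scores = []
    for i, (hyp, ref) in enumerate(zip(hypotheses, references)):
        start = max(0, i - context_size)
        hyp_with_ctx = sep.join(hypotheses[start:i] + [hyp])
        ref_with_ctx = sep.join(references[start:i] + [ref])
        scores.append(score_fn(hyp_with_ctx, ref_with_ctx))
    return scores


if __name__ == "__main__":
    # Toy usage with a trivial stand-in metric (token overlap), purely illustrative.
    def overlap(hyp: str, ref: str) -> float:
        h, r = set(hyp.split()), set(ref.split())
        return len(h & r) / max(len(r), 1)

    hyps = ["He saw it .", "It was red ."]
    refs = ["She saw it .", "It was red ."]
    print(doc_level_scores(hyps, refs, overlap))
```

In practice, score_fn would be an existing metric such as BERTScore, Prism, or COMET; the sketch only shows where the document context enters the computation.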
Related papers
- Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics [46.71836180414362]
We introduce an interpretable evaluation framework for Machine Translation (MT) metrics.
Within this framework, we evaluate metrics in two scenarios that serve as proxies for the data filtering and translation re-ranking use cases.
We also raise concerns regarding the reliability of manually curated data following the Direct Assessments+Scalar Quality Metrics (DA+SQM) guidelines.
arXiv Detail & Related papers (2024-10-07T16:42:10Z)
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 types of translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
- Improving Metrics for Speech Translation [1.2891210250935146]
We introduce Parallel Paraphrasing ($\text{Para}_\text{both}$), an augmentation method for translation metrics that makes use of automatic paraphrasing of both the reference and the hypothesis.
We show that applying our method to commonly used metrics significantly improves their correlation with human quality perception.
arXiv Detail & Related papers (2023-05-22T11:01:38Z)
- MENLI: Robust Evaluation Metrics from Natural Language Inference [26.53850343633923]
Recently proposed BERT-based evaluation metrics for text generation perform well on standard benchmarks but are vulnerable to adversarial attacks.
We develop evaluation metrics based on Natural Language Inference (NLI).
We show that our NLI-based metrics are much more robust to these attacks than the recent BERT-based metrics.
arXiv Detail & Related papers (2022-08-15T16:30:14Z)
- SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate the limitations of token-level matching.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that, with a model-based matching function, the system-level correlations of our proposed metric outperform those of all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z)
- On the Intrinsic and Extrinsic Fairness Evaluation Metrics for Contextualized Language Representations [74.70957445600936]
Multiple metrics have been introduced to measure fairness in various natural language processing tasks.
These metrics fall roughly into two categories: 1) extrinsic metrics for evaluating fairness in downstream applications and 2) intrinsic metrics for estimating fairness in upstream language representation models.
arXiv Detail & Related papers (2022-03-25T22:17:43Z)
- QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization [116.56171113972944]
We show that carefully choosing the components of a QA-based metric is critical to performance.
Our solution improves upon the best-performing entailment-based metric and achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-12-16T00:38:35Z)
- Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary [65.37544133256499]
We propose a metric to evaluate the content quality of a summary using question-answering (QA).
We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
arXiv Detail & Related papers (2020-10-01T15:33:09Z)
- Using Context in Neural Machine Translation Training Objectives [23.176247496139574]
We present Neural Machine Translation (NMT) training using document-level metrics with batch-level documents.
We demonstrate that training is more robust with document-level metrics than with sequence-level metrics.
arXiv Detail & Related papers (2020-05-04T13:42:30Z)