Human Judgement as a Compass to Navigate Automatic Metrics for Formality Transfer
- URL: http://arxiv.org/abs/2204.07549v1
- Date: Fri, 15 Apr 2022 17:15:52 GMT
- Title: Human Judgement as a Compass to Navigate Automatic Metrics for Formality Transfer
- Authors: Huiyuan Lai, Jiali Mao, Antonio Toral, Malvina Nissim
- Abstract summary: We focus on the task of formality transfer, and on the three aspects that are usually evaluated: style strength, content preservation, and fluency.
We offer some recommendations on the use of such metrics in formality transfer, also with an eye to their generalisability (or not) to related tasks.
- Score: 13.886432536330807
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although text style transfer has witnessed rapid development in recent years,
there is as yet no established standard for evaluation, which is performed
using several automatic metrics, lacking the possibility of always resorting to
human judgement. We focus on the task of formality transfer, and on the three
aspects that are usually evaluated: style strength, content preservation, and
fluency. To cast light on how such aspects are assessed by common and new
metrics, we run a human-based evaluation and perform a rich correlation
analysis. We are then able to offer some recommendations on the use of such
metrics in formality transfer, also with an eye to their generalisability (or
not) to related tasks.
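As a rough illustration of the correlation analysis described in the abstract, the sketch below compares automatic-metric scores against averaged human ratings for a single aspect (e.g. content preservation) using Pearson and Spearman correlation. This is not the authors' code: the scores, ratings, and use of scipy are assumptions made purely for illustration.

```python
# Minimal sketch (not the paper's code): correlating an automatic metric
# with human judgements for one evaluation aspect of formality transfer.
# All scores and ratings below are hypothetical placeholder values.
from scipy.stats import pearsonr, spearmanr

# Hypothetical system outputs scored by an automatic metric
# (e.g. a content-preservation metric scaled to [0, 1]) ...
metric_scores = [0.62, 0.71, 0.55, 0.80, 0.47, 0.90, 0.66]
# ... and by human annotators on a 1-5 Likert scale (averaged per output).
human_ratings = [3.5, 4.0, 3.0, 4.5, 2.5, 5.0, 3.5]

pearson_r, _ = pearsonr(metric_scores, human_ratings)
spearman_rho, _ = spearmanr(metric_scores, human_ratings)

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
```

Spearman is commonly reported alongside Pearson in metric meta-evaluation because it assumes only a monotonic (not linear) relationship between metric scores and human ratings.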
Related papers
- A Measure of the System Dependence of Automated Metrics [9.594167080604207]
We argue that it is equally important to ensure that metrics treat all systems fairly and consistently.
In this paper, we introduce a method to evaluate this aspect.
arXiv Detail & Related papers (2024-12-04T09:21:46Z)
- Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In! [80.3129093617928]
Annually, at the Conference on Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics.
This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metrics rankings.
We introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness.
arXiv Detail & Related papers (2024-08-25T13:29:34Z)
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z)
- Learning and Evaluating Human Preferences for Conversational Head Generation [101.89332968344102]
We propose a novel learning-based evaluation metric named Preference Score (PS) for fitting human preference according to the quantitative evaluations across different dimensions.
PS can serve as a quantitative evaluation without the need for human annotation.
arXiv Detail & Related papers (2023-07-20T07:04:16Z)
- HAUSER: Towards Holistic and Automatic Evaluation of Simile Generation [18.049566239050762]
Proper evaluation metrics are like a beacon guiding the research of simile generation (SG).
To address the issues, we establish HAUSER, a holistic and automatic evaluation system for the SG task, which consists of five criteria from three perspectives and automatic metrics for each criterion.
Our metrics are significantly more correlated with human ratings from each perspective compared with prior automatic metrics.
arXiv Detail & Related papers (2023-06-13T06:06:01Z)
- Recall, Robustness, and Lexicographic Evaluation [49.13362412522523]
The application of recall without a formal evaluative motivation has led to criticism of recall as a vague or inappropriate measure.
Our analysis is composed of three tenets: recall, robustness, and lexicographic evaluation.
arXiv Detail & Related papers (2023-02-22T13:39:54Z)
- The Glass Ceiling of Automatic Evaluation in Natural Language Generation [60.59732704936083]
We take a step back and analyze recent progress by comparing the body of existing automatic metrics and human metrics.
Our extensive statistical analysis reveals surprising findings: automatic metrics -- old and new -- are much more similar to each other than to humans.
arXiv Detail & Related papers (2022-08-31T01:13:46Z)
- Evaluating the Evaluation Metrics for Style Transfer: A Case Study in Multilingual Formality Transfer [11.259786293913606]
This work is the first multilingual evaluation of metrics in style transfer (ST).
We evaluate leading ST automatic metrics on the oft-researched task of formality style transfer.
We identify several models that correlate well with human judgments and are robust across languages.
arXiv Detail & Related papers (2021-10-20T17:21:09Z)
- On the interaction of automatic evaluation and task framing in headline style transfer [6.27489964982972]
In this paper, we propose an evaluation method for a task involving subtle textual differences, such as style transfer.
We show that it better reflects system differences than traditional metrics such as BLEU and ROUGE.
arXiv Detail & Related papers (2021-01-05T16:36:26Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.