Human Judgement as a Compass to Navigate Automatic Metrics for Formality Transfer
- URL: http://arxiv.org/abs/2204.07549v1
- Date: Fri, 15 Apr 2022 17:15:52 GMT
- Title: Human Judgement as a Compass to Navigate Automatic Metrics for Formality Transfer
- Authors: Huiyuan Lai, Jiali Mao, Antonio Toral, Malvina Nissim
- Abstract summary: We focus on the task of formality transfer, and on the three aspects that are usually evaluated: style strength, content preservation, and fluency.
We offer some recommendations on the use of such metrics in formality transfer, also with an eye to their generalisability (or not) to related tasks.
- Score: 13.886432536330807
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although text style transfer has witnessed rapid development in recent years,
there is as yet no established standard for evaluation, which is performed
using several automatic metrics, lacking the possibility of always resorting to
human judgement. We focus on the task of formality transfer, and on the three
aspects that are usually evaluated: style strength, content preservation, and
fluency. To cast light on how such aspects are assessed by common and new
metrics, we run a human-based evaluation and perform a rich correlation
analysis. We are then able to offer some recommendations on the use of such
metrics in formality transfer, also with an eye to their generalisability (or
not) to related tasks.
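To make the correlation analysis concrete, the sketch below shows one common way of relating automatic metric scores to human ratings with Spearman's rank correlation, separately for style strength, content preservation, and fluency. All scores are invented placeholders and the snippet only assumes scipy; it illustrates the general procedure, not the paper's actual data or setup.

```python
# Minimal sketch of a metric-vs-human correlation analysis.
# All scores below are invented placeholders, not data from the paper.
from scipy.stats import spearmanr

# Hypothetical scores for five system outputs, one list per evaluated aspect.
human_ratings = {
    "style_strength":       [4.0, 3.5, 2.0, 4.5, 3.0],
    "content_preservation": [3.8, 4.2, 2.5, 4.0, 3.1],
    "fluency":              [4.5, 4.0, 3.0, 4.8, 3.5],
}
automatic_scores = {
    "style_strength":       [0.92, 0.80, 0.55, 0.95, 0.70],  # e.g. a formality classifier
    "content_preservation": [0.61, 0.70, 0.40, 0.66, 0.50],  # e.g. an n-gram overlap metric
    "fluency":              [0.85, 0.78, 0.60, 0.90, 0.65],  # e.g. a language-model score
}

for aspect in human_ratings:
    rho, p_value = spearmanr(automatic_scores[aspect], human_ratings[aspect])
    print(f"{aspect}: Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

A higher correlation for a given aspect indicates that the automatic metric is a better proxy for human judgement of that aspect, which is the kind of evidence the paper's recommendations build on.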
Related papers
- Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In! [80.3129093617928]
Annually, at the Conference on Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics.
This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metrics rankings.
We introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness.
arXiv Detail & Related papers (2024-08-25T13:29:34Z)
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z)
- OpinSummEval: Revisiting Automated Evaluation for Opinion Summarization [52.720711541731205]
We present OpinSummEval, a dataset comprising human judgments and outputs from 14 opinion summarization models.
Our findings indicate that metrics based on neural networks generally outperform non-neural ones.
arXiv Detail & Related papers (2023-10-27T13:09:54Z)
- Learning and Evaluating Human Preferences for Conversational Head Generation [101.89332968344102]
We propose a novel learning-based evaluation metric named Preference Score (PS) for fitting human preference according to the quantitative evaluations across different dimensions.
PS can serve as a quantitative evaluation without the need for human annotation.
arXiv Detail & Related papers (2023-07-20T07:04:16Z)
- HAUSER: Towards Holistic and Automatic Evaluation of Simile Generation [18.049566239050762]
Proper evaluation metrics are like a beacon guiding the research of simile generation (SG).
To address the issues, we establish HAUSER, a holistic and automatic evaluation system for the SG task, which consists of five criteria from three perspectives and automatic metrics for each criterion.
Our metrics are significantly more correlated with human ratings from each perspective compared with prior automatic metrics.
arXiv Detail & Related papers (2023-06-13T06:06:01Z)
- The Glass Ceiling of Automatic Evaluation in Natural Language Generation [60.59732704936083]
We take a step back and analyze recent progress by comparing the body of existing automatic metrics and human metrics.
Our extensive statistical analysis reveals surprising findings: automatic metrics -- old and new -- are much more similar to each other than to humans.
arXiv Detail & Related papers (2022-08-31T01:13:46Z)
- Towards Explainable Evaluation Metrics for Natural Language Generation [36.594817754285984]
We identify key properties and propose key goals of explainable machine translation evaluation metrics.
We conduct our own novel experiments, which find that current adversarial NLP techniques are unsuitable for automatically identifying limitations of high-quality black-box evaluation metrics.
arXiv Detail & Related papers (2022-03-21T17:05:54Z)
- Evaluating the Evaluation Metrics for Style Transfer: A Case Study in Multilingual Formality Transfer [11.259786293913606]
This work is the first multilingual evaluation of metrics in style transfer (ST).
We evaluate leading ST automatic metrics on the oft-researched task of formality style transfer.
We identify several models that correlate well with human judgments and are robust across languages.
arXiv Detail & Related papers (2021-10-20T17:21:09Z)
- On the interaction of automatic evaluation and task framing in headline style transfer [6.27489964982972]
In this paper, we propose an evaluation method for a task involving subtle textual differences, such as style transfer.
We show that it better reflects system differences than traditional metrics such as BLEU and ROUGE (a minimal sketch of computing these surface-overlap metrics follows this list).
arXiv Detail & Related papers (2021-01-05T16:36:26Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
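Several of the entries above contrast learned or reference-free evaluation with traditional surface-overlap metrics. As a point of reference, here is a minimal sketch of how BLEU and ROUGE are commonly computed, assuming the sacrebleu and rouge-score packages; the sentences are invented examples, not data from any of the papers.

```python
# Minimal sketch of computing BLEU and ROUGE for a hypothesis/reference pair.
# Assumes the sacrebleu and rouge-score packages; sentences are invented examples.
import sacrebleu
from rouge_score import rouge_scorer

hypotheses = ["Could you please send me the report by Friday?"]
references = ["Could you kindly send me the report by Friday?"]

# Corpus-level BLEU: sacrebleu expects a list of hypotheses and a list of
# reference streams (one stream per reference set).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")

# Sentence-level ROUGE-1 and ROUGE-L F-scores.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(references[0], hypotheses[0])
print(f"ROUGE-1 F1: {scores['rouge1'].fmeasure:.2f}")
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.2f}")
```

Because these scores depend only on token overlap with a reference, they can miss the subtle stylistic differences that the headline style transfer entry above is concerned with.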
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.