Human Judgement as a Compass to Navigate Automatic Metrics for Formality Transfer
- URL: http://arxiv.org/abs/2204.07549v1
- Date: Fri, 15 Apr 2022 17:15:52 GMT
- Title: Human Judgement as a Compass to Navigate Automatic Metrics for Formality Transfer
- Authors: Huiyuan Lai, Jiali Mao, Antonio Toral, Malvina Nissim
- Abstract summary: We focus on the task of formality transfer, and on the three aspects that are usually evaluated: style strength, content preservation, and fluency.
We offer some recommendations on the use of such metrics in formality transfer, also with an eye to their generalisability (or not) to related tasks.
- Score: 13.886432536330807
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although text style transfer has witnessed rapid development in recent years,
there is as yet no established standard for evaluation, which is performed
using several automatic metrics, lacking the possibility of always resorting to
human judgement. We focus on the task of formality transfer, and on the three
aspects that are usually evaluated: style strength, content preservation, and
fluency. To cast light on how such aspects are assessed by common and new
metrics, we run a human-based evaluation and perform a rich correlation
analysis. We are then able to offer some recommendations on the use of such
metrics in formality transfer, also with an eye to their generalisability (or
not) to related tasks.
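As a rough illustration of the correlation analysis described in the abstract, the sketch below compares automatic-metric scores against averaged human ratings for a single aspect (e.g. content preservation) using Pearson and Spearman correlation. This is not the authors' code: the scores, ratings, and use of scipy are assumptions made purely for illustration.

```python
# Minimal sketch (not the paper's code): correlating an automatic metric
# with human judgements for one evaluation aspect of formality transfer.
# All scores and ratings below are hypothetical placeholder values.
from scipy.stats import pearsonr, spearmanr

# Hypothetical system outputs scored by an automatic metric
# (e.g. a content-preservation metric scaled to [0, 1]) ...
metric_scores = [0.62, 0.71, 0.55, 0.80, 0.47, 0.90, 0.66]
# ... and by human annotators on a 1-5 Likert scale (averaged per output).
human_ratings = [3.5, 4.0, 3.0, 4.5, 2.5, 5.0, 3.5]

pearson_r, _ = pearsonr(metric_scores, human_ratings)
spearman_rho, _ = spearmanr(metric_scores, human_ratings)

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
```

Spearman is commonly reported alongside Pearson in metric meta-evaluation because it assumes only a monotonic (not linear) relationship between metric scores and human ratings.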
Related papers
- A Measure of the System Dependence of Automated Metrics [9.594167080604207]
We argue that it is equally important to ensure that metrics treat all systems fairly and consistently.
In this paper, we introduce a method to evaluate this aspect.
arXiv Detail & Related papers (2024-12-04T09:21:46Z)
- Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In! [80.3129093617928]
Annually, at the Conference on Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics.
This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metrics rankings.
We introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness.
arXiv Detail & Related papers (2024-08-25T13:29:34Z)
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z)
- Learning and Evaluating Human Preferences for Conversational Head Generation [101.89332968344102]
We propose a novel learning-based evaluation metric named Preference Score (PS) for fitting human preference according to the quantitative evaluations across different dimensions.
PS can serve as a quantitative evaluation without the need for human annotation.
arXiv Detail & Related papers (2023-07-20T07:04:16Z)
- HAUSER: Towards Holistic and Automatic Evaluation of Simile Generation [18.049566239050762]
Proper evaluation metrics are like a beacon guiding the research of simile generation (SG).
To address the issues, we establish HAUSER, a holistic and automatic evaluation system for the SG task, which consists of five criteria from three perspectives and automatic metrics for each criterion.
Our metrics are significantly more correlated with human ratings from each perspective compared with prior automatic metrics.
arXiv Detail & Related papers (2023-06-13T06:06:01Z)
- Recall, Robustness, and Lexicographic Evaluation [49.13362412522523]
The application of recall without a formal evaluative motivation has led to criticism of recall as a vague or inappropriate measure.
Our analysis is composed of three tenets: recall, robustness, and lexicographic evaluation.
arXiv Detail & Related papers (2023-02-22T13:39:54Z)
- The Glass Ceiling of Automatic Evaluation in Natural Language Generation [60.59732704936083]
We take a step back and analyze recent progress by comparing the body of existing automatic metrics and human metrics.
Our extensive statistical analysis reveals surprising findings: automatic metrics -- old and new -- are much more similar to each other than to humans.
arXiv Detail & Related papers (2022-08-31T01:13:46Z)
- Evaluating the Evaluation Metrics for Style Transfer: A Case Study in Multilingual Formality Transfer [11.259786293913606]
This work is the first multilingual evaluation of metrics in style transfer (ST).
We evaluate leading ST automatic metrics on the oft-researched task of formality style transfer.
We identify several models that correlate well with human judgments and are robust across languages.
arXiv Detail & Related papers (2021-10-20T17:21:09Z)
- On the interaction of automatic evaluation and task framing in headline style transfer [6.27489964982972]
In this paper, we propose an evaluation method for a task involving subtle textual differences, such as style transfer.
We show that it better reflects system differences than traditional metrics such as BLEU and ROUGE.
arXiv Detail & Related papers (2021-01-05T16:36:26Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.