On the interaction of automatic evaluation and task framing in headline style transfer
- URL: http://arxiv.org/abs/2101.01634v1
- Date: Tue, 5 Jan 2021 16:36:26 GMT
- Title: On the interaction of automatic evaluation and task framing in headline style transfer
- Authors: Lorenzo De Mattei, Michele Cafagna, Huiyuan Lai, Felice Dell'Orletta, Malvina Nissim, Albert Gatt
- Abstract summary: In this paper, we propose an evaluation method for a task involving subtle textual differences, such as style transfer.
We show that it better reflects system differences than traditional metrics such as BLEU and ROUGE.
- Score: 6.27489964982972
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: An ongoing debate in the NLG community concerns the best way to evaluate
systems, with human evaluation often being considered the most reliable method,
compared to corpus-based metrics. However, tasks involving subtle textual
differences, such as style transfer, tend to be hard for humans to perform. In
this paper, we propose an evaluation method for this task based on
purposely-trained classifiers, showing that it better reflects system
differences than traditional metrics such as BLEU and ROUGE.
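The classifier-based idea can be illustrated with a short sketch. This is not the authors' implementation: the TF-IDF plus logistic-regression classifier, the sacrebleu dependency, and all names below are illustrative assumptions. The idea is to train a binary style classifier on headlines from the source and target styles, score a system by the fraction of its outputs the classifier assigns to the target style, and compute corpus BLEU against gold headlines for comparison.

```python
# Minimal sketch of classifier-based style-transfer evaluation (illustrative only;
# the paper trains purpose-built classifiers, approximated here with TF-IDF + logistic regression).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
import sacrebleu  # assumed available, used only for the BLEU comparison


def train_style_classifier(source_headlines, target_headlines):
    """Fit a binary classifier: label 0 = source style, label 1 = target style."""
    texts = list(source_headlines) + list(target_headlines)
    labels = [0] * len(source_headlines) + [1] * len(target_headlines)
    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(texts, labels)
    return clf


def style_transfer_score(clf, system_outputs):
    """Fraction of system outputs that the classifier labels as target style."""
    preds = clf.predict(system_outputs)
    return sum(int(p == 1) for p in preds) / len(system_outputs)


def bleu_score(system_outputs, references):
    """Corpus BLEU against gold target-style headlines, for comparison."""
    return sacrebleu.corpus_bleu(system_outputs, [references]).score


# Hypothetical usage: two systems can score similarly on BLEU while the
# classifier-based score still separates them.
# clf = train_style_classifier(source_style_headlines, target_style_headlines)
# print(style_transfer_score(clf, system_outputs), bleu_score(system_outputs, gold_targets))
```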
Related papers
- CovScore: Evaluation of Multi-Document Abstractive Title Set Generation [16.516381474175986]
CovScore is an automatic reference-less methodology for evaluating thematic title sets.
We propose a novel methodology that decomposes quality into five main metrics along different aspects of evaluation.
arXiv Detail & Related papers (2024-07-24T16:14:15Z)
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z)
- How Reliable Are Automatic Evaluation Methods for Instruction-Tuned LLMs? [3.1706553206969925]
We perform a meta-evaluation of such methods and assess their reliability across a broad range of tasks.
We observe that while automatic evaluation methods can approximate human ratings under specific conditions, their validity is highly context-dependent.
Our findings enhance the understanding of how automatic methods should be applied and interpreted when developing and evaluating instruction-tuned LLMs.
arXiv Detail & Related papers (2024-02-16T15:48:33Z)
- Better Understanding Differences in Attribution Methods via Systematic Evaluations [57.35035463793008]
Post-hoc attribution methods have been proposed to identify image regions most influential to the models' decisions.
We propose three novel evaluation schemes to more reliably measure the faithfulness of those methods.
We use these evaluation schemes to study the strengths and shortcomings of some widely used attribution methods over a wide range of models.
arXiv Detail & Related papers (2023-03-21T14:24:58Z)
- Towards the evaluation of simultaneous speech translation from a communicative perspective [0.0]
We present the results of an experiment aimed at evaluating the quality of a simultaneous speech translation engine.
We found better performance for the human interpreters in terms of intelligibility, while the machine performed slightly better in terms of informativeness.
arXiv Detail & Related papers (2021-03-15T13:09:00Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
- On Learning Text Style Transfer with Direct Rewards [101.97136885111037]
The lack of parallel corpora makes it impossible to directly train supervised models for the text style transfer task.
We leverage semantic similarity metrics originally used for fine-tuning neural machine translation models.
Our model provides significant gains in both automatic and human evaluation over strong baselines.
arXiv Detail & Related papers (2020-10-24T04:30:02Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
- PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems [48.99561874529323]
There are three kinds of automatic methods for evaluating open-domain generative dialogue systems.
Due to the lack of systematic comparison, it is not clear which kind of metric is more effective.
We propose a novel and feasible learning-based metric that can significantly improve the correlation with human judgments.
arXiv Detail & Related papers (2020-04-06T04:36:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.