HAUSER: Towards Holistic and Automatic Evaluation of Simile Generation
- URL: http://arxiv.org/abs/2306.07554v1
- Date: Tue, 13 Jun 2023 06:06:01 GMT
- Title: HAUSER: Towards Holistic and Automatic Evaluation of Simile Generation
- Authors: Qianyu He, Yikai Zhang, Jiaqing Liang, Yuncheng Huang, Yanghua Xiao,
Yunwen Chen
- Abstract summary: Proper evaluation metrics are like a beacon guiding the research of simile generation (SG).
To address the issues, we establish HAUSER, a holistic and automatic evaluation system for the SG task, which consists of five criteria from three perspectives and automatic metrics for each criterion.
Our metrics are significantly more correlated with human ratings from each perspective compared with prior automatic metrics.
- Score: 18.049566239050762
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Similes play an imperative role in creative writing such as story and
dialogue generation. Proper evaluation metrics are like a beacon guiding the
research of simile generation (SG). However, it remains under-explored as to
what criteria should be considered, how to quantify each criterion into
metrics, and whether the metrics are effective for comprehensive, efficient,
and reliable SG evaluation. To address the issues, we establish HAUSER, a
holistic and automatic evaluation system for the SG task, which consists of
five criteria from three perspectives and automatic metrics for each criterion.
Through extensive experiments, we verify that our metrics are significantly
more correlated with human ratings from each perspective compared with prior
automatic metrics.
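The meta-evaluation step described here, checking how strongly an automatic metric agrees with human ratings, is typically done with Pearson or Spearman correlation over paired scores. Below is a minimal sketch of that computation, assuming SciPy is available; the function name and the example scores are illustrative placeholders, not HAUSER's actual code or data.

```python
# Minimal sketch of metric-human correlation analysis (the standard
# meta-evaluation step for automatic metrics). The scores below are
# illustrative placeholders, not data from the HAUSER paper.
from scipy.stats import pearsonr, spearmanr

def metric_human_correlation(metric_scores, human_ratings):
    """Correlate an automatic metric's scores with human ratings
    for the same set of generated similes."""
    pearson_r, pearson_p = pearsonr(metric_scores, human_ratings)
    spearman_r, spearman_p = spearmanr(metric_scores, human_ratings)
    return {
        "pearson": (pearson_r, pearson_p),
        "spearman": (spearman_r, spearman_p),
    }

if __name__ == "__main__":
    # Hypothetical scores for five generated similes.
    metric_scores = [0.82, 0.45, 0.91, 0.30, 0.67]
    human_ratings = [4.5, 2.0, 5.0, 1.5, 3.5]
    print(metric_human_correlation(metric_scores, human_ratings))
```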
Related papers
- What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation [57.550045763103334]
Evaluating a story can be more challenging than other generation evaluation tasks.
We first summarize existing storytelling tasks, including text-to-text, visual-to-text, and text-to-visual.
We propose a taxonomy to organize evaluation metrics that have been developed or can be adopted for story evaluation.
arXiv Detail & Related papers (2024-08-26T20:35:42Z)
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z)
- OpinSummEval: Revisiting Automated Evaluation for Opinion Summarization [52.720711541731205]
We present OpinSummEval, a dataset comprising human judgments and outputs from 14 opinion summarization models.
Our findings indicate that metrics based on neural networks generally outperform non-neural ones.
arXiv Detail & Related papers (2023-10-27T13:09:54Z)
- Automated Metrics for Medical Multi-Document Summarization Disagree with Human Evaluations [22.563596069176047]
We analyze how automated summarization evaluation metrics correlate with lexical features of generated summaries.
We find that not only do automated metrics fail to capture aspects of quality as assessed by humans, but in many cases the system rankings produced by these metrics are anti-correlated with rankings according to human annotators.
arXiv Detail & Related papers (2023-05-23T05:00:59Z)
- NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist [20.448405494617397]
Task-agnostic metrics, such as Perplexity, BLEU, and BERTScore, are cost-effective and highly adaptable to diverse NLG tasks.
Human-aligned metrics (CTC, CtrlEval, UniEval) improve the correlation level by incorporating desirable human-like qualities as training objectives.
We show that automatic metrics provide better guidance than humans in discriminating system-level performance on Text Summarization and Controlled Generation tasks.
arXiv Detail & Related papers (2023-05-15T11:51:55Z)
- The Glass Ceiling of Automatic Evaluation in Natural Language Generation [60.59732704936083]
We take a step back and analyze recent progress by comparing the body of existing automatic metrics and human metrics.
Our extensive statistical analysis reveals surprising findings: automatic metrics -- old and new -- are much more similar to each other than to humans.
arXiv Detail & Related papers (2022-08-31T01:13:46Z)
- Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation [9.299255585127158]
There is no consensus on which human evaluation criteria to use, and no analysis of how well automatic metrics correlate with them.
HANNA allows us to quantitatively evaluate the correlations of 72 automatic metrics with human criteria.
arXiv Detail & Related papers (2022-08-24T16:35:32Z)
- Deconstruct to Reconstruct a Configurable Evaluation Metric for Open-Domain Dialogue Systems [36.73648357051916]
In open-domain dialogue, the overall quality comprises various aspects, such as relevancy, specificity, and empathy.
Existing metrics are not designed to cope with such flexibility.
We propose a simple method to combine the metrics for each aspect into a single metric called USL-H.
arXiv Detail & Related papers (2020-11-01T11:34:50Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary [65.37544133256499]
We propose a metric to evaluate the content quality of a summary using question-answering (QA).
We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
arXiv Detail & Related papers (2020-10-01T15:33:09Z)