Revisiting Automatic Question Summarization Evaluation in the Biomedical Domain
- URL: http://arxiv.org/abs/2303.10328v1
- Date: Sat, 18 Mar 2023 04:28:01 GMT
- Title: Revisiting Automatic Question Summarization Evaluation in the Biomedical Domain
- Authors: Hongyi Yuan, Yaoyun Zhang, Fei Huang, Songfang Huang
- Abstract summary: We conduct human evaluations of summarization quality from four different aspects of a biomedical question summarization task.
Based on human judgments, we identify different noteworthy features for current automatic metrics and summarization systems.
- Score: 45.78632945525459
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic evaluation metrics have been facilitating the rapid development of
automatic summarization methods by providing instant and fair assessments of
the quality of summaries. Most metrics have been developed for the general
domain, especially news and meeting notes, or other language-generation tasks.
However, these metrics are applied to evaluate summarization systems in
different domains, such as biomedical question summarization. To better
understand whether commonly used evaluation metrics are capable of evaluating
automatic summarization in the biomedical domain, we conduct human evaluations
of summarization quality from four different aspects of a biomedical question
summarization task. Based on human judgments, we identify noteworthy
characteristics of both current automatic metrics and summarization systems. We
also release a dataset of our human annotations to aid the research of
summarization evaluation metrics in the biomedical domain.
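
The core of such a meta-evaluation is measuring how well automatic metric scores agree with human judgments, both per summary and at the system level. The following is a minimal sketch of that setup, not the authors' code: it assumes metric scores and human ratings have already been collected, and all names and numbers are hypothetical.

    # Hypothetical sketch of a metric meta-evaluation: correlate automatic
    # metric scores with human quality judgments (summary level) and compare
    # system rankings (system level). Not the paper's code; data is made up.
    from scipy.stats import spearmanr, kendalltau

    # Summary-level: one automatic score and one human rating per summary.
    rouge_scores = [0.42, 0.31, 0.55, 0.48, 0.29]   # e.g. ROUGE-L F1 per summary
    human_ratings = [4, 2, 5, 4, 3]                  # e.g. 1-5 quality judgments

    rho, rho_p = spearmanr(rouge_scores, human_ratings)
    print(f"summary-level Spearman rho = {rho:.3f} (p = {rho_p:.3f})")

    # System-level: average scores per summarization system, then compare
    # the rankings induced by the metric and by the human annotators.
    metric_by_system = {"sys_a": 0.44, "sys_b": 0.39, "sys_c": 0.51}
    human_by_system  = {"sys_a": 3.6,  "sys_b": 3.9,  "sys_c": 4.2}

    systems = sorted(metric_by_system)
    tau, tau_p = kendalltau([metric_by_system[s] for s in systems],
                            [human_by_system[s] for s in systems])
    print(f"system-level Kendall tau = {tau:.3f} (p = {tau_p:.3f})")

A low or negative system-level correlation is the kind of evidence the papers below cite when arguing that automatic metrics can mislead system comparisons.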
Related papers
- A Critical Look at Meta-evaluating Summarisation Evaluation Metrics [11.541368732416506]
We argue that the time is ripe to build more diverse benchmarks that enable the development of more robust evaluation metrics.
We call for research focusing on user-centric quality dimensions that consider the generated summary's communicative goal.
arXiv Detail & Related papers (2024-09-29T01:30:13Z)
- OpinSummEval: Revisiting Automated Evaluation for Opinion Summarization [52.720711541731205]
We present OpinSummEval, a dataset comprising human judgments and outputs from 14 opinion summarization models.
Our findings indicate that metrics based on neural networks generally outperform non-neural ones.
arXiv Detail & Related papers (2023-10-27T13:09:54Z)
- An Investigation of Evaluation Metrics for Automated Medical Note Generation [5.094623170336122]
We study evaluation methods and metrics for the automatic generation of clinical notes from medical conversations.
To study the correlation between the automatic metrics and manual judgments, we evaluate automatic notes/summaries by comparing the system and reference facts.
arXiv Detail & Related papers (2023-05-27T04:34:58Z)
- Automated Metrics for Medical Multi-Document Summarization Disagree with Human Evaluations [22.563596069176047]
We analyze how automated summarization evaluation metrics correlate with lexical features of generated summaries.
We find that not only do automated metrics fail to capture aspects of quality as assessed by humans, but in many cases the system rankings produced by these metrics are anti-correlated with rankings according to human annotators.
arXiv Detail & Related papers (2023-05-23T05:00:59Z)
- Consultation Checklists: Standardising the Human Evaluation of Medical Note Generation [58.54483567073125]
We propose a protocol that aims to increase objectivity by grounding evaluations in Consultation Checklists.
We observed good levels of inter-annotator agreement in a first evaluation study using the protocol.
arXiv Detail & Related papers (2022-11-17T10:54:28Z)
- The Glass Ceiling of Automatic Evaluation in Natural Language Generation [60.59732704936083]
We take a step back and analyze recent progress by comparing the body of existing automatic metrics and human metrics.
Our extensive statistical analysis reveals surprising findings: automatic metrics, old and new, are much more similar to each other than to humans.
arXiv Detail & Related papers (2022-08-31T01:13:46Z)
- Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation [56.25869366777579]
In recent years, machine learning models have rapidly become better at generating clinical consultation notes.
We present an extensive human evaluation study where 5 clinicians listen to 57 mock consultations, write their own notes, post-edit a number of automatically generated notes, and extract all the errors.
We find that a simple, character-based Levenshtein distance metric performs on par with, if not better than, common model-based metrics like BertScore (a minimal sketch of such a character-level metric follows this list).
arXiv Detail & Related papers (2022-04-01T14:04:16Z)
- Automated Lay Language Summarization of Biomedical Scientific Reviews [16.01452242066412]
Health literacy has emerged as a crucial factor in making appropriate health decisions and ensuring treatment outcomes.
Medical jargon and the complex structure of professional language in this domain make health information especially hard to interpret.
This paper introduces the novel task of automated generation of lay language summaries of biomedical scientific reviews.
arXiv Detail & Related papers (2020-12-23T10:01:18Z)
- Re-evaluating Evaluation in Text Summarization [77.4601291738445]
We re-evaluate the evaluation method for text summarization using top-scoring system outputs.
We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems.
arXiv Detail & Related papers (2020-10-14T13:58:53Z)
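
The consultation-note study above reports that a character-based Levenshtein metric can rival model-based metrics such as BertScore. As a point of reference, here is a minimal, self-contained sketch of such a character-level similarity; it is not that study's implementation, and the normalization of the edit distance into a 0-1 similarity is one common convention rather than their exact formulation.

    # Hypothetical sketch of a character-level Levenshtein similarity between
    # a generated note/summary and a reference. Not the study's code; the
    # 1 - distance / max_length normalization is one common convention.
    def levenshtein_distance(a: str, b: str) -> int:
        """Classic dynamic-programming edit distance over characters."""
        if len(a) < len(b):
            a, b = b, a
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            current = [i]
            for j, cb in enumerate(b, start=1):
                insert_cost = current[j - 1] + 1
                delete_cost = previous[j] + 1
                substitute_cost = previous[j - 1] + (ca != cb)
                current.append(min(insert_cost, delete_cost, substitute_cost))
            previous = current
        return previous[-1]

    def levenshtein_similarity(candidate: str, reference: str) -> float:
        """Normalize edit distance into a 0-1 similarity (1.0 = identical)."""
        if not candidate and not reference:
            return 1.0
        distance = levenshtein_distance(candidate, reference)
        return 1.0 - distance / max(len(candidate), len(reference))

    # Toy usage: higher is better.
    print(levenshtein_similarity("patient reports mild headache",
                                 "patient reports a mild headache"))

Scores from such a character-level metric can be fed into the same correlation analysis sketched after the abstract to compare it against model-based metrics.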