KPEval: Towards Fine-Grained Semantic-Based Keyphrase Evaluation
- URL: http://arxiv.org/abs/2303.15422v4
- Date: Tue, 4 Jun 2024 10:00:56 GMT
- Title: KPEval: Towards Fine-Grained Semantic-Based Keyphrase Evaluation
- Authors: Di Wu, Da Yin, Kai-Wei Chang
- Abstract summary: We propose KPEval, a comprehensive evaluation framework consisting of four critical aspects: reference agreement, faithfulness, diversity, and utility.
Using KPEval, we re-evaluate 23 keyphrase systems and discover that established model comparison results have blind spots.
- Score: 69.57018875757622
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the significant advancements in keyphrase extraction and keyphrase generation methods, the predominant approach for evaluation mainly relies on exact matching with human references. This scheme fails to recognize systems that generate keyphrases semantically equivalent to the references or diverse keyphrases that carry practical utility. To better assess the capability of keyphrase systems, we propose KPEval, a comprehensive evaluation framework consisting of four critical aspects: reference agreement, faithfulness, diversity, and utility. For each aspect, we design semantic-based metrics to reflect the evaluation objectives. Meta-evaluation studies demonstrate that our evaluation strategy correlates better with human preferences compared to a range of previously proposed metrics. Using KPEval, we re-evaluate 23 keyphrase systems and discover that (1) established model comparison results have blind spots, especially when considering reference-free evaluation; (2) large language models are underestimated by prior evaluation works; and (3) there is no single best model that can excel in all the aspects.
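For intuition, here is a minimal sketch of the exact-match vs. semantic-match contrast the abstract describes. The `all-MiniLM-L6-v2` encoder, the 0.6 similarity threshold, and the greedy one-to-one alignment are illustrative assumptions, not KPEval's actual metric definitions.

```python
# A minimal sketch (not KPEval's implementation): exact-match F1 vs. an
# embedding-based soft-match F1 over predicted and reference keyphrases.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed off-the-shelf encoder

def exact_f1(predicted, references):
    tp = len(set(predicted) & set(references))
    if tp == 0:
        return 0.0
    p, r = tp / len(predicted), tp / len(references)
    return 2 * p * r / (p + r)

def soft_f1(predicted, references, threshold=0.6):
    # Cosine similarities between every predicted/reference phrase pair.
    sims = util.cos_sim(model.encode(predicted), model.encode(references))
    matched, tp = set(), 0
    for i in range(len(predicted)):
        j = int(sims[i].argmax())  # greedy: each prediction claims one reference
        if float(sims[i][j]) >= threshold and j not in matched:
            matched.add(j)
            tp += 1
    if tp == 0:
        return 0.0
    p, r = tp / len(predicted), tp / len(references)
    return 2 * p * r / (p + r)

preds = ["neural keyphrase generation", "sequence-to-sequence model"]
refs = ["keyphrase generation with neural networks", "seq2seq models"]
print(exact_f1(preds, refs))  # 0.0: no string-identical pairs
print(soft_f1(preds, refs))   # high: the pairs are semantically equivalent
```

On this toy pair, exact matching scores zero while the soft match credits the semantically equivalent phrases, which is precisely the blind spot the paper targets.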
Related papers
- MetaKP: On-Demand Keyphrase Generation [52.48698290354449]
We introduce on-demand keyphrase generation, a novel paradigm that requires generating keyphrases that conform to specific high-level goals or intents.
We present MetaKP, a large-scale benchmark comprising four datasets, 7500 documents, and 3760 goals across news and biomedical domains with human-annotated keyphrases.
We demonstrate the potential of our method to serve as a general NLP infrastructure, exemplified by its application in epidemic event detection from social media.
arXiv Detail & Related papers (2024-06-28T19:02:59Z)
- F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods [102.98899881389211]
We propose F-Eval, a bilingual evaluation benchmark that assesses fundamental abilities, including expression, commonsense, and logic.
For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z)
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the granularity of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
- From Key Points to Key Point Hierarchy: Structured and Expressive Opinion Summarization [9.567280503274226]
Key Point Analysis (KPA) has recently been proposed for deriving fine-grained insights from collections of textual comments.
We introduce the task of organizing a given set of key points into a hierarchy, according to their specificity.
We develop ThinkP, a high-quality benchmark dataset of key point hierarchies for business and product reviews.
arXiv Detail & Related papers (2023-06-06T16:45:44Z)
- Do You Hear The People Sing? Key Point Analysis via Iterative Clustering and Abstractive Summarisation [12.548947151123555]
Argument summarisation is a promising but currently under-explored field.
One of the main challenges in Key Point Analysis is finding high-quality key point candidates.
Evaluating key points is crucial to ensuring that the automatically generated summaries are useful.
arXiv Detail & Related papers (2023-05-25T12:43:29Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation [63.46331073232526]
We present an enhancement approach to Reference-based EvAluation Metrics for open-domain dialogue systems.
A prediction model is designed to estimate the reliability of the given reference set.
We show how its predictions can help augment the reference set and thus improve the reliability of the metric.
arXiv Detail & Related papers (2021-05-30T10:04:13Z)
- Keyphrase Generation with Fine-Grained Evaluation-Guided Reinforcement Learning [30.09715149060206]
Keyphrase Generation (KG) is a classical task for capturing the central idea from a given document.
In this paper, we propose a new fine-grained evaluation metric that considers multiple levels of granularity.
To capture more latent linguistic patterns, we use a pre-trained model (e.g., BERT) to compute a continuous similarity score between predicted and target keyphrases (a minimal sketch follows below).
arXiv Detail & Related papers (2021-04-18T10:13:46Z)
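The continuous-similarity idea above can be pictured with a short, hedged sketch: mean-pool BERT token embeddings for each phrase and take the cosine between the two vectors. The `bert-base-uncased` checkpoint and the mean pooling are assumptions for illustration, not the paper's exact setup.

```python
# Hedged sketch: score a predicted keyphrase against a target with BERT
# embeddings instead of exact string match. Checkpoint and pooling are
# illustrative assumptions, not the paper's exact configuration.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

@torch.no_grad()
def phrase_embedding(phrase: str) -> torch.Tensor:
    inputs = tokenizer(phrase, return_tensors="pt")
    hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)          # mean-pooled phrase vector

def similarity_score(predicted: str, target: str) -> float:
    # Continuous score in [-1, 1]; semantically close phrases score high
    # even when they are not string-identical.
    return torch.cosine_similarity(
        phrase_embedding(predicted), phrase_embedding(target), dim=0
    ).item()

print(similarity_score("topic modeling", "topic models"))  # high despite no exact match
```

A smooth score of this kind is what makes the metric usable as a reinforcement-learning reward, since it provides gradient signal for near-miss predictions that exact matching would score as zero.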