Rethinking Scientific Summarization Evaluation: Grounding Explainable
Metrics on Facet-aware Benchmark
- URL: http://arxiv.org/abs/2402.14359v1
- Date: Thu, 22 Feb 2024 07:58:29 GMT
- Title: Rethinking Scientific Summarization Evaluation: Grounding Explainable
Metrics on Facet-aware Benchmark
- Authors: Xiuying Chen, Tairan Wang, Qingqing Zhu, Taicheng Guo, Shen Gao,
Zhiyong Lu, Xin Gao, Xiangliang Zhang
- Abstract summary: This paper presents conceptual and experimental analyses of scientific summarization.
We introduce the Facet-aware Metric (FM), employing LLMs for advanced semantic matching to evaluate summaries.
Our findings confirm that FM offers a more logical approach to evaluating scientific summaries.
- Score: 43.94573037950725
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The summarization capabilities of pretrained and large language models (LLMs)
have been widely validated in general areas, but their use on scientific
corpora, which involve complex sentences and specialized knowledge, has been
less assessed. This paper presents conceptual and experimental analyses of
scientific summarization, highlighting the inadequacies of traditional
evaluation methods, such as $n$-gram, embedding comparison, and QA,
particularly in providing explanations, grasping scientific concepts, or
identifying key content. Subsequently, we introduce the Facet-aware Metric
(FM), employing LLMs for advanced semantic matching to evaluate summaries based
on different aspects. This facet-aware approach offers a thorough evaluation of
abstracts by decomposing the evaluation task into simpler subtasks. Recognizing
the absence of an evaluation benchmark in this domain, we curate a Facet-based
scientific summarization Dataset (FD) with facet-level annotations. Our
findings confirm that FM offers a more logical approach to evaluating
scientific summaries. In addition, fine-tuned smaller models can compete with
LLMs in scientific contexts, while LLMs have limitations in learning from
in-context information in scientific domains. This suggests an area for future
enhancement of LLMs.
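To make the facet-aware idea concrete, here is a minimal sketch of how an LLM could be prompted to judge a summary facet by facet and then aggregate the verdicts into a score. The facet names, prompt wording, and `llm` callable are illustrative assumptions, not the paper's exact FM implementation.

```python
from typing import Callable, Dict

# Illustrative facet set; the paper's FD benchmark defines its own facet annotations.
FACETS = ["purpose", "method", "findings", "conclusion"]

PROMPT = (
    "Reference abstract:\n{reference}\n\n"
    "Generated summary:\n{summary}\n\n"
    "Does the generated summary correctly cover the {facet} of the "
    "reference abstract? Answer 'yes' or 'no'."
)


def facet_aware_score(reference: str, summary: str,
                      llm: Callable[[str], str]) -> Dict[str, float]:
    """Judge the summary facet by facet, then average the per-facet verdicts."""
    scores: Dict[str, float] = {}
    for facet in FACETS:
        answer = llm(PROMPT.format(reference=reference, summary=summary, facet=facet))
        scores[facet] = 1.0 if answer.strip().lower().startswith("yes") else 0.0
    scores["overall"] = sum(scores[f] for f in FACETS) / len(FACETS)
    return scores


if __name__ == "__main__":
    dummy_llm = lambda prompt: "yes"  # stand-in LLM so the sketch runs end to end
    print(facet_aware_score("reference abstract ...", "candidate summary ...", dummy_llm))
```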
Related papers
- Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
- SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models [35.98892300665275]
We introduce the SciKnowEval benchmark, a framework that evaluates large language models (LLMs) across five progressive levels of scientific knowledge.
These levels aim to assess the breadth and depth of scientific knowledge in LLMs, including memory, comprehension, reasoning, discernment, and application.
We benchmark 26 advanced open-source and proprietary LLMs using zero-shot and few-shot prompting strategies.
arXiv Detail & Related papers (2024-06-13T13:27:52Z)
- Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
How reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- Evaluating Large Language Models for Structured Science Summarization in the Open Research Knowledge Graph [18.41743815836192]
We propose using Large Language Models (LLMs) to automatically suggest properties for structured science summaries.
Our study performs a comprehensive comparative analysis between ORKG's manually curated properties and those generated by the aforementioned state-of-the-art LLMs.
Overall, LLMs show potential as recommendation systems for structuring science, but further finetuning is recommended to improve their alignment with scientific tasks and mimicry of human expertise.
arXiv Detail & Related papers (2024-05-03T14:03:04Z)
- FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction [85.26780391682894]
We propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE).
FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary.
Our metric sets a new state of the art on AGGREFACT, the de facto benchmark for factuality evaluation.
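As a rough illustration of the claim-extraction-plus-NLI recipe this entry describes, the sketch below averages NLI entailment scores of a summary's atomic claims against the source document. The `extract_claims` and `nli_entails` callables (and the naive stand-ins in the usage example) are placeholders under my assumptions, not FENICE's actual components.

```python
from typing import Callable, List


def fenice_style_score(source: str, summary: str,
                       extract_claims: Callable[[str], List[str]],
                       nli_entails: Callable[[str, str], float]) -> float:
    """Average entailment score of the summary's atomic claims against the source."""
    claims = extract_claims(summary)
    if not claims:
        return 0.0
    return sum(nli_entails(source, claim) for claim in claims) / len(claims)


if __name__ == "__main__":
    # Naive stand-ins so the sketch runs: sentence-split "claims", word-overlap "NLI".
    naive_claims = lambda text: [s.strip() for s in text.split(".") if s.strip()]
    naive_nli = lambda premise, hyp: (
        len(set(hyp.lower().split()) & set(premise.lower().split())) / max(len(hyp.split()), 1)
    )
    print(fenice_style_score("The model improves accuracy on benchmark X.",
                             "Accuracy improves on benchmark X.",
                             naive_claims, naive_nli))
```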
arXiv Detail & Related papers (2024-03-04T17:57:18Z)
- SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis [26.111514038691837]
SciAssess is a benchmark for the comprehensive evaluation of Large Language Models (LLMs) in scientific literature analysis.
It aims to thoroughly assess the efficacy of LLMs by evaluating their capabilities in Memorization (L1), Comprehension (L2), and Analysis & Reasoning (L3).
It encompasses a variety of tasks drawn from diverse scientific fields, including biology, chemistry, materials science, and medicine.
arXiv Detail & Related papers (2024-03-04T12:19:28Z)
- F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods [102.98899881389211]
We propose F-Eval, a bilingual evaluation benchmark for assessing fundamental abilities, including expression, commonsense, and logic.
For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z)
- Leveraging Large Language Models for NLG Evaluation: Advances and Challenges [57.88520765782177]
Large Language Models (LLMs) have opened new avenues for assessing generated content quality, e.g., coherence, creativity, and context relevance.
We propose a coherent taxonomy for organizing existing LLM-based evaluation metrics, offering a structured framework to understand and compare these methods.
By discussing unresolved challenges, including bias, robustness, domain-specificity, and unified evaluation, this paper seeks to offer insights to researchers and advocate for fairer and more advanced NLG evaluation techniques.
arXiv Detail & Related papers (2024-01-13T15:59:09Z)
- SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research [11.816426823341134]
We propose SciEval, a comprehensive and multi-disciplinary evaluation benchmark to address these issues.
Based on Bloom's taxonomy, SciEval covers four dimensions to systematically evaluate scientific research ability.
Both objective and subjective questions are included in SciEval.
arXiv Detail & Related papers (2023-08-25T03:05:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.