Rethinking Scientific Summarization Evaluation: Grounding Explainable Metrics on Facet-aware Benchmark
- URL: http://arxiv.org/abs/2402.14359v2
- Date: Fri, 02 May 2025 05:08:48 GMT
- Title: Rethinking Scientific Summarization Evaluation: Grounding Explainable Metrics on Facet-aware Benchmark
- Authors: Xiuying Chen, Tairan Wang, Qingqing Zhu, Taicheng Guo, Shen Gao, Zhiyong Lu, Xin Gao, Xiangliang Zhang,
- Abstract summary: This paper presents conceptual and experimental analyses of scientific summarization. We introduce the Facet-aware Metric (FM), employing LLMs for advanced semantic matching to evaluate summaries. Our findings confirm that FM offers a more logical approach to evaluating scientific summaries.
- Score: 42.131133762827375
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The summarization capabilities of pretrained and large language models (LLMs) have been widely validated in general domains, but their use on scientific corpora, which involve complex sentences and specialized knowledge, has been less assessed. This paper presents conceptual and experimental analyses of scientific summarization, highlighting the inadequacies of traditional evaluation methods, such as $n$-gram overlap, embedding comparison, and QA, particularly in providing explanations, grasping scientific concepts, or identifying key content. Subsequently, we introduce the Facet-aware Metric (FM), which employs LLMs for advanced semantic matching to evaluate summaries based on different aspects. This facet-aware approach offers a thorough evaluation of abstracts by decomposing the evaluation task into simpler subtasks. Recognizing the absence of an evaluation benchmark in this domain, we curate a Facet-based scientific summarization Dataset (FD) with facet-level annotations. Our findings confirm that FM offers a more logical approach to evaluating scientific summaries. In addition, fine-tuned smaller models can compete with LLMs in scientific contexts, while LLMs have limitations in learning from in-context information in scientific domains. This suggests an area for future enhancement of LLMs.
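The following is a minimal sketch of how such a facet-aware, LLM-based judgment could be implemented. It illustrates the idea of decomposing evaluation into per-facet subtasks rather than reproducing the authors' FM code; the facet names, the prompt wording, the `ask_llm` callable, and the simple averaging are all assumptions for illustration.

```python
# Minimal sketch of facet-aware, LLM-based summary evaluation.
# NOTE: the facet set, prompt, and aggregation below are illustrative
# assumptions, not the paper's actual Facet-aware Metric implementation.
from typing import Callable

FACETS = ["background", "method", "result", "conclusion"]  # assumed facets

PROMPT = (
    "You are evaluating a scientific summary.\n"
    "Facet: {facet}\n"
    "Reference abstract:\n{reference}\n\n"
    "Candidate summary:\n{candidate}\n\n"
    "Does the candidate correctly cover the {facet} facet of the reference? "
    "Answer 'yes' or 'no', then give a one-sentence explanation."
)

def facet_aware_score(reference: str, candidate: str,
                      ask_llm: Callable[[str], str]) -> dict:
    """Judge each facet separately with an LLM, then aggregate.

    `ask_llm` is any function that sends a prompt to a chat model and
    returns its text reply (e.g., a thin wrapper around an API client).
    """
    verdicts = {}
    for facet in FACETS:
        reply = ask_llm(PROMPT.format(facet=facet, reference=reference,
                                      candidate=candidate))
        verdicts[facet] = reply.strip().lower().startswith("yes")
    verdicts["overall"] = sum(verdicts[f] for f in FACETS) / len(FACETS)
    return verdicts
```

Decomposing the judgment per facet is what makes such a score explainable: each facet verdict can be returned together with the model's one-sentence justification instead of a single opaque number.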
Related papers
- Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models [1.0138329337410974]
Large Language Models (LLMs) are trained on vast and diverse internet corpora that often include inaccurate or misleading content. This review systematically analyzes how LLM-generated content is evaluated for factual accuracy.
arXiv Detail & Related papers (2025-08-05T19:20:05Z) - Scientific Paper Retrieval with LLM-Guided Semantic-Based Ranking [32.40639079110799]
SemRank is an effective and efficient paper retrieval framework. It combines query understanding with a concept-based semantic index. Experiments show that SemRank consistently improves the performance of various base retrievers.
arXiv Detail & Related papers (2025-05-27T22:49:18Z) - SciCUEval: A Comprehensive Dataset for Evaluating Scientific Context Understanding in Large Language Models [35.839640555805374]
SciCUEval is a benchmark dataset tailored to assess the scientific context understanding capability of Large Language Models (LLMs). It comprises ten domain-specific sub-datasets spanning biology, chemistry, physics, biomedicine, and materials science, integrating diverse data modalities including structured tables, knowledge graphs, and unstructured texts. It systematically evaluates four core competencies: Relevant information identification, Information-absence detection, Multi-source information integration, and Context-aware inference, through a variety of question formats.
arXiv Detail & Related papers (2025-05-21T04:33:26Z) - Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z) - SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models [35.98892300665275]
We introduce the SciKnowEval benchmark, a framework that evaluates large language models (LLMs) across five progressive levels of scientific knowledge.
These levels aim to assess the breadth and depth of scientific knowledge in LLMs, including memory, comprehension, reasoning, discernment, and application.
We benchmark 26 advanced open-source and proprietary LLMs using zero-shot and few-shot prompting strategies.
arXiv Detail & Related papers (2024-06-13T13:27:52Z) - Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
How reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z) - Evaluating Large Language Models for Structured Science Summarization in the Open Research Knowledge Graph [18.41743815836192]
We propose using Large Language Models (LLMs) to automatically suggest properties for structured science summaries.
Our study performs a comprehensive comparative analysis between ORKG's manually curated properties and those generated by state-of-the-art LLMs.
Overall, LLMs show potential as recommendation systems for structuring science, but further finetuning is recommended to improve their alignment with scientific tasks and mimicry of human expertise.
arXiv Detail & Related papers (2024-05-03T14:03:04Z) - FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction [85.26780391682894]
We propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE)
FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary (a minimal sketch of this claim-checking idea appears after the list below).
Our metric sets a new state of the art on AGGREFACT, the de-facto benchmark for factuality evaluation.
arXiv Detail & Related papers (2024-03-04T17:57:18Z) - SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis [26.111514038691837]
SciAssess is a benchmark for the comprehensive evaluation of Large Language Models (LLMs) in scientific literature analysis.
It aims to thoroughly assess the efficacy of LLMs by evaluating their capabilities in Memorization (L1), Comprehension (L2), and Analysis & Reasoning (L3).
It encompasses a variety of tasks drawn from diverse scientific fields, including biology, chemistry, materials science, and medicine.
arXiv Detail & Related papers (2024-03-04T12:19:28Z) - F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods [102.98899881389211]
We propose F-Eval, a bilingual evaluation benchmark to evaluate the fundamental abilities, including expression, commonsense and logic.
For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z) - Leveraging Large Language Models for NLG Evaluation: Advances and Challenges [57.88520765782177]
Large Language Models (LLMs) have opened new avenues for assessing generated content quality, e.g., coherence, creativity, and context relevance.
We propose a coherent taxonomy for organizing existing LLM-based evaluation metrics, offering a structured framework to understand and compare these methods.
By discussing unresolved challenges, including bias, robustness, domain-specificity, and unified evaluation, this paper seeks to offer insights to researchers and advocate for fairer and more advanced NLG evaluation techniques.
arXiv Detail & Related papers (2024-01-13T15:59:09Z) - SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research [11.816426823341134]
We propose SciEval, a comprehensive and multi-disciplinary evaluation benchmark to address these issues.
Based on Bloom's taxonomy, SciEval covers four dimensions to systematically evaluate scientific research ability.
Both objective and subjective questions are included in SciEval.
arXiv Detail & Related papers (2023-08-25T03:05:33Z)
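As a companion to the FENICE entry above, here is a minimal sketch of the NLI-based claim-checking step it describes. This is not the official FENICE implementation: the checkpoint name, the entailment label index, and the simple averaging over claims are assumptions for illustration, and the claims are assumed to have been extracted from the summary beforehand.

```python
# Minimal sketch of NLI-based claim verification for summary factuality,
# in the spirit of FENICE (not its official implementation).
# Assumptions: claims are already extracted from the summary, and the chosen
# checkpoint maps class index 2 to "entailment" (check model.config.id2label).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "microsoft/deberta-large-mnli"  # any MNLI-style checkpoint works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def claim_supported(source_text: str, claim: str, entailment_idx: int = 2) -> float:
    """Return the probability that `source_text` entails `claim`.

    Long documents are simply truncated here; FENICE itself aligns each
    claim against relevant source-document content rather than truncating.
    """
    inputs = tokenizer(source_text, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, entailment_idx].item()

def factuality_score(source_text: str, claims: list[str]) -> float:
    """Average entailment probability over all claims from a summary."""
    if not claims:
        return 0.0
    return sum(claim_supported(source_text, c) for c in claims) / len(claims)
```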
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.