Rethinking Scientific Summarization Evaluation: Grounding Explainable
Metrics on Facet-aware Benchmark
- URL: http://arxiv.org/abs/2402.14359v1
- Date: Thu, 22 Feb 2024 07:58:29 GMT
- Title: Rethinking Scientific Summarization Evaluation: Grounding Explainable
Metrics on Facet-aware Benchmark
- Authors: Xiuying Chen, Tairan Wang, Qingqing Zhu, Taicheng Guo, Shen Gao,
Zhiyong Lu, Xin Gao, Xiangliang Zhang
- Abstract summary: This paper presents conceptual and experimental analyses of scientific summarization.
We introduce the Facet-aware Metric (FM), employing LLMs for advanced semantic matching to evaluate summaries.
Our findings confirm that FM offers a more logical approach to evaluating scientific summaries.
- Score: 43.94573037950725
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The summarization capabilities of pretrained and large language models (LLMs)
have been widely validated in general areas, but their use in scientific
corpus, which involves complex sentences and specialized knowledge, has been
less assessed. This paper presents conceptual and experimental analyses of
scientific summarization, highlighting the inadequacies of traditional
evaluation methods, such as $n$-gram, embedding comparison, and QA,
particularly in providing explanations, grasping scientific concepts, or
identifying key content. Subsequently, we introduce the Facet-aware Metric
(FM), employing LLMs for advanced semantic matching to evaluate summaries based
on different aspects. This facet-aware approach offers a thorough evaluation of
abstracts by decomposing the evaluation task into simpler subtasks.Recognizing
the absence of an evaluation benchmark in this domain, we curate a Facet-based
scientific summarization Dataset (FD) with facet-level annotations. Our
findings confirm that FM offers a more logical approach to evaluating
scientific summaries. In addition, fine-tuned smaller models can compete with
LLMs in scientific contexts, while LLMs have limitations in learning from
in-context information in scientific domains. This suggests an area for future
enhancement of LLMs.
Related papers
- Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z) - Evaluating Large Language Models for Structured Science Summarization in the Open Research Knowledge Graph [18.41743815836192]
We propose using Large Language Models (LLMs) to automatically suggest properties for structured science summaries.
Our study performs a comprehensive comparative analysis between ORKG's manually curated properties and those generated by the aforementioned state-of-the-art LLMs.
Overall, LLMs show potential as recommendation systems for structuring science, but further finetuning is recommended to improve their alignment with scientific tasks and mimicry of human expertise.
arXiv Detail & Related papers (2024-05-03T14:03:04Z) - FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction [85.26780391682894]
We propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE)
FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary.
Our metric sets a new state of the art on AGGREFACT, the de-facto benchmark for factuality evaluation.
arXiv Detail & Related papers (2024-03-04T17:57:18Z) - SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis [25.18030943975122]
Large Language Models (LLMs) have revolutionized natural language understanding and generation.
Existing benchmarks fail to adequately evaluate the proficiency of LLMs in scientific literature analysis.
We introduce SciAssess, a benchmark specifically designed for the comprehensive evaluation of LLMs in scientific literature analysis.
arXiv Detail & Related papers (2024-03-04T12:19:28Z) - Quantitative knowledge retrieval from large language models [4.155711233354597]
Large language models (LLMs) have been extensively studied for their abilities to generate convincing natural language sequences.
This paper explores the feasibility of LLMs as a mechanism for quantitative knowledge retrieval to aid data analysis tasks.
arXiv Detail & Related papers (2024-02-12T16:32:37Z) - F-Eval: Asssessing Fundamental Abilities with Refined Evaluation Methods [111.46455901113976]
We propose F-Eval, a bilingual evaluation benchmark to evaluate the fundamental abilities, including expression, commonsense and logic.
For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z) - Leveraging Large Language Models for NLG Evaluation: Advances and Challenges [57.88520765782177]
Large Language Models (LLMs) have opened new avenues for assessing generated content quality, e.g., coherence, creativity, and context relevance.
We propose a coherent taxonomy for organizing existing LLM-based evaluation metrics, offering a structured framework to understand and compare these methods.
By discussing unresolved challenges, including bias, robustness, domain-specificity, and unified evaluation, this paper seeks to offer insights to researchers and advocate for fairer and more advanced NLG evaluation techniques.
arXiv Detail & Related papers (2024-01-13T15:59:09Z) - InFoBench: Evaluating Instruction Following Ability in Large Language
Models [57.27152890085759]
Decomposed Requirements Following Ratio (DRFR) is a new metric for evaluating Large Language Models' (LLMs) ability to follow instructions.
We present InFoBench, a benchmark comprising 500 diverse instructions and 2,250 decomposed questions across multiple constraint categories.
arXiv Detail & Related papers (2024-01-07T23:01:56Z) - SciEval: A Multi-Level Large Language Model Evaluation Benchmark for
Scientific Research [12.325362762629782]
We propose SciEval, a comprehensive and multi-disciplinary evaluation benchmark to address these issues.
Based on Bloom's taxonomy, SciEval covers four dimensions to systematically evaluate scientific research ability.
Both objective and subjective questions are included in SciEval.
arXiv Detail & Related papers (2023-08-25T03:05:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.