SciEval: A Multi-Level Large Language Model Evaluation Benchmark for
Scientific Research
- URL: http://arxiv.org/abs/2308.13149v1
- Date: Fri, 25 Aug 2023 03:05:33 GMT
- Title: SciEval: A Multi-Level Large Language Model Evaluation Benchmark for
Scientific Research
- Authors: Liangtai Sun, Yang Han, Zihan Zhao, Da Ma, Zhennan Shen, Baocai Chen,
Lu Chen and Kai Yu
- Abstract summary: We propose SciEval, a comprehensive and multi-disciplinary evaluation benchmark to address these issues.
Based on Bloom's taxonomy, SciEval covers four dimensions to systematically evaluate scientific research ability.
Both objective and subjective questions are included in SciEval.
- Score: 12.325362762629782
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, there has been growing interest in using Large Language Models
(LLMs) for scientific research. Numerous benchmarks have been proposed to
evaluate the ability of LLMs for scientific research. However, current
benchmarks are mostly based on pre-collected objective questions. This design
suffers from the data leakage problem and lacks evaluation of subjective Q/A
ability. In this paper, we propose SciEval, a comprehensive and
multi-disciplinary evaluation benchmark to address these issues. Based on
Bloom's taxonomy, SciEval covers four dimensions to systematically evaluate
scientific research ability. In particular, we design a "dynamic" subset based
on scientific principles to prevent potential data leakage during evaluation.
Both objective and subjective questions are included in SciEval. These
characteristics make SciEval a more effective benchmark for scientific research
ability evaluation of LLMs. Comprehensive experiments on the most advanced LLMs
show that, although GPT-4 achieves SOTA performance compared to other LLMs,
there is still substantial room for improvement, especially for dynamic
questions. The data and codes are now publicly available.
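To make the "dynamic" idea concrete, below is a minimal, hypothetical Python sketch (not taken from the paper or its released code): each question is instantiated from a scientific template with freshly randomized values at evaluation time, so a model cannot answer by recalling a pre-collected item. The template, prompt wording, and grading tolerance are all illustrative assumptions.

```python
import random

# Illustrative only: SciEval's actual "dynamic" questions are derived from
# scientific principles; this template-based generator is a hypothetical
# stand-in showing how fresh question instances avoid memorized answers.
def make_dynamic_question(rng: random.Random) -> dict:
    # Ideal gas law PV = nRT, solved for pressure with randomized inputs.
    n = round(rng.uniform(0.5, 3.0), 2)   # moles
    t = rng.randint(273, 373)             # temperature in kelvin
    v = round(rng.uniform(5.0, 30.0), 1)  # volume in liters
    r = 0.08206                           # gas constant, L*atm/(mol*K)
    answer = round(n * r * t / v, 2)
    prompt = (
        f"A sample of {n} mol of an ideal gas occupies {v} L at {t} K. "
        "What is its pressure in atm? Answer with a number only."
    )
    return {"prompt": prompt, "answer": answer}

def score(model_output: str, reference: float, tol: float = 0.05) -> bool:
    # Accept numeric answers within a small relative tolerance.
    try:
        return abs(float(model_output.strip()) - reference) / reference < tol
    except ValueError:
        return False

if __name__ == "__main__":
    q = make_dynamic_question(random.Random(0))
    print(q["prompt"])                 # send this to the LLM under evaluation
    print(score("1.34", q["answer"]))  # grade the (hypothetical) model reply
```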
Related papers
- MMSci: A Multimodal Multi-Discipline Dataset for PhD-Level Scientific Comprehension [59.41495657570397]
We collected a multimodal, multidisciplinary dataset from open-access scientific articles published in Nature Communications journals.
This dataset spans 72 scientific disciplines, ensuring both diversity and quality.
We created benchmarks with various tasks and settings to comprehensively evaluate LMMs' capabilities in understanding scientific figures and content.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
- SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models [35.98892300665275]
SciKnowEval is a framework that evaluates Large Language Models (LLMs) across five progressive levels of scientific knowledge.
We benchmark 20 leading open-source and proprietary LLMs using zero-shot and few-shot prompting strategies.
The results reveal that despite achieving state-of-the-art performance, the proprietary LLMs still have considerable room for improvement.
arXiv Detail & Related papers (2024-06-13T13:27:52Z)
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
- Evaluating Large Language Models for Structured Science Summarization in the Open Research Knowledge Graph [18.41743815836192]
We propose using Large Language Models (LLMs) to automatically suggest properties for structured science summaries.
Our study performs a comprehensive comparative analysis between ORKG's manually curated properties and those generated by the aforementioned state-of-the-art LLMs.
Overall, LLMs show potential as recommendation systems for structuring science, but further finetuning is recommended to improve their alignment with scientific tasks and mimicry of human expertise.
arXiv Detail & Related papers (2024-05-03T14:03:04Z)
- SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis [25.18030943975122]
Large Language Models (LLMs) have revolutionized natural language understanding and generation.
Existing benchmarks fail to adequately evaluate the proficiency of LLMs in scientific literature analysis.
We introduce SciAssess, a benchmark specifically designed for the comprehensive evaluation of LLMs in scientific literature analysis.
arXiv Detail & Related papers (2024-03-04T12:19:28Z)
- Rethinking Scientific Summarization Evaluation: Grounding Explainable Metrics on Facet-aware Benchmark [43.94573037950725]
This paper presents conceptual and experimental analyses of scientific summarization.
We introduce the Facet-aware Metric (FM), employing LLMs for advanced semantic matching to evaluate summaries.
Our findings confirm that FM offers a more logical approach to evaluating scientific summaries.
arXiv Detail & Related papers (2024-02-22T07:58:29Z)
- F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods [111.46455901113976]
We propose F-Eval, a bilingual evaluation benchmark to evaluate the fundamental abilities, including expression, commonsense and logic.
For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z)
- SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models [70.5763210869525]
We introduce SciBench, an expansive benchmark suite for Large Language Models (LLMs).
SciBench contains a dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains.
The results reveal that the current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%.
arXiv Detail & Related papers (2023-07-20T07:01:57Z)
- A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry.
This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z)
- Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples.
With human evaluation, we obtain the converted total score of LLMs, including GPT-4, ChatGPT and ERNIE-Bot.
We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
arXiv Detail & Related papers (2023-05-21T14:39:28Z)