SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models
- URL: http://arxiv.org/abs/2406.09098v1
- Date: Thu, 13 Jun 2024 13:27:52 GMT
- Title: SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models
- Authors: Kehua Feng, Keyan Ding, Weijie Wang, Xiang Zhuang, Zeyuan Wang, Ming Qin, Yu Zhao, Jianhua Yao, Qiang Zhang, Huajun Chen
- Abstract summary: SciKnowEval is a framework that evaluates Large Language Models (LLMs) across five progressive levels of scientific knowledge.
We benchmark 20 leading open-source and proprietary LLMs using zero-shot and few-shot prompting strategies.
The results reveal that despite achieving state-of-the-art performance, the proprietary LLMs still have considerable room for improvement.
- Score: 35.98892300665275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The burgeoning utilization of Large Language Models (LLMs) in scientific research necessitates advanced benchmarks capable of comprehensively evaluating their understanding and application of scientific knowledge. To address this need, we introduce the SciKnowEval benchmark, a novel framework that systematically evaluates LLMs across five progressive levels of scientific knowledge: studying extensively, inquiring earnestly, thinking profoundly, discerning clearly, and practicing assiduously. These levels aim to assess the breadth and depth of scientific knowledge in LLMs, including knowledge coverage, inquiry and exploration capabilities, reflection and reasoning abilities, ethics and safety considerations, and practical proficiency. Specifically, we take biology and chemistry as the two instances of SciKnowEval and construct a dataset encompassing 50K multi-level scientific problems and solutions. By leveraging this dataset, we benchmark 20 leading open-source and proprietary LLMs using zero-shot and few-shot prompting strategies. The results reveal that despite achieving state-of-the-art performance, the proprietary LLMs still have considerable room for improvement, particularly in addressing scientific computations and applications. We anticipate that SciKnowEval will establish a comprehensive standard for benchmarking LLMs in scientific research and discovery, and promote the development of LLMs that integrate scientific knowledge with strong safety awareness. The dataset and code are publicly available at https://github.com/hicai-zju/sciknoweval.
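As a rough sketch of the zero-shot and few-shot prompting strategies the abstract describes, the snippet below builds both prompt styles for a single benchmark question. The record fields, the example questions, and the `build_prompt` helper are illustrative assumptions, not the benchmark's actual interface; the GitHub repository above provides the real data loaders and evaluation code.

```python
def build_prompt(item, exemplars=None):
    """Build a zero-shot prompt (exemplars=None) or a few-shot prompt
    (exemplars = list of solved items) for one benchmark question."""
    parts = ["Answer the following scientific question."]
    for ex in exemplars or []:
        # Each exemplar is a solved question-answer pair shown to the model.
        parts.append(f"Question: {ex['question']}\nAnswer: {ex['answer']}")
    parts.append(f"Question: {item['question']}\nAnswer:")
    return "\n\n".join(parts)

# Hypothetical record layout; the real SciKnowEval schema may differ.
item = {"question": "Which amino acid side chain contains a thiol group?"}
shots = [{"question": "Which element has atomic number 6?", "answer": "Carbon"}]

zero_shot_prompt = build_prompt(item)                  # zero-shot: no exemplars
few_shot_prompt = build_prompt(item, exemplars=shots)  # few-shot: k = 1 exemplar
print(few_shot_prompt)
```

The model's completion for each prompt would then be scored against the gold answer; how answers are parsed and scored per task type is defined in the benchmark's own scripts.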
Related papers
- Knowledge Mechanisms in Large Language Models: A Survey and Perspective [88.51320482620679]
This paper reviews analyses of knowledge mechanisms in LLMs through a novel taxonomy spanning knowledge utilization and knowledge evolution.
We discuss what knowledge LLMs have learned, why parametric knowledge is fragile, and the hypothesized "dark knowledge" that will be challenging to address.
arXiv Detail & Related papers (2024-07-22T06:15:59Z)
- A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery [68.48094108571432]
We aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs.
We comprehensively survey over 250 scientific LLMs, discuss their commonalities and differences, as well as summarize pre-training datasets and evaluation tasks for each field and modality.
arXiv Detail & Related papers (2024-06-16T08:03:24Z)
- Knowledge-guided Machine Learning: Current Trends and Future Prospects [14.783972088722193]
The paper provides an introduction to the current state of research in the emerging field of scientific knowledge-guided machine learning (KGML).
We discuss different facets of KGML research in terms of the type of scientific knowledge used, the form of knowledge-ML integration explored, and the method for incorporating scientific knowledge in ML.
arXiv Detail & Related papers (2024-03-24T02:54:46Z)
- SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis [25.18030943975122]
Large Language Models (LLMs) have revolutionized natural language understanding and generation.
Existing benchmarks fail to adequately evaluate the proficiency of LLMs in scientific literature analysis.
We introduce SciAssess, a benchmark specifically designed for the comprehensive evaluation of LLMs in scientific literature analysis.
arXiv Detail & Related papers (2024-03-04T12:19:28Z)
- Scientific Large Language Models: A Survey on Biological & Chemical Domains [47.97810890521825]
Large Language Models (LLMs) have emerged as a transformative power in enhancing natural language comprehension.
The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized linguistic systems developed within various scientific disciplines.
As a burgeoning area in the community of AI for Science, scientific LLMs warrant comprehensive exploration.
arXiv Detail & Related papers (2024-01-26T05:33:34Z)
- KGQuiz: Evaluating the Generalization of Encoded Knowledge in Large Language Models [39.554274096542244]
KGQuiz is a knowledge-intensive benchmark to investigate the knowledge generalization abilities of large language models.
We evaluate 10 open-source and black-box LLMs on the KGQuiz benchmark across the five knowledge-intensive tasks and knowledge domains.
We envision KGQuiz as a testbed to analyze such nuanced variations in performance across domains and task formats.
arXiv Detail & Related papers (2023-10-15T04:00:36Z)
- Exploring the Cognitive Knowledge Structure of Large Language Models: An Educational Diagnostic Assessment Approach [50.125704610228254]
Large Language Models (LLMs) have not only exhibited exceptional performance across various tasks, but also demonstrated sparks of intelligence.
Recent studies have focused on assessing their capabilities on human exams and revealed their impressive competence in different domains.
We conduct an evaluation using MoocRadar, a meticulously annotated human test dataset based on Bloom's taxonomy.
arXiv Detail & Related papers (2023-10-12T09:55:45Z)
- SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models [70.5763210869525]
We introduce SciBench, an expansive benchmark suite for Large Language Models (LLMs).
SciBench contains a dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains.
The results reveal that the current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%.
arXiv Detail & Related papers (2023-07-20T07:01:57Z)