SciEval: A Multi-Level Large Language Model Evaluation Benchmark for
Scientific Research
- URL: http://arxiv.org/abs/2308.13149v1
- Date: Fri, 25 Aug 2023 03:05:33 GMT
- Title: SciEval: A Multi-Level Large Language Model Evaluation Benchmark for
Scientific Research
- Authors: Liangtai Sun, Yang Han, Zihan Zhao, Da Ma, Zhennan Shen, Baocai Chen,
Lu Chen and Kai Yu
- Abstract summary: We propose SciEval, a comprehensive and multi-disciplinary evaluation benchmark to address these issues.
Based on Bloom's taxonomy, SciEval covers four dimensions to systematically evaluate scientific research ability.
Both objective and subjective questions are included in SciEval.
- Score: 12.325362762629782
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, there has been growing interest in using Large Language Models
(LLMs) for scientific research. Numerous benchmarks have been proposed to
evaluate the ability of LLMs for scientific research. However, current
benchmarks are mostly based on pre-collected objective questions. This design
suffers from the data leakage problem and lacks evaluation of subjective Q/A
ability. In this paper, we propose SciEval, a comprehensive and
multi-disciplinary evaluation benchmark to address these issues. Based on
Bloom's taxonomy, SciEval covers four dimensions to systematically evaluate
scientific research ability. In particular, we design a "dynamic" subset based
on scientific principles to prevent potential data leakage during evaluation.
Both objective and subjective questions are included in SciEval. These
characteristics make SciEval a more effective benchmark for scientific research
ability evaluation of LLMs. Comprehensive experiments on the most advanced LLMs
show that, although GPT-4 achieves SOTA performance compared to other LLMs,
there is still substantial room for improvement, especially for dynamic
questions. The data and codes are now publicly available.
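To make the "dynamic" idea concrete, below is a minimal, hypothetical Python sketch (not taken from the paper or its released code): each question is instantiated from a scientific template with freshly randomized values at evaluation time, so a model cannot answer by recalling a pre-collected item. The template, prompt wording, and grading tolerance are all illustrative assumptions.

```python
import random

# Illustrative only: SciEval's actual "dynamic" questions are derived from
# scientific principles; this template-based generator is a hypothetical
# stand-in showing how fresh question instances avoid memorized answers.
def make_dynamic_question(rng: random.Random) -> dict:
    # Ideal gas law PV = nRT, solved for pressure with randomized inputs.
    n = round(rng.uniform(0.5, 3.0), 2)   # moles
    t = rng.randint(273, 373)             # temperature in kelvin
    v = round(rng.uniform(5.0, 30.0), 1)  # volume in liters
    r = 0.08206                           # gas constant, L*atm/(mol*K)
    answer = round(n * r * t / v, 2)
    prompt = (
        f"A sample of {n} mol of an ideal gas occupies {v} L at {t} K. "
        "What is its pressure in atm? Answer with a number only."
    )
    return {"prompt": prompt, "answer": answer}

def score(model_output: str, reference: float, tol: float = 0.05) -> bool:
    # Accept numeric answers within a small relative tolerance.
    try:
        return abs(float(model_output.strip()) - reference) / reference < tol
    except ValueError:
        return False

if __name__ == "__main__":
    q = make_dynamic_question(random.Random(0))
    print(q["prompt"])                 # send this to the LLM under evaluation
    print(score("1.34", q["answer"]))  # grade the (hypothetical) model reply
```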
Related papers
- MMSci: A Multimodal Multi-Discipline Dataset for PhD-Level Scientific Comprehension [59.41495657570397]
We collected a multimodal, multidisciplinary dataset from open-access scientific articles published in Nature Communications journals.
This dataset spans 72 scientific disciplines, ensuring both diversity and quality.
We created benchmarks with various tasks and settings to comprehensively evaluate LMMs' capabilities in understanding scientific figures and content.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
- SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models [35.98892300665275]
SciKnowEval is a framework that evaluates Large Language Models (LLMs) across five progressive levels of scientific knowledge.
We benchmark 20 leading open-source and proprietary LLMs using zero-shot and few-shot prompting strategies.
The results reveal that despite achieving state-of-the-art performance, the proprietary LLMs still have considerable room for improvement.
arXiv Detail & Related papers (2024-06-13T13:27:52Z)
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
- Evaluating Large Language Models for Structured Science Summarization in the Open Research Knowledge Graph [18.41743815836192]
We propose using Large Language Models (LLMs) to automatically suggest properties for structured science summaries.
Our study performs a comprehensive comparative analysis between ORKG's manually curated properties and those generated by the aforementioned state-of-the-art LLMs.
Overall, LLMs show potential as recommendation systems for structuring science, but further finetuning is recommended to improve their alignment with scientific tasks and mimicry of human expertise.
arXiv Detail & Related papers (2024-05-03T14:03:04Z)
- SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis [25.18030943975122]
Large Language Models (LLMs) have revolutionized natural language understanding and generation.
Existing benchmarks fail to adequately evaluate the proficiency of LLMs in scientific literature analysis.
We introduce SciAssess, a benchmark specifically designed for the comprehensive evaluation of LLMs in scientific literature analysis.
arXiv Detail & Related papers (2024-03-04T12:19:28Z)
- Rethinking Scientific Summarization Evaluation: Grounding Explainable Metrics on Facet-aware Benchmark [43.94573037950725]
This paper presents conceptual and experimental analyses of scientific summarization.
We introduce the Facet-aware Metric (FM), employing LLMs for advanced semantic matching to evaluate summaries.
Our findings confirm that FM offers a more logical approach to evaluating scientific summaries.
arXiv Detail & Related papers (2024-02-22T07:58:29Z)
- F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods [111.46455901113976]
We propose F-Eval, a bilingual evaluation benchmark to evaluate the fundamental abilities, including expression, commonsense and logic.
For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z)
- SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models [70.5763210869525]
We introduce SciBench, an expansive benchmark suite for Large Language Models (LLMs).
SciBench contains a dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains.
The results reveal that the current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%.
arXiv Detail & Related papers (2023-07-20T07:01:57Z)
- A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry.
This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z)
- Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples.
With human evaluation, we obtain the converted total score of LLMs, including GPT-4, ChatGPT and ERNIE-Bot.
We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
arXiv Detail & Related papers (2023-05-21T14:39:28Z)