SciCUEval: A Comprehensive Dataset for Evaluating Scientific Context Understanding in Large Language Models
- URL: http://arxiv.org/abs/2505.15094v1
- Date: Wed, 21 May 2025 04:33:26 GMT
- Title: SciCUEval: A Comprehensive Dataset for Evaluating Scientific Context Understanding in Large Language Models
- Authors: Jing Yu, Yuqi Tang, Kehua Feng, Mingyang Rao, Lei Liang, Zhiqiang Zhang, Mengshu Sun, Wen Zhang, Qiang Zhang, Keyan Ding, Huajun Chen
- Abstract summary: SciCUEval is a benchmark dataset tailored to assess the scientific context understanding capability of Large Language Models (LLMs).
It comprises ten domain-specific sub-datasets spanning biology, chemistry, physics, biomedicine, and materials science, integrating diverse data modalities including structured tables, knowledge graphs, and unstructured texts.
It systematically evaluates four core competencies through a variety of question formats: Relevant information identification, Information-absence detection, Multi-source information integration, and Context-aware inference.
- Score: 35.839640555805374
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have shown impressive capabilities in contextual understanding and reasoning. However, evaluating their performance across diverse scientific domains remains underexplored, as existing benchmarks primarily focus on general domains and fail to capture the intricate complexity of scientific data. To bridge this gap, we construct SciCUEval, a comprehensive benchmark dataset tailored to assess the scientific context understanding capability of LLMs. It comprises ten domain-specific sub-datasets spanning biology, chemistry, physics, biomedicine, and materials science, integrating diverse data modalities including structured tables, knowledge graphs, and unstructured texts. SciCUEval systematically evaluates four core competencies through a variety of question formats: Relevant information identification, Information-absence detection, Multi-source information integration, and Context-aware inference. We conduct extensive evaluations of state-of-the-art LLMs on SciCUEval, providing a fine-grained analysis of their strengths and limitations in scientific context understanding, and offering valuable insights for the future development of scientific-domain LLMs.
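To make the evaluation protocol concrete, below is a minimal sketch of how a context-grounded multiple-choice item in this style could be scored. The record fields, prompt template, and `ask_llm` callable are illustrative assumptions, not SciCUEval's released format; the 'N/A' instruction gestures at the information-absence competency.

```python
# Hypothetical scoring of one context-grounded multiple-choice item.
# Field names and prompt wording are assumptions, not the official format.
from typing import Callable

def score_item(item: dict, ask_llm: Callable[[str], str]) -> bool:
    """Return True if the model picks the gold option using only the given context."""
    options = "\n".join(f"{label}. {text}" for label, text in item["options"].items())
    prompt = (
        "Answer using ONLY the context below. If the context does not "
        "contain the answer, reply 'N/A'.\n\n"  # probes information-absence detection
        f"Context:\n{item['context']}\n\n"
        f"Question: {item['question']}\n"
        f"{options}\n"
        "Reply with a single option label."
    )
    prediction = ask_llm(prompt).strip().upper()
    return prediction.startswith(item["gold"].upper())

# Toy usage with a stub model that always answers "A".
item = {
    "context": "Graphene is a single layer of carbon atoms in a hexagonal lattice.",
    "question": "How are the carbon atoms in graphene arranged?",
    "options": {"A": "In a hexagonal lattice", "B": "In a cubic lattice", "C": "N/A"},
    "gold": "A",
}
print(score_item(item, lambda p: "A"))  # True
```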
Related papers
- RLDBF: Enhancing LLMs Via Reinforcement Learning With DataBase FeedBack [15.24890160206967]
This study pioneers a systematic investigation into enhancing large language models with structured scientific data.
To address the inherent limitation of numerical insensitivity in large models, we propose an innovative methodology termed "Reinforcement Learning with Database Feedback" (RLDBF).
arXiv Detail & Related papers (2025-03-28T14:18:29Z)
- SciHorizon: Benchmarking AI-for-Science Readiness from Scientific Data to Large Language Models [36.724471610075696]
We propose SciHorizon, a comprehensive assessment framework designed to benchmark the readiness of AI4Science.
First, we introduce a generalizable framework for assessing AI-ready scientific data, encompassing four key dimensions: Quality, FAIRness, Explainability, and Compliance.
We present recommendation lists of AI-ready datasets for Earth, Life, and Materials Sciences, making a novel and original contribution to the field.
arXiv Detail & Related papers (2025-03-12T11:34:41Z)
- Beyond Factual Accuracy: Evaluating Coverage of Diverse Factual Information in Long-form Text Generation [56.82274763974443]
ICAT is an evaluation framework for measuring coverage of diverse factual information in long-form text generation.
It computes the alignment between the atomic factual claims and various aspects expected to be presented in the output.
Our framework offers interpretable and fine-grained analysis of diversity and coverage.
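As a toy illustration of this claim-to-aspect alignment, the sketch below scores coverage as the fraction of expected aspects matched by at least one atomic claim. An ICAT-style system would align claims with an NLI model or LLM judge; the token-overlap matcher here is a stand-in assumption.

```python
# Toy aspect-coverage metric. Jaccard token overlap is a placeholder for the
# semantic alignment (NLI / LLM judge) a real ICAT-style system would use.
import re

def _tokens(s: str) -> set[str]:
    return set(re.findall(r"[a-z]+", s.lower()))

def _overlap(a: str, b: str) -> float:
    ta, tb = _tokens(a), _tokens(b)
    return len(ta & tb) / max(1, len(ta | tb))  # Jaccard similarity

def coverage(claims: list[str], expected_aspects: list[str], thresh: float = 0.1) -> float:
    """Fraction of expected aspects supported by at least one atomic claim."""
    covered = sum(
        any(_overlap(claim, aspect) >= thresh for claim in claims)
        for aspect in expected_aspects
    )
    return covered / max(1, len(expected_aspects))

claims = ["The drug lowers LDL cholesterol.", "Trials reported mild side effects."]
aspects = ["effect on cholesterol", "side effects", "dosage guidelines"]
print(f"{coverage(claims, aspects):.2f}")  # 2 of 3 aspects covered -> 0.67
```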
arXiv Detail & Related papers (2025-01-07T05:43:23Z)
- SciDQA: A Deep Reading Comprehension Dataset over Scientific Papers [20.273439120429025]
SciDQA is a new dataset for reading comprehension that challenges LLMs for a deep understanding of scientific articles.
Unlike other scientific QA datasets, SciDQA sources questions from peer reviews by domain experts and answers by paper authors.
Questions in SciDQA necessitate reasoning across figures, tables, equations, appendices, and supplementary materials.
arXiv Detail & Related papers (2024-11-08T05:28:22Z)
- MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields.
We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice, and conducted human expert annotation.
Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
- A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery [68.48094108571432]
Large language models (LLMs) have revolutionized the way text and other modalities of data are handled.
We aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs.
arXiv Detail & Related papers (2024-06-16T08:03:24Z)
- SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models [35.98892300665275]
We introduce the SciKnowEval benchmark, a framework that evaluates large language models (LLMs) across five progressive levels of scientific knowledge.
These levels aim to assess the breadth and depth of scientific knowledge in LLMs, including memory, comprehension, reasoning, discernment, and application.
We benchmark 26 advanced open-source and proprietary LLMs using zero-shot and few-shot prompting strategies.
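For concreteness, here is a minimal sketch of the zero-shot versus few-shot prompt assembly such a benchmarking run implies. The exemplar wording is illustrative and not drawn from SciKnowEval.

```python
# Hypothetical prompt builder: zero-shot when no exemplars are given,
# few-shot when worked examples are prepended.
def build_prompt(question: str, exemplars: list[tuple[str, str]] | None = None) -> str:
    parts = []
    for q, a in exemplars or []:        # few-shot: prepend worked examples
        parts.append(f"Q: {q}\nA: {a}")
    parts.append(f"Q: {question}\nA:")  # the item actually under test
    return "\n\n".join(parts)

zero_shot = build_prompt("What is the oxidation state of Fe in Fe2O3?")
few_shot = build_prompt(
    "What is the oxidation state of Fe in Fe2O3?",
    exemplars=[("What is the oxidation state of Na in NaCl?", "+1")],
)
print(few_shot)
```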
arXiv Detail & Related papers (2024-06-13T13:27:52Z)
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
- Rethinking Scientific Summarization Evaluation: Grounding Explainable Metrics on Facet-aware Benchmark [42.131133762827375]
This paper presents conceptual and experimental analyses of scientific summarization.
We introduce the Facet-aware Metric (FM), employing LLMs for advanced semantic matching to evaluate summaries.
Our findings confirm that FM offers a more logical approach to evaluating scientific summaries.
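A hypothetical sketch of facet-aware scoring in this spirit: an LLM judge checks, facet by facet, whether a candidate summary covers the reference content. The `judge` callable and prompt wording are assumptions, not the paper's FM implementation.

```python
# Toy facet-level LLM-as-judge scoring; `judge` is any text-in, text-out model.
from typing import Callable

def facet_score(summary: str, facets: dict[str, str], judge: Callable[[str], str]) -> float:
    """Share of reference facets the judge says the summary covers."""
    hits = 0
    for name, reference in facets.items():
        verdict = judge(
            f"Does the summary cover this {name} facet?\n"
            f"Facet: {reference}\nSummary: {summary}\nAnswer yes or no."
        )
        hits += verdict.strip().lower().startswith("yes")
    return hits / max(1, len(facets))

facets = {"method": "Uses contrastive pretraining.", "result": "Improves F1 by 4 points."}
print(facet_score("The model is pretrained contrastively.", facets, lambda p: "yes"))  # 1.0
```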
arXiv Detail & Related papers (2024-02-22T07:58:29Z)
- SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval [64.03631654052445]
Current benchmarks for evaluating multi-modal information retrieval (MMIR) performance on image-text pairing in the scientific domain show a notable gap.
We develop a specialised scientific MMIR benchmark by leveraging open-access paper collections.
This benchmark comprises 530K meticulously curated image-text pairs, extracted from figures and tables with detailed captions in scientific documents.
arXiv Detail & Related papers (2024-01-24T14:23:12Z)
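To illustrate how image-text retrieval benchmarks of this kind are typically scored, the sketch below computes Recall@k over pairwise cosine similarities. The random embeddings stand in for a real vision-language encoder and are not SciMMIR's evaluation code.

```python
# Recall@k for figure->caption retrieval; pair i is the gold match for row i.
# Random vectors below are placeholders for real encoder outputs (e.g., CLIP-style).
import numpy as np

def recall_at_k(img_emb: np.ndarray, txt_emb: np.ndarray, k: int = 5) -> float:
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sims = img @ txt.T                        # pairwise cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]   # best-matching captions first
    gold = np.arange(len(img_emb))
    return float(np.mean([gold[i] in topk[i] for i in range(len(img_emb))]))

rng = np.random.default_rng(0)
img, txt = rng.normal(size=(100, 64)), rng.normal(size=(100, 64))
print(f"Recall@5 on random embeddings: {recall_at_k(img, txt):.2f}")  # near chance (~0.05)
```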
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.