SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence
- URL: http://arxiv.org/abs/2512.22334v2
- Date: Tue, 30 Dec 2025 02:13:58 GMT
- Title: SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence
- Authors: Yiheng Wang, Yixin Chen, Shuo Li, Yifan Zhou, Bo Liu, Hengjian Gao, Jiakang Yuan, Jia Bu, Wanghan Xu, Yuhao Zhou, Xiangyu Zhao, Zhiwang Zhou, Fengxiang Wang, Haodong Duan, Songyang Zhang, Jun Yao, Han Deng, Yizhou Wang, Jiabei Xiao, Jiaqi Liu, Encheng Su, Yujie Liu, Weida Wang, Junchi Yao, Shenghe Zheng, Haoran Sun, Runmin Ma, Xiangchao Yan, Bo Zhang, Dongzhan Zhou, Shufei Zhang, Peng Ye, Xiaosong Wang, Shixiang Tang, Wenlong Zhang, Lei Bai
- Abstract summary: SciEvalKit focuses on the core competencies of scientific intelligence. It supports six major scientific domains, ranging from physics and chemistry to astronomy and materials science. The toolkit is open-sourced and actively maintained to foster community-driven development and progress in AI4Science.
- Score: 99.30934038146965
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce SciEvalKit, a unified benchmarking toolkit designed to evaluate AI models for science across a broad range of scientific disciplines and task capabilities. Unlike general-purpose evaluation platforms, SciEvalKit focuses on the core competencies of scientific intelligence: Scientific Multimodal Perception, Scientific Multimodal Reasoning, Scientific Multimodal Understanding, Scientific Symbolic Reasoning, Scientific Code Generation, Scientific Hypothesis Generation, and Scientific Knowledge Understanding. It supports six major scientific domains, ranging from physics and chemistry to astronomy and materials science. SciEvalKit builds a foundation of expert-grade scientific benchmarks, curated from real-world, domain-specific datasets, ensuring that tasks reflect authentic scientific challenges. The toolkit features a flexible, extensible evaluation pipeline that enables batch evaluation across models and datasets, supports custom model and dataset integration, and produces transparent, reproducible, and comparable results. By bridging capability-based evaluation and disciplinary diversity, SciEvalKit offers a standardized yet customizable infrastructure for benchmarking the next generation of scientific foundation models and intelligent agents. The toolkit is open-sourced and actively maintained to foster community-driven development and progress in AI4Science.
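The abstract highlights batch evaluation across models and datasets with custom model integration. As a rough illustration only, the sketch below shows what such a workflow could look like; every name in it (the model wrapper, the generate() contract, the benchmark identifiers, and the placeholder scoring) is a hypothetical assumption for illustration, not SciEvalKit's actual API.

```python
# Hypothetical sketch of a batch-evaluation workflow like the one the
# abstract describes. All class and function names are assumptions;
# consult the SciEvalKit repository for the real interface.
from dataclasses import dataclass

@dataclass
class EvalResult:
    model: str
    benchmark: str
    score: float

class CustomModel:
    """User-supplied model wrapper; we assume a simple generate() contract
    since the abstract says custom model integration is supported."""
    name = "my-science-llm"

    def generate(self, prompt: str) -> str:
        # Stand-in for a real inference call (API endpoint or local weights).
        return "42"

def evaluate(model: CustomModel, benchmarks: list[str]) -> list[EvalResult]:
    """Batch evaluation over several benchmarks, mirroring the abstract's
    'batch evaluation across models and datasets' claim."""
    results = []
    for bench in benchmarks:
        # A real toolkit would load the dataset, prompt the model per item,
        # and score answers with a benchmark-specific metric.
        prediction = model.generate(f"[{bench}] sample question")
        score = 1.0 if prediction else 0.0  # placeholder metric
        results.append(EvalResult(model.name, bench, score))
    return results

if __name__ == "__main__":
    for r in evaluate(CustomModel(), ["physics_qa", "chem_reasoning"]):
        print(f"{r.model} | {r.benchmark}: {r.score:.2f}")
```

A real pipeline would additionally handle dataset loading, per-benchmark metrics, and result serialization so that runs remain transparent, reproducible, and comparable, as the abstract claims.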
Related papers
- HiSciBench: A Hierarchical Multi-disciplinary Benchmark for Scientific Intelligence from Reading to Discovery [50.8841471967624]
HiSciBench is a hierarchical benchmark designed to evaluate foundation models across five levels that mirror the complete scientific workflow. HiSciBench contains 8,735 carefully curated instances spanning six major scientific disciplines.
arXiv Detail & Related papers (2025-12-28T12:08:05Z)
- AInsteinBench: Benchmarking Coding Agents on Scientific Repositories [33.48206557020983]
AInsteinBench is a large-scale benchmark for evaluating whether large language model (LLM) agents can operate as scientific computing development agents. AInsteinBench measures a model's ability to move beyond surface-level code generation toward the core competencies required for computational scientific research.
arXiv Detail & Related papers (2025-12-24T08:11:11Z)
- Autonomous Agents for Scientific Discovery: Orchestrating Scientists, Language, Code, and Physics [82.55776608452017]
Large language models (LLMs) provide a flexible and versatile framework that orchestrates interactions with human scientists, natural language, computer language and code, and physics. This paper presents our view and vision of LLM-based scientific agents and their growing role in transforming the scientific discovery lifecycle. We identify open research challenges and outline promising directions for building more robust, generalizable, and adaptive scientific agents.
arXiv Detail & Related papers (2025-10-10T22:26:26Z)
- A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers [251.23085679210206]
Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research. This survey reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge.
arXiv Detail & Related papers (2025-08-28T18:30:52Z)
- From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery [108.1082357960201]
Agentic AI shows capabilities in hypothesis generation, experimental design, execution, analysis, and iterative refinement. This survey provides a domain-oriented review of autonomous scientific discovery across life sciences, chemistry, materials science, and physics.
arXiv Detail & Related papers (2025-08-18T05:25:54Z)
- MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning [32.21228080662089]
We present TextbookReasoning, an open dataset featuring truthful reference answers extracted from 12k university-level textbooks. We introduce MegaScience, a large-scale mixture of high-quality open-source datasets totaling 1.25 million instances. Our experiments demonstrate that our datasets achieve superior performance and training efficiency with more concise response lengths.
arXiv Detail & Related papers (2025-07-22T17:59:03Z)
- SciCUEval: A Comprehensive Dataset for Evaluating Scientific Context Understanding in Large Language Models [35.839640555805374]
SciCUEval is a benchmark dataset tailored to assess the scientific context understanding capability of Large Language Models (LLMs). It comprises ten domain-specific sub-datasets spanning biology, chemistry, physics, biomedicine, and materials science, integrating diverse data modalities including structured tables, knowledge graphs, and unstructured texts. It systematically evaluates four core competencies through a variety of question formats: relevant information identification, information-absence detection, multi-source information integration, and context-aware inference.
arXiv Detail & Related papers (2025-05-21T04:33:26Z)
- SciDMT: A Large-Scale Corpus for Detecting Scientific Mentions [52.35520385083425]
We present SciDMT, an enhanced and expanded corpus for scientific mention detection.
The corpus consists of two components: 1) the SciDMT main corpus, which includes 48 thousand scientific articles with over 1.8 million weakly annotated mentions in the form of in-text spans, and 2) an evaluation set comprising 100 manually annotated scientific articles.
arXiv Detail & Related papers (2024-06-20T22:03:21Z)