The Ever-Evolving Science Exam
- URL: http://arxiv.org/abs/2507.16514v3
- Date: Tue, 30 Sep 2025 05:00:52 GMT
- Title: The Ever-Evolving Science Exam
- Authors: Junying Wang, Zicheng Zhang, Yijin Guo, Farong Wen, Ye Shen, Yingji Liang, Yalun Wu, Wenzhe Li, Chunyi Li, Zijian Chen, Qi Jia, Guangtao Zhai,
- Abstract summary: We introduce the Ever-Evolving Science Exam (EESE), a dynamic benchmark designed to reliably assess scientific capabilities in foundation models. Our approach consists of two components: 1) a non-public EESE-Pool with over 100K expertly constructed science instances (question-answer pairs) across 5 disciplines and 500+ subfields, built through a multi-stage pipeline ensuring Range, Reach, and Rigor, and 2) a periodically updated 500-instance subset EESE, sampled and validated to enable leakage-resilient, low-overhead evaluations.
- Score: 69.20851050366643
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As foundation models grow rapidly in capability and deployment, evaluating their scientific understanding becomes increasingly critical. Existing science benchmarks have made progress towards broad Range, wide Reach, and high Rigor, yet they often face two major challenges: data leakage risks that compromise benchmarking validity, and evaluation inefficiency due to large-scale testing. To address these issues, we introduce the Ever-Evolving Science Exam (EESE), a dynamic benchmark designed to reliably assess scientific capabilities in foundation models. Our approach consists of two components: 1) a non-public EESE-Pool with over 100K expertly constructed science instances (question-answer pairs) across 5 disciplines and 500+ subfields, built through a multi-stage pipeline ensuring Range, Reach, and Rigor, and 2) a periodically updated 500-instance subset EESE, sampled and validated to enable leakage-resilient, low-overhead evaluations. Experiments on 32 open- and closed-source models demonstrate that EESE effectively differentiates the strengths and weaknesses of models in scientific fields and cognitive dimensions. Overall, EESE provides a robust, scalable, and forward-compatible solution for science benchmark design, offering a realistic measure of how well foundation models handle science questions. The project page is at: https://github.com/aiben-ch/EESE.
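The two-component design suggests a simple release protocol: keep the pool private and periodically re-sample a fresh 500-instance exam. Below is a minimal sketch under assumed field names (`id`, `discipline`) and an assumed stratified-sampling rule; it is an illustration, not the authors' released code.

```python
import random
from collections import defaultdict

def sample_eese_subset(pool, period, size=500):
    """Draw one periodic EESE release from the private pool.

    pool   : list of dicts with 'id' and 'discipline' keys (assumed schema)
    period : integer release index; seeding on it makes each release
             reproducible while rotating items across releases
    """
    rng = random.Random(f"eese-release-{period}")  # deterministic per period
    by_discipline = defaultdict(list)
    for item in pool:
        by_discipline[item["discipline"]].append(item)

    quota = size // len(by_discipline)  # equal share across disciplines
    subset = []
    for _, items in sorted(by_discipline.items()):
        subset.extend(rng.sample(items, min(quota, len(items))))

    # top up to exactly `size` from the rest of the pool
    chosen = {item["id"] for item in subset}
    leftovers = [item for item in pool if item["id"] not in chosen]
    subset.extend(rng.sample(leftovers, size - len(subset)))
    return subset
```

Because the pool stays private and each release is freshly sampled, a model that memorized a previous release gains little on the next one, which is the leakage-resilience argument in miniature.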
Related papers
- Sci-CoE: Co-evolving Scientific Reasoning LLMs via Geometric Consensus with Sparse Supervision [15.806243963561776]
Sci-CoE is a two-stage scientific co-evolving framework that enables models to self-evolve as both solver and verifier. In the first stage, the model uses a small set of annotated data to establish correctness judgment anchors for the Verifier. In the second stage, we introduce a geometric reward mechanism that jointly considers consensus, reliability, and diversity, driving large-scale self-iteration.
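The abstract does not define the reward formula; a geometric mean is one plausible reading of "geometric reward mechanism", since it zeroes out whenever any single signal collapses. The sketch below is an assumption-laden illustration with all three signals in [0, 1].

```python
def geometric_reward(consensus, reliability, diversity, eps=1e-8):
    """Combine three signals in [0, 1] into one reward.

    A geometric mean (an assumed reading of the paper's 'geometric
    reward mechanism') vanishes when any single signal vanishes, so
    the solver cannot game one term while ignoring the others.
    """
    signals = (consensus, reliability, diversity)
    product = 1.0
    for s in signals:
        product *= max(s, eps)  # eps guards against exact zeros
    return product ** (1.0 / len(signals))

# High consensus cannot compensate for zero diversity:
print(geometric_reward(0.9, 0.8, 0.7))  # ~0.80
print(geometric_reward(0.9, 0.9, 0.0))  # ~0.002, effectively 0
```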
arXiv Detail & Related papers (2026-02-12T16:46:00Z)
- HiSciBench: A Hierarchical Multi-disciplinary Benchmark for Scientific Intelligence from Reading to Discovery [50.8841471967624]
HiSciBench is a hierarchical benchmark designed to evaluate foundation models across five levels that mirror the complete scientific workflow. It contains 8,735 carefully curated instances spanning six major scientific disciplines.
arXiv Detail & Related papers (2025-12-28T12:08:05Z)
- Evaluating Large Language Models in Scientific Discovery [91.732562776782]
Large language models (LLMs) are increasingly applied to scientific research, yet prevailing science benchmarks probe decontextualized knowledge. We introduce a scenario-grounded benchmark that evaluates LLMs across biology, chemistry, materials, and physics. The framework assesses models at two levels: (i) question-level accuracy on scenario-tied items and (ii) project-level performance.
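A minimal sketch of that two-level scoring, assuming each item carries a project identifier; averaging item correctness within a project is an illustrative simplification, not necessarily the paper's aggregation rule.

```python
from collections import defaultdict

def two_level_scores(results):
    """results: list of (project_id, is_correct) pairs (assumed schema).

    Returns overall question-level accuracy plus a per-project
    breakdown, mirroring the benchmark's two assessment levels.
    """
    question_acc = sum(ok for _, ok in results) / len(results)
    by_project = defaultdict(list)
    for project_id, ok in results:
        by_project[project_id].append(ok)
    project_acc = {p: sum(v) / len(v) for p, v in by_project.items()}
    return question_acc, project_acc
```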
arXiv Detail & Related papers (2025-12-17T16:20:03Z)
- ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning [118.46980291324148]
ATLAS is a large-scale, high-difficulty, and cross-disciplinary evaluation suite composed of approximately 800 original problems. Its key features include High Originality and Contamination Resistance, with all questions newly created or substantially adapted to prevent test data leakage. Preliminary results on leading models demonstrate ATLAS's effectiveness in differentiating their advanced scientific reasoning capabilities.
arXiv Detail & Related papers (2025-11-18T11:13:06Z)
- SciTrust 2.0: A Comprehensive Framework for Evaluating Trustworthiness of Large Language Models in Scientific Applications [0.9650932290026195]
Large language models (LLMs) have demonstrated transformative potential in scientific research, yet their deployment in high-stakes contexts raises significant trustworthiness concerns. Here, we introduce SciTrust 2.0, a comprehensive framework for evaluating LLM trustworthiness in scientific applications.
arXiv Detail & Related papers (2025-10-29T19:22:55Z)
- Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark [49.42250115889234]
We present CritPt, the first benchmark designed to test large language models (LLMs) on research-level physics reasoning tasks. CritPt consists of 71 composite research challenges designed to simulate full-scale research projects at the entry level. We find that while current state-of-the-art LLMs show early promise on isolated checkpoints, they remain far from being able to reliably solve full research-scale challenges.
arXiv Detail & Related papers (2025-09-30T17:34:03Z)
- A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers [251.23085679210206]
Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research. This survey reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge.
arXiv Detail & Related papers (2025-08-28T18:30:52Z)
- MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning [24.72798058808192]
We present TextbookReasoning, an open dataset featuring truthful reference answers extracted from 12k university-level textbooks. We introduce MegaScience, a large-scale mixture of high-quality open-source datasets totaling 1.25 million instances. Our experiments demonstrate that our datasets achieve superior performance and training efficiency with more concise response lengths.
arXiv Detail & Related papers (2025-07-22T17:59:03Z)
- PhysUniBench: An Undergraduate-Level Physics Reasoning Benchmark for Multimodal Models [69.73115077227969]
We present PhysUniBench, a large-scale benchmark designed to evaluate and improve the reasoning capabilities of multimodal large language models (MLLMs). PhysUniBench consists of 3,304 physics questions spanning 8 major sub-disciplines of physics, each accompanied by one visual diagram. The benchmark's construction involved a rigorous multi-stage process, including multiple roll-outs, expert-level evaluation, automated filtering of easily solved problems, and a nuanced difficulty grading system with five levels.
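One way to realize the "multiple roll-outs, filter easy items, five difficulty levels" pipeline is to grade each question by its empirical solve rate across roll-outs. The cutoffs below are illustrative assumptions, not the paper's published thresholds.

```python
def grade_difficulty(solve_counts, n_rollouts, easy_cutoff=0.95):
    """Map per-question solve rates over repeated roll-outs to levels 1-5.

    solve_counts : dict question_id -> number of roll-outs solved
    Questions solved in nearly every roll-out are filtered out as too
    easy; survivors get a level from 1 (easiest) to 5 (hardest).
    """
    graded, filtered = {}, []
    for qid, solved in solve_counts.items():
        rate = solved / n_rollouts
        if rate >= easy_cutoff:
            filtered.append(qid)           # automated filtering step
            continue
        # lower solve rate -> higher difficulty level
        level = 5 - min(int(rate * 5), 4)  # rate in [0, 1) -> level 5..1
        graded[qid] = level
    return graded, filtered
```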
arXiv Detail & Related papers (2025-06-21T09:55:42Z)
- SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification [29.63899315962693]
SciVer consists of 3,000 expert-annotated examples over 1,113 scientific papers, covering four subsets, each representing a common reasoning type in multimodal scientific claim verification. We assess the performance of 21 state-of-the-art multimodal foundation models, including o4-mini, Gemini-2.5-Flash, Llama-3.2-Vision, and Qwen2.5-VL. Our experiments reveal a substantial performance gap between these models and human experts on SciVer.
arXiv Detail & Related papers (2025-06-18T15:43:26Z)
- CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning [12.396302011805755]
We introduce CURIE, a benchmark to measure the potential of Large Language Models (LLMs) in scientific problem-solving. The benchmark introduces ten challenging tasks with a total of 580 problem-and-solution pairs curated by experts in six disciplines. We evaluate a range of closed and open LLMs on CURIE tasks that require domain expertise, comprehension of long in-context information, and multi-step reasoning.
arXiv Detail & Related papers (2025-03-14T17:53:03Z)
- SciHorizon: Benchmarking AI-for-Science Readiness from Scientific Data to Large Language Models [36.724471610075696]
We propose SciHorizon, a comprehensive assessment framework designed to benchmark the readiness of AI4Science. First, we introduce a generalizable framework for assessing AI-ready scientific data, encompassing four key dimensions: Quality, FAIRness, Explainability, and Compliance. We present recommendation lists of AI-ready datasets for Earth, Life, and Materials Sciences, making a novel and original contribution to the field.
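As a toy illustration of the four-dimension assessment, the sketch below aggregates per-dimension scores assumed to be pre-normalized to [0, 1]; the equal weighting is an assumption, not SciHorizon's published scheme.

```python
def ai_readiness(quality, fairness, explainability, compliance):
    """Aggregate the four SciHorizon dimensions (each assumed in [0, 1])
    into one readiness score; equal weights are an assumption."""
    dims = {
        "Quality": quality,
        "FAIRness": fairness,
        "Explainability": explainability,
        "Compliance": compliance,
    }
    score = sum(dims.values()) / len(dims)
    return score, dims  # overall score plus per-dimension breakdown
```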
arXiv Detail & Related papers (2025-03-12T11:34:41Z)
- A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection [52.228708947607636]
This paper proposes a comprehensive visual anomaly detection benchmark, ADer, which is a modular framework for new methods. The benchmark includes multiple datasets from industrial and medical domains, implementing fifteen state-of-the-art methods and nine comprehensive metrics. We objectively reveal the strengths and weaknesses of different methods and provide insights into the challenges and future directions of multi-class visual anomaly detection.
arXiv Detail & Related papers (2024-06-05T13:40:07Z)
- OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems [62.06169250463104]
We present OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions.
The best-performing model, GPT-4V, attains an average score of 17.97% on OlympiadBench, with a mere 10.74% in physics.
Our analysis of GPT-4V points out prevalent issues with hallucinations, knowledge omissions, and logical fallacies.
arXiv Detail & Related papers (2024-02-21T18:49:26Z)
- SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research [11.816426823341134]
We propose SciEval, a comprehensive and multi-disciplinary evaluation benchmark to address these issues.
Based on Bloom's taxonomy, SciEval covers four dimensions to systematically evaluate scientific research ability.
Both objective and subjective questions are included in SciEval.
arXiv Detail & Related papers (2023-08-25T03:05:33Z)
- Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [49.15931834209624]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world. We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique. Under the resulting robustness metric, a model is judged to be robust only if its performance is consistently accurate across entire cliques.
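The clique-based metric can be made concrete as follows. This is a minimal sketch assuming each clique is a set of surface variants sharing identical knowledge, with worst-case-over-clique scoring as one natural reading of "consistently accurate".

```python
def clique_robustness(predictions):
    """predictions: dict clique_id -> list of booleans, one per
    knowledge-invariant variant in that clique (assumed schema).

    A clique counts as robust only if *every* variant is answered
    correctly, so the score rewards consistency, not average luck.
    """
    robust = sum(all(marks) for marks in predictions.values())
    return robust / len(predictions)

# Average accuracy here is 5/6, yet only one of two cliques is robust:
print(clique_robustness({"c1": [True, True, True],
                         "c2": [True, False, True]}))  # 0.5
```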
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
- SCITAB: A Challenging Benchmark for Compositional Reasoning and Claim Verification on Scientific Tables [68.76415918462418]
We present SCITAB, a challenging evaluation dataset consisting of 1.2K expert-verified scientific claims.
Through extensive evaluations, we demonstrate that SCITAB poses a significant challenge to state-of-the-art models.
Our analysis uncovers several unique challenges posed by SCITAB, including table grounding, claim ambiguity, and compositional reasoning.
arXiv Detail & Related papers (2023-05-22T16:13:50Z)
- GFlowNets for AI-Driven Scientific Discovery [74.27219800878304]
We present a new probabilistic machine learning framework called GFlowNets.
GFlowNets can be applied in the modeling, hypothesis generation, and experimental design stages of the experimental science loop.
We argue that GFlowNets can become a valuable tool for AI-driven scientific discovery.
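The abstract is high-level; the defining property of GFlowNets is that they learn to sample compositional objects with probability proportional to a reward, which is what makes them suited to generating diverse candidate hypotheses or designs. Below is a minimal sketch using the trajectory-balance objective on a toy bit-string task; the task, network, and hyperparameters are illustrative assumptions, not from the paper.

```python
# Toy GFlowNet (trajectory balance): build a 3-bit string one bit at a
# time, with reward R(x) = 1 + number of ones. Training makes terminal
# strings appear with probability proportional to R(x).
import torch
import torch.nn as nn

N_BITS = 3

def reward(x):                      # x: list of 0/1 ints
    return 1.0 + sum(x)

policy = nn.Sequential(nn.Linear(N_BITS * 2, 32), nn.ReLU(), nn.Linear(32, 2))
log_Z = nn.Parameter(torch.zeros(()))   # learned log partition function
opt = torch.optim.Adam([*policy.parameters(), log_Z], lr=1e-2)

def encode(prefix):
    """One-hot encoding of a partial bit string, padded to N_BITS slots."""
    v = torch.zeros(N_BITS * 2)
    for i, b in enumerate(prefix):
        v[2 * i + b] = 1.0
    return v

for step in range(2000):
    x, log_pf = [], torch.zeros(())
    for _ in range(N_BITS):              # roll out one trajectory
        dist = torch.distributions.Categorical(logits=policy(encode(x)))
        a = dist.sample()
        log_pf = log_pf + dist.log_prob(a)
        x.append(int(a))
    # Trajectory balance: log Z + log P_F(traj) should match log R(x).
    # The backward policy is trivial here since each x has a unique path.
    loss = (log_Z + log_pf - torch.log(torch.tensor(reward(x)))) ** 2
    opt.zero_grad(); loss.backward(); opt.step()

# After training, '111' (R=4) should be sampled ~4x as often as '000' (R=1).
```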
arXiv Detail & Related papers (2023-02-01T17:29:43Z)