SCI-Verifier: Scientific Verifier with Thinking
- URL: http://arxiv.org/abs/2509.24285v1
- Date: Mon, 29 Sep 2025 04:58:43 GMT
- Title: SCI-Verifier: Scientific Verifier with Thinking
- Authors: Shenghe Zheng, Chenyu Huang, Fangchen Yu, Junchi Yao, Jingqi Ye, Tao Chen, Yun Luo, Ning Ding, LEI BAI, Ganqu Cui, Peng Ye
- Abstract summary: Large language models (LLMs) are increasingly applied to scientific reasoning. Existing verification studies in scientific domains suffer from two major limitations. We propose solutions at both the data and model levels.
- Score: 37.08904000514563
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) are increasingly applied to scientific reasoning, the complexity of answer formats and the diversity of equivalent expressions make answer verification a critical yet challenging task. Existing verification studies in scientific domains suffer from two major limitations: (a) the absence of systematic evaluation standards and insufficient disciplinary coverage, which hinders their comprehensive assessment; and (b) heavy reliance on cumbersome rule design or prompt engineering, which reduces their effectiveness in complex reasoning scenarios or limits their cross-disciplinary generalization. To address these challenges, we propose solutions at both the data and model levels. On the data side, we construct SCI-VerifyBench, a cross-disciplinary benchmark covering mathematics, physics, biology, chemistry, and general scientific QA. The benchmark is built from real LLM responses and enhanced with domain-specific equivalence transformations that generate challenging and realistic data. Model-based and expert annotations ensure both quality and diversity, enabling rigorous evaluation of verification ability. On the model side, we emphasize the importance of reasoning for verification and introduce SCI-Verifier, a unified reasoning-augmented verifier for scientific domains. Through post-training, SCI-Verifier demonstrates strong logical reasoning and equivalence judgment capabilities while maintaining concise and stable outputs. Together, SCI-VerifyBench and SCI-Verifier provide a principled framework for scientific verification, offering both systematic evaluation and practical pathways to enhance the reliability and applicability of LLMs in scientific domains.
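The abstract's point about rule-based verification can be illustrated with a toy example. The sketch below is not SCI-Verifier and is not taken from the paper; `normalize` and `rule_based_equivalent` are hypothetical helper names, and the rules are chosen only to show how a hand-written checker accepts some equivalent forms but misses others (LaTeX notation, unit conversions).

```python
from fractions import Fraction


def normalize(answer: str):
    """Map a short numeric answer string to a Fraction, or return None.

    Hypothetical helper for illustration only; not from the paper.
    """
    text = answer.strip().lower().replace(" ", "")
    try:
        if text.endswith("%"):
            return Fraction(text[:-1]) / 100
        return Fraction(text)  # accepts forms such as "1/2", "0.5", "5e-1"
    except (ValueError, ZeroDivisionError):
        return None


def rule_based_equivalent(reference: str, candidate: str) -> bool:
    """Toy rule-based check: numeric comparison, else exact string match."""
    ref, cand = normalize(reference), normalize(candidate)
    if ref is None or cand is None:
        # No rule applies, so fall back to a case-insensitive string match.
        return reference.strip().lower() == candidate.strip().lower()
    return ref == cand


if __name__ == "__main__":
    pairs = [("1/2", "0.5"), ("0.5", "50%"),
             ("1/2", r"\frac{1}{2}"), ("1000 m", "1 km")]
    for reference, candidate in pairs:
        print(reference, candidate, rule_based_equivalent(reference, candidate))
```

The first two pairs are matched by the numeric rules, while the LaTeX fraction and the unit conversion are rejected even though they are equivalent. Covering each such case needs another hand-written rule, which is the kind of brittleness the abstract attributes to rule-based approaches.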
Related papers
- Sci-CoE: Co-evolving Scientific Reasoning LLMs via Geometric Consensus with Sparse Supervision [15.806243963561776]
Sci-CoE is a two-stage scientific co-evolving framework that enables models to self-evolve as both solver and verifier. In the first stage, the model uses a small set of annotated data to establish correctness judgment anchors for the Verifier. In the second stage, we introduce a geometric reward mechanism that jointly considers consensus, reliability, and diversity, driving large-scale self-iteration.
arXiv Detail & Related papers (2026-02-12T16:46:00Z)
- Reward Modeling for Scientific Writing Evaluation [50.33952894976367]
It is critical to develop models that can be reliably deployed for evaluating diverse open-ended scientific writing tasks. We propose cost-efficient, open-source reward models tailored for scientific writing evaluation.
arXiv Detail & Related papers (2026-01-16T15:32:58Z)
- SciIF: Benchmarking Scientific Instruction Following Towards Rigorous Scientific Intelligence [60.202862987441684]
We introduce scientific instruction following: the capability to solve problems while strictly adhering to the constraints that establish scientific validity. Specifically, we introduce SciIF, a multi-discipline benchmark that evaluates this capability by pairing university-level problems with a fixed catalog of constraints. By measuring both solution correctness and multi-constraint adherence, SciIF enables fine-grained diagnosis of compositional reasoning failures.
arXiv Detail & Related papers (2026-01-08T09:45:58Z)
- SciTrust 2.0: A Comprehensive Framework for Evaluating Trustworthiness of Large Language Models in Scientific Applications [0.9650932290026195]
Large language models (LLMs) have demonstrated transformative potential in scientific research, yet their deployment in high-stakes contexts raises significant trustworthiness concerns. Here, we introduce SciTrust 2.0, a comprehensive framework for evaluating LLM trustworthiness in scientific applications.
arXiv Detail & Related papers (2025-10-29T19:22:55Z)
- A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers [221.34650992288505]
Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research. This survey reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge.
arXiv Detail & Related papers (2025-08-28T18:30:52Z)
- Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning [53.82037883518254]
We introduce SciReas, a diverse suite of existing benchmarks for scientific reasoning tasks. We then propose KRUX, a probing framework for studying the distinct roles of reasoning and knowledge in scientific tasks.
arXiv Detail & Related papers (2025-08-26T17:04:23Z)
- Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim $\rightarrow$ Evidence Reasoning [6.043212666944194]
We present CLAIM-BENCH, a benchmark for evaluating large language models' capabilities in scientific claim-evidence extraction and validation. We show that closed-source models like GPT-4 and Claude consistently outperform open-source counterparts in precision and recall. Strategically designed three-pass and one-by-one prompting approaches significantly improve LLMs' abilities to accurately link dispersed evidence with claims.
arXiv Detail & Related papers (2025-06-09T21:04:39Z)
- A Survey on Post-training of Large Language Models [185.51013463503946]
Large Language Models (LLMs) have fundamentally transformed natural language processing, making them indispensable across domains ranging from conversational systems to scientific exploration. These challenges necessitate advanced post-training language models (PoLMs) to address shortcomings, such as restricted reasoning capacities, ethical uncertainties, and suboptimal domain-specific performance. This paper presents the first comprehensive survey of PoLMs, systematically tracing their evolution across five core paradigms: Fine-tuning, which enhances task-specific accuracy; Alignment, which ensures ethical coherence and alignment with human preferences; Reasoning, which advances multi-step inference despite challenges in reward design; Integration and Adaptation, which
arXiv Detail & Related papers (2025-03-08T05:41:42Z)
- On the Rigour of Scientific Writing: Criteria, Analysis, and Insights [15.055289544883534]
Rigour is crucial for scientific research as it ensures the validity of results and findings. We introduce a bottom-up, data-driven framework to automatically identify and define rigour criteria. Our framework is domain-agnostic and can be tailored to the evaluation of scientific rigour for different areas.
arXiv Detail & Related papers (2024-10-07T12:22:06Z)
- What is Reproducibility in Artificial Intelligence and Machine Learning Research? [0.7373617024876725]
We introduce a framework that clarifies the roles and definitions of key validation efforts. This structured framework aims to provide AI/ML researchers with the necessary clarity on these essential concepts.
arXiv Detail & Related papers (2024-04-29T18:51:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.