Related papers: SCI-Verifier: Scientific Verifier with Thinking

URL: http://arxiv.org/abs/2509.24285v1
Date: Mon, 29 Sep 2025 04:58:43 GMT
Title: SCI-Verifier: Scientific Verifier with Thinking
Authors: Shenghe Zheng, Chenyu Huang, Fangchen Yu, Junchi Yao, Jingqi Ye, Tao Chen, Yun Luo, Ning Ding, LEI BAI, Ganqu Cui, Peng Ye
Abstract summary: Large language models (LLMs) are increasingly applied to scientific reasoning. Existing verification studies in scientific domains suffer from two major limitations. We propose solutions at both the data and model levels.
Score: 37.08904000514563
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As large language models (LLMs) are increasingly applied to scientific reasoning, the complexity of answer formats and the diversity of equivalent expressions make answer verification a critical yet challenging task. Existing verification studies in scientific domains suffer from two major limitations: (a) the absence of systematic evaluation standards and insufficient disciplinary coverage, which hinders their comprehensive assessment; and (b) heavy reliance on cumbersome rule design or prompt engineering, which reduces their effectiveness in complex reasoning scenarios or limits their cross-disciplinary generalization. To address these challenges, we propose solutions at both the data and model levels. On the data side, we construct SCI-VerifyBench, a cross-disciplinary benchmark covering mathematics, physics, biology, chemistry, and general scientific QA. The benchmark is built from real LLM responses and enhanced with domain-specific equivalence transformations that generate challenging and realistic data. Model-based and expert annotations ensure both quality and diversity, enabling rigorous evaluation of verification ability. On the model side, we emphasize the importance of reasoning for verification and introduce SCI-Verifier, a unified reasoning-augmented verifier for scientific domains. Through post-training, SCI-Verifier demonstrates strong logical reasoning and equivalence judgment capabilities while maintaining concise and stable outputs. Together, SCI-VerifyBench and SCI-Verifier provide a principled framework for scientific verification, offering both systematic evaluation and practical pathways to enhance the reliability and applicability of LLMs in scientific domains.

Related papers

Sci-CoE: Co-evolving Scientific Reasoning LLMs via Geometric Consensus with Sparse Supervision [15.806243963561776]
Sci-CoE is a two-stage scientific co-evolving framework that enables models to self-evolve as both solver and verifier. In the first stage, the model uses a small set of annotated data to establish correctness judgment anchors for the Verifier. In the second stage, we introduce a geometric reward mechanism that jointly considers consensus, reliability, and diversity, driving large-scale self-iteration.
arXiv Detail & Related papers (2026-02-12T16:46:00Z)
Reward Modeling for Scientific Writing Evaluation [50.33952894976367]
It is critical to develop models that can be reliably deployed for evaluating diverse open-ended scientific writing tasks. We propose cost-efficient, open-source reward models tailored for scientific writing evaluation.
arXiv Detail & Related papers (2026-01-16T15:32:58Z)
SciIF: Benchmarking Scientific Instruction Following Towards Rigorous Scientific Intelligence [60.202862987441684]
We introduce scientific instruction following: the capability to solve problems while strictly adhering to the constraints that establish scientific validity. Specifically, we introduce SciIF, a multi-discipline benchmark that evaluates this capability by pairing university-level problems with a fixed catalog of constraints. By measuring both solution correctness and multi-constraint adherence, SciIF enables fine-grained diagnosis of compositional reasoning failures.
arXiv Detail & Related papers (2026-01-08T09:45:58Z)
SciTrust 2.0: A Comprehensive Framework for Evaluating Trustworthiness of Large Language Models in Scientific Applications [0.9650932290026195]
Large language models (LLMs) have demonstrated transformative potential in scientific research, yet their deployment in high-stakes contexts raises significant trustworthiness concerns. Here, we introduce SciTrust 2.0, a comprehensive framework for evaluating LLM trustworthiness in scientific applications.
arXiv Detail & Related papers (2025-10-29T19:22:55Z)
A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers [221.34650992288505]
Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research. This survey reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge.
arXiv Detail & Related papers (2025-08-28T18:30:52Z)
Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning [53.82037883518254]
We introduce SciReas, a diverse suite of existing benchmarks for scientific reasoning tasks. We then propose KRUX, a probing framework for studying the distinct roles of reasoning and knowledge in scientific tasks.
arXiv Detail & Related papers (2025-08-26T17:04:23Z)
Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim $\rightarrow$ Evidence Reasoning [6.043212666944194]
We present CLAIM-BENCH, a benchmark for evaluating large language models' capabilities in scientific claim-evidence extraction and validation. We show that closed-source models like GPT-4 and Claude consistently outperform open-source counterparts in precision and recall. Strategically designed three-pass and one-by-one prompting approaches significantly improve LLMs' abilities to accurately link dispersed evidence with claims.
arXiv Detail & Related papers (2025-06-09T21:04:39Z)
A Survey on Post-training of Large Language Models [185.51013463503946]
Large Language Models (LLMs) have fundamentally transformed natural language processing, making them indispensable across domains ranging from conversational systems to scientific exploration. These challenges necessitate advanced post-training language models (PoLMs) to address shortcomings, such as restricted reasoning capacities, ethical uncertainties, and suboptimal domain-specific performance. This paper presents the first comprehensive survey of PoLMs, systematically tracing their evolution across five core paradigms: Fine-tuning, which enhances task-specific accuracy; Alignment, which ensures ethical coherence and alignment with human preferences; Reasoning, which advances multi-step inference despite challenges in reward design; Integration and Adaptation, which
arXiv Detail & Related papers (2025-03-08T05:41:42Z)
On the Rigour of Scientific Writing: Criteria, Analysis, and Insights [15.055289544883534]
Rigour is crucial for scientific research as it ensures the reproducibility and validity of results and findings. We introduce a bottom-up, data-driven framework to automatically identify and define rigour criteria. Our framework is domain-agnostic and can be tailored to the evaluation of scientific rigour for different areas.
arXiv Detail & Related papers (2024-10-07T12:22:06Z)
What is Reproducibility in Artificial Intelligence and Machine Learning Research? [0.7373617024876725]
We introduce a framework that clarifies the roles and definitions of key validation efforts. This structured framework aims to provide AI/ML researchers with the necessary clarity on these essential concepts.
arXiv Detail & Related papers (2024-04-29T18:51:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.