SciIF: Benchmarking Scientific Instruction Following Towards Rigorous Scientific Intelligence
- URL: http://arxiv.org/abs/2601.04770v2
- Date: Mon, 12 Jan 2026 02:43:34 GMT
- Title: SciIF: Benchmarking Scientific Instruction Following Towards Rigorous Scientific Intelligence
- Authors: Encheng Su, Jianyu Wu, Chen Tang, Lintao Wang, Pengze Li, Aoran Wang, Jinouwen Zhang, Yizhou Wang, Yuan Meng, Xinzhu Ma, Shixiang Tang, Houqiang Li,
- Abstract summary: We introduce scientific instruction following: the capability to solve problems while strictly adhering to the constraints that establish scientific validity. Specifically, we introduce SciIF, a multi-discipline benchmark that evaluates this capability by pairing university-level problems with a fixed catalog of constraints. By measuring both solution correctness and multi-constraint adherence, SciIF enables fine-grained diagnosis of compositional reasoning failures.
- Score: 60.202862987441684
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models (LLMs) transition from general knowledge retrieval to complex scientific discovery, their evaluation standards must also incorporate the rigorous norms of scientific inquiry. Existing benchmarks exhibit a critical blind spot: general instruction-following metrics focus on superficial formatting, while domain-specific scientific benchmarks assess only final-answer correctness, often rewarding models that arrive at the right result for the wrong reasons. To address this gap, we introduce scientific instruction following: the capability to solve problems while strictly adhering to the constraints that establish scientific validity. Specifically, we introduce SciIF, a multi-discipline benchmark that evaluates this capability by pairing university-level problems with a fixed catalog of constraints across three pillars: scientific conditions (e.g., boundary checks and assumptions), semantic stability (e.g., unit and symbol conventions), and specific processes (e.g., required numerical methods). Uniquely, SciIF emphasizes auditability, requiring models to provide explicit evidence of constraint satisfaction rather than implicit compliance. By measuring both solution correctness and multi-constraint adherence, SciIF enables fine-grained diagnosis of compositional reasoning failures, ensuring that LLMs can function as reliable agents within the strict logical frameworks of science.
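To make the scoring protocol concrete, here is a minimal Python sketch of SciIF-style evaluation under stated assumptions: it grades a response on final-answer correctness and on per-constraint adherence separately, with each constraint check standing in for the benchmark's evidence-of-compliance audit. All names (`Constraint`, `score`, the example checks) are illustrative, not the benchmark's actual API.

```python
# Hypothetical sketch of SciIF-style scoring: a response is graded on
# final-answer correctness AND on per-constraint adherence, where each
# constraint check looks for explicit evidence of compliance in the text.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Constraint:
    name: str                      # e.g. "SI units only"
    check: Callable[[str], bool]   # True if evidence of compliance is found

def score(response: str, answer: str, gold: str,
          constraints: list[Constraint]) -> dict:
    adherence = {c.name: c.check(response) for c in constraints}
    return {
        "correct": answer.strip() == gold.strip(),       # final-answer correctness
        "adherence": adherence,                          # per-constraint results
        "all_constraints_met": all(adherence.values()),  # strict multi-constraint score
    }

# Toy usage: one semantic-stability constraint (units) and one
# scientific-condition constraint (an explicit boundary check).
constraints = [
    Constraint("uses_si_units", lambda r: "m/s" in r),
    Constraint("states_boundary_check", lambda r: "boundary" in r.lower()),
]
resp = "Checking the boundary condition at t=0 ... final speed is 3.2 m/s."
print(score(resp, "3.2 m/s", "3.2 m/s", constraints))
```

Separating the two scores is what enables the diagnosis the abstract describes: a model can be right for the wrong reasons (correct answer, failed constraints) or rigorous but wrong.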
Related papers
- Knowing When Not to Answer: Abstention-Aware Scientific Reasoning [2.680633756465714]
In scientific settings, unsupported or uncertain conclusions can be more harmful than abstaining. We study this problem through an abstention-aware verification framework. We evaluate this framework across two scientific benchmarks: SciFact and PubMedQA.
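A minimal sketch of the abstention idea (my illustration under assumed names, not the paper's framework): answer only when a verifier's confidence clears a threshold, otherwise abstain, on the premise that an unsupported conclusion is costlier than no conclusion.

```python
# Hypothetical abstention policy: prefer "ABSTAIN" to an unsupported claim.
def answer_or_abstain(claim: str, confidence: float, tau: float = 0.8) -> str:
    """Return the claim only if verifier confidence reaches threshold tau."""
    return claim if confidence >= tau else "ABSTAIN"

print(answer_or_abstain("Drug X reduces symptom Y", confidence=0.62))      # ABSTAIN
print(answer_or_abstain("Water boils at 100 C at 1 atm", confidence=0.97)) # answered
```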
arXiv Detail & Related papers (2026-02-15T15:29:43Z)
- CircuChain: Disentangling Competence and Compliance in LLM Circuit Analysis [0.0]
We introduce CircuChain, a diagnostic benchmark designed to disentangle instruction compliance from physical reasoning competence in electrical circuit analysis. A multi-stage verification pipeline, combining symbolic solvers, SPICE simulation, and an LLM-based error taxonomy, enables fine-grained attribution of failures to convention errors. The strongest model evaluated exhibits near-perfect physical reasoning but a high rate of convention violations when Trap conditions deliberately invert natural sign patterns.
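As a toy illustration of the symbolic-solver stage such a pipeline might use (an assumption on my part, not CircuChain's actual code), one can check a model's voltage-divider answer against a sympy derivation under the passive sign convention:

```python
# Illustrative symbolic check (hypothetical): derive the drop across R2
# in a series divider and compare it to a model's reported expression.
import sympy as sp

V, R1, R2 = sp.symbols("V R1 R2", positive=True)
i = V / (R1 + R2)                    # loop current from KVL
v_r2 = i * R2                        # drop across R2, passive sign convention

model_answer = V * R2 / (R1 + R2)    # a model's reported expression
print(sp.simplify(v_r2 - model_answer) == 0)   # True -> expressions agree
```

A sign-flipped answer would fail this check even though the underlying physics was otherwise sound, which is exactly the competence/compliance split the benchmark targets.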
arXiv Detail & Related papers (2026-01-29T06:13:44Z)
- Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows [203.3527268311731]
We present an operational SGI definition grounded in the Practical Inquiry Model (PIM). We operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. Our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.
arXiv Detail & Related papers (2025-12-18T12:44:36Z)
- ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning [118.46980291324148]
ATLAS is a large-scale, high-difficulty, and cross-disciplinary evaluation suite composed of approximately 800 original problems. Its key features include High Originality and Contamination Resistance, with all questions newly created or substantially adapted to prevent test data leakage. Preliminary results on leading models demonstrate ATLAS's effectiveness in differentiating their advanced scientific reasoning capabilities.
arXiv Detail & Related papers (2025-11-18T11:13:06Z)
- PRISM-Physics: Causal DAG-Based Process Evaluation for Physics Reasoning [57.868248683256574]
PRISM-Physics is a process-level evaluation framework and benchmark for complex physics reasoning problems. Solutions are represented as directed acyclic graphs (DAGs) of formulas. Results show that our evaluation framework is aligned with human experts' scoring.
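A minimal sketch of the DAG idea (illustrative only; the toy `dag` below is mine, not from the paper): credit a solution step only when all of its prerequisite formulas were also produced, so a skipped derivation step costs credit even when the final formula appears.

```python
# Toy formula DAG: step -> prerequisite steps (hypothetical physics example).
dag = {
    "F = m*a": [],
    "a = F/m": ["F = m*a"],
    "v = a*t": ["a = F/m"],
}

def credited(step: str, produced: set[str]) -> bool:
    """A step earns credit only if it and all its ancestors were produced."""
    return step in produced and all(credited(p, produced) for p in dag[step])

produced = {"F = m*a", "v = a*t"}          # the intermediate step was skipped
print([s for s in dag if credited(s, produced)])   # -> ['F = m*a']
```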
arXiv Detail & Related papers (2025-10-03T17:09:03Z)
- SCI-Verifier: Scientific Verifier with Thinking [37.08904000514563]
Large language models (LLMs) are increasingly applied to scientific reasoning. Existing verification studies in scientific domains suffer from two major limitations. We propose solutions at both the data and model levels.
arXiv Detail & Related papers (2025-09-29T04:58:43Z)
- Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning [53.82037883518254]
We introduce SciReas, a diverse suite of existing benchmarks for scientific reasoning tasks. We then propose KRUX, a probing framework for studying the distinct roles of reasoning and knowledge in scientific tasks.
arXiv Detail & Related papers (2025-08-26T17:04:23Z)
- Atomic Reasoning for Scientific Table Claim Verification [83.14588611859826]
Non-experts are susceptible to misleading claims based on scientific tables due to their high information density and perceived credibility. Existing table claim verification models, including state-of-the-art large language models (LLMs), often struggle with precise fine-grained reasoning. Inspired by Cognitive Load Theory, we propose that enhancing a model's ability to interpret table-based claims involves reducing cognitive load.
arXiv Detail & Related papers (2025-06-08T02:46:22Z)
- On the Rigour of Scientific Writing: Criteria, Analysis, and Insights [15.055289544883534]
Rigour is crucial for scientific research as it ensures the validity of results and findings.
We introduce a bottom-up, data-driven framework to automatically identify and define rigour criteria.
Our framework is domain-agnostic and can be tailored to the evaluation of scientific rigour for different areas.
arXiv Detail & Related papers (2024-10-07T12:22:06Z)
- SCITAB: A Challenging Benchmark for Compositional Reasoning and Claim Verification on Scientific Tables [68.76415918462418]
We present SCITAB, a challenging evaluation dataset consisting of 1.2K expert-verified scientific claims.
Through extensive evaluations, we demonstrate that SCITAB poses a significant challenge to state-of-the-art models.
Our analysis uncovers several unique challenges posed by SCITAB, including table grounding, claim ambiguity, and compositional reasoning.
arXiv Detail & Related papers (2023-05-22T16:13:50Z)