Knowing When Not to Answer: Abstention-Aware Scientific Reasoning
- URL: http://arxiv.org/abs/2602.14189v1
- Date: Sun, 15 Feb 2026 15:29:43 GMT
- Title: Knowing When Not to Answer: Abstention-Aware Scientific Reasoning
- Authors: Samir Abdaljalil, Erchin Serpedin, Hasan Kurban
- Abstract summary: In scientific settings, unsupported or uncertain conclusions can be more harmful than abstaining. We study this problem through an abstention-aware verification framework. We evaluate this framework across two scientific benchmarks: SciFact and PubMedQA.
- Score: 2.680633756465714
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models are increasingly used to answer and verify scientific claims, yet existing evaluations typically assume that a model must always produce a definitive answer. In scientific settings, however, unsupported or uncertain conclusions can be more harmful than abstaining. We study this problem through an abstention-aware verification framework that decomposes scientific claims into minimal conditions, audits each condition against available evidence using natural language inference (NLI), and selectively decides whether to support, refute, or abstain. We evaluate this framework across two complementary scientific benchmarks: SciFact and PubMedQA, covering both closed-book and open-domain evidence settings. Experiments are conducted with six diverse language models, including encoder-decoder, open-weight chat models, and proprietary APIs. Across all benchmarks and models, we observe that raw accuracy varies only modestly across architectures, while abstention plays a critical role in controlling error. In particular, confidence-based abstention substantially reduces risk at moderate coverage levels, even when absolute accuracy improvements are limited. Our results suggest that in scientific reasoning tasks, the primary challenge is not selecting a single best model, but rather determining when available evidence is sufficient to justify an answer. This work highlights abstention-aware evaluation as a practical and model-agnostic lens for assessing scientific reliability, and provides a unified experimental basis for future work on selective reasoning in scientific domains. Code is available at https://github.com/sabdaljalil2000/ai4science .
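To make the pipeline concrete, here is a minimal sketch of the condition-level NLI audit described in the abstract. The decomposition step is assumed to have already produced atomic conditions; the model choice (roberta-large-mnli), the threshold tau, and the aggregation rule are illustrative assumptions, not the authors' implementation.

```python
from transformers import pipeline

# Off-the-shelf MNLI model as a stand-in; the paper's model choices may differ.
nli = pipeline("text-classification", model="roberta-large-mnli")

def verify(conditions, evidence, tau=0.8):
    """Support/refute/abstain over pre-decomposed claim conditions.

    conditions: list of atomic sub-claims (the decomposition step is assumed done)
    evidence:   retrieved evidence text serving as the NLI premise
    tau:        illustrative confidence threshold below which we abstain
    """
    labels = []
    for cond in conditions:
        scores = {d["label"]: d["score"]
                  for d in nli({"text": evidence, "text_pair": cond}, top_k=None)}
        label = max(scores, key=scores.get)
        if scores[label] < tau:
            return "ABSTAIN"          # evidence too ambiguous for this condition
        labels.append(label)
    if all(l == "ENTAILMENT" for l in labels):
        return "SUPPORT"              # every condition entailed by the evidence
    if any(l == "CONTRADICTION" for l in labels):
        return "REFUTE"               # at least one condition contradicted
    return "ABSTAIN"                  # neutral evidence: insufficient to decide
```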
Related papers
- SciIF: Benchmarking Scientific Instruction Following Towards Rigorous Scientific Intelligence [60.202862987441684]
We introduce scientific instruction following: the capability to solve problems while strictly adhering to the constraints that establish scientific validity. Specifically, we introduce SciIF, a multi-discipline benchmark that evaluates this capability by pairing university-level problems with a fixed catalog of constraints. By measuring both solution correctness and multi-constraint adherence, SciIF enables fine-grained diagnosis of compositional reasoning failures.
arXiv Detail & Related papers (2026-01-08T09:45:58Z) - AlignCheck: a Semantic Open-Domain Metric for Factual Consistency Assessment [0.0]
We propose an interpretable framework for factual consistency assessment of in-domain and open-domain texts. Our approach decomposes text into atomic facts and introduces a flexible, schema-free methodology. We benchmark our approach on popular general and clinical datasets and release our code to support fact-aware model training.
arXiv Detail & Related papers (2025-12-03T10:14:31Z) - Catch Me If You Can: How Smaller Reasoning Models Pretend to Reason with Mathematical Fidelity [15.774418410083515]
We introduce a diagnostic framework that distinguishes genuine mathematical reasoning from superficial pattern matching. We reveal a striking disconnect between surface performance and reasoning fidelity. Our diagnostics expose reasoning failures invisible to traditional accuracy metrics.
arXiv Detail & Related papers (2025-11-29T16:47:01Z) - Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework [55.078301794183496]
We focus on a core reviewing skill that underpins high-quality peer review: detecting faulty research logic. This involves evaluating the internal consistency between a paper's results, interpretations, and claims. We present a fully automated counterfactual evaluation framework that isolates and tests this skill under controlled conditions.
arXiv Detail & Related papers (2025-08-29T08:48:00Z) - Atomic Reasoning for Scientific Table Claim Verification [83.14588611859826]
Non-experts are susceptible to misleading claims based on scientific tables due to their high information density and perceived credibility. Existing table claim verification models, including state-of-the-art large language models (LLMs), often struggle with precise fine-grained reasoning. Inspired by Cognitive Load Theory, we propose that enhancing a model's ability to interpret table-based claims involves reducing cognitive load.
arXiv Detail & Related papers (2025-06-08T02:46:22Z) - Prediction-Powered Causal Inferences [59.98498488132307]
We focus on Prediction-Powered Causal Inferences (PPCI). We first show that conditional calibration guarantees valid PPCI at the population level. We then introduce a sufficient representation constraint transferring validity across experiments.
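PPCI builds on prediction-powered inference. As background only (this is the basic mean estimator of Angelopoulos et al., 2023, not this paper's causal extension), the idea is to debias predictions on a large unlabeled sample using a small labeled rectifier sample:

```python
import numpy as np

def ppi_mean(preds_unlabeled, preds_labeled, labels):
    """Basic prediction-powered estimate of a population mean.

    preds_unlabeled: model predictions f(X~) on a large unlabeled sample
    preds_labeled:   model predictions f(X) on a small labeled sample
    labels:          ground-truth outcomes Y for the labeled sample
    """
    rectifier = np.mean(np.asarray(preds_labeled) - np.asarray(labels))
    return np.mean(preds_unlabeled) - rectifier  # debiased mean estimate
```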
arXiv Detail & Related papers (2025-02-10T10:52:17Z) - FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees [41.78390564658645]
The tendency of Large Language Models (LLMs) to generate hallucinations and non-factual content undermines their reliability in high-stakes domains.
We introduce FactTest, a novel framework that statistically assesses whether an LLM can confidently provide correct answers to given questions.
We show that FactTest effectively detects hallucinations and improves the model's ability to abstain from answering unknown questions, yielding accuracy improvements of over 40%.
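As a rough illustration of this answer-or-abstain behavior (not FactTest's actual statistical test, which comes with finite-sample guarantees), agreement among repeated samples can serve as a certainty proxy; llm_sample, k, and tau below are hypothetical:

```python
from collections import Counter

def selective_answer(llm_sample, question, k=10, tau=0.7):
    """Answer only when repeated LLM samples agree often enough.

    llm_sample: any function mapping a question to one sampled answer string
    k, tau:     illustrative sample count and certainty threshold
    """
    answers = [llm_sample(question) for _ in range(k)]
    best, count = Counter(answers).most_common(1)[0]
    return best if count / k >= tau else "ABSTAIN"  # abstain on uncertain questions
```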
arXiv Detail & Related papers (2024-11-04T20:53:04Z) - Smoke and Mirrors in Causal Downstream Tasks [59.90654397037007]
This paper looks at the causal inference task of treatment effect estimation, where the outcome of interest is recorded in high-dimensional observations. We compare 6,480 models fine-tuned from state-of-the-art visual backbones, and find that the sampling and modeling choices significantly affect the accuracy of the causal estimate. Our results suggest that future benchmarks should carefully consider real downstream scientific questions, especially causal ones.
arXiv Detail & Related papers (2024-05-27T13:26:34Z) - Large Language Models for Automated Open-domain Scientific Hypotheses Discovery [50.40483334131271]
This work proposes the first dataset for social science academic hypotheses discovery.
Unlike previous settings, the new dataset requires (1) using open-domain data (raw web corpus) as observations; and (2) proposing hypotheses even new to humanity.
A multi-module framework is developed for the task, including three different feedback mechanisms to boost performance.
arXiv Detail & Related papers (2023-09-06T05:19:41Z) - Testing Causality in Scientific Modelling Software [0.26388783516590225]
The Causal Testing Framework uses causal inference techniques to establish causal effects from existing data.
We present three case studies covering real-world scientific models, demonstrating how the Causal Testing Framework can infer metamorphic test outcomes.
arXiv Detail & Related papers (2022-09-01T10:57:54Z) - RerrFact: Reduced Evidence Retrieval Representations for Scientific Claim Verification [4.052777228128475]
We propose a modular approach that sequentially carries out binary classification for every prediction subtask.
We carry out two-step stance predictions that first differentiate non-relevant rationales and then identify supporting or refuting rationales for a given claim.
Experimentally, our system RerrFact, with no fine-tuning, a simple design, and a fraction of the model parameters, fares competitively on the leaderboard.
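A schematic of the two-step stance prediction described above; is_relevant and supports are placeholder classifiers standing in for RerrFact's binary models, and the majority vote is an illustrative aggregation choice, not necessarily the authors':

```python
def predict_stance(claim, rationales, is_relevant, supports):
    """Two-step stance prediction: filter rationales, then classify stance.

    is_relevant(claim, sent) -> bool  # step 1: binary relevance filter
    supports(claim, sent)    -> bool  # step 2: support vs. refute
    """
    kept = [s for s in rationales if is_relevant(claim, s)]
    if not kept:
        return "NOT_ENOUGH_INFO"      # no relevant rationales survive step 1
    votes = sum(supports(claim, s) for s in kept)
    return "SUPPORT" if votes > len(kept) / 2 else "REFUTE"
```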
arXiv Detail & Related papers (2022-02-05T21:52:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.