Related papers: Reasoning's Razor: Reasoning Improves Accuracy but Can Hurt Recall at Critical Operating Points in Safety and Hallucination Detection

Reasoning's Razor: Reasoning Improves Accuracy but Can Hurt Recall at Critical Operating Points in Safety and Hallucination Detection

URL: http://arxiv.org/abs/2510.21049v1
Date: Thu, 23 Oct 2025 23:23:36 GMT
Title: Reasoning's Razor: Reasoning Improves Accuracy but Can Hurt Recall at Critical Operating Points in Safety and Hallucination Detection
Authors: Atoosa Chegini, Hamid Kazemi, Garrett Souza, Maria Safi, Yang Song, Samy Bengio, Sinead Williamson, Mehrdad Farajtabar,
Abstract summary: Reasoning has become a central paradigm for large language models (LLMs)<n>We present the first systematic study of reasoning for classification tasks under strict low false positive rate regimes.<n>Our results reveal a clear trade-off: Think On (reasoning-augmented) generation improves overall accuracy, but underperforms at the low-FPR thresholds essential for practical use.
Score: 21.190105743961798
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reasoning has become a central paradigm for large language models (LLMs), consistently boosting accuracy across diverse benchmarks. Yet its suitability for precision-sensitive tasks remains unclear. We present the first systematic study of reasoning for classification tasks under strict low false positive rate (FPR) regimes. Our analysis covers two tasks--safety detection and hallucination detection--evaluated in both fine-tuned and zero-shot settings, using standard LLMs and Large Reasoning Models (LRMs). Our results reveal a clear trade-off: Think On (reasoning-augmented) generation improves overall accuracy, but underperforms at the low-FPR thresholds essential for practical use. In contrast, Think Off (no reasoning during inference) dominates in these precision-sensitive regimes, with Think On surpassing only when higher FPRs are acceptable. In addition, we find token-based scoring substantially outperforms self-verbalized confidence for precision-sensitive deployments. Finally, a simple ensemble of the two modes recovers the strengths of each. Taken together, our findings position reasoning as a double-edged tool: beneficial for average accuracy, but often ill-suited for applications requiring strict precision.

Related papers

Thinking by Subtraction: Confidence-Driven Contrastive Decoding for LLM Reasoning [58.331709210563616]
Thinking by Subtraction is a confidence-driven contrastive decoding approach.<n>A small subset of low-confidence tokens disproportionately contributes to reasoning errors and unnecessary output expansion.<n>Our method, Confidence-Driven Contrastive Decoding, detects low-confidence tokens during decoding and intervenes at these positions.
arXiv Detail & Related papers (2026-02-20T14:13:22Z)
Decomposed Prompting Does Not Fix Knowledge Gaps, But Helps Models Say "I Don't Know" [47.930782177987446]
Large language models often struggle to recognize their knowledge limits in closed-book question answering, leading to confident hallucinations.<n>We evaluate three task-equivalent prompting regimes: Direct, Assistive, and Incremental, across different model scales and multi-hop QA benchmarks.<n>Because factual knowledge is stable while hallucinations are agreement, cross-regime provides a precise signal of internal uncertainty.
arXiv Detail & Related papers (2026-02-04T18:39:58Z)
Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models [108.26461635308796]
We introduce Rationale Consistency, a fine-grained metric that quantifies the alignment between the model's reasoning process and human judgment.<n>Our evaluation of frontier models reveals that rationale consistency effectively discriminates among state-of-the-art models.<n>We introduce a hybrid signal that combines rationale consistency with outcome accuracy for GenRM training.
arXiv Detail & Related papers (2026-02-04T15:24:52Z)
A Comprehensive Evaluation of LLM Reasoning: From Single-Model to Multi-Agent Paradigms [20.241519889633285]
Large Language Models (LLMs) are increasingly deployed as reasoning systems, where reasoning paradigms play a critical role.<n>We conduct a comprehensive and unified evaluation of reasoning paradigms, spanning direct single-model generation, CoT-augmented single-model reasoning, and representative MAS.<n>We introduce MIMeBench, a new open-ended benchmark that targets two foundational yet underexplored semantic capabilities.
arXiv Detail & Related papers (2026-01-19T17:23:45Z)
Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning [32.32593439144886]
Behavior-calibrated reinforcement learning allows smaller models to surpass frontier models in uncertainty quantification.<n>Our model's log-scale Accuracy-to-Hallucination Ratio gain (0.806) exceeds GPT-5's (0.207) in a challenging in-domain evaluation.
arXiv Detail & Related papers (2025-12-22T22:51:48Z)
Rubric-Conditioned LLM Grading: Alignment, Uncertainty, and Robustness [4.129847064263056]
We systematically evaluate the performance of Large Language Models for rubric-based short-answer grading.<n>We find that alignment is strong for binary tasks but degrades with increased rubric granularity.<n>Experiments reveal that while the model is resilient to prompt injection, it is sensitive to synonym substitutions.
arXiv Detail & Related papers (2025-12-21T05:22:04Z)
Do LLMs Know They Are Being Tested? Evaluation Awareness and Incentive-Sensitive Failures in GPT-OSS-20B [1.948261185683419]
We investigate whether "evaluation scent" inflates measured performance without commensurate capability gains.<n>We run six paired A/B scenarios that hold task content and decoding fixed while varying framing.<n>We provide a reproducible A/B framework (prompt banks, validators, per-run scores, scripts) and practical guidance.
arXiv Detail & Related papers (2025-10-08T09:49:05Z)
Judging with Confidence: Calibrating Autoraters to Preference Distributions [56.17041629492863]
We argue that a reliable autorater must learn to model the full distribution of preferences defined by a target population.<n>We present two learning methods tailored to different data conditions.<n>Our results show that finetuning autoraters with a distribution-matching objective leads to verbalized probability predictions that are better aligned with the target preference distribution.
arXiv Detail & Related papers (2025-09-30T20:36:41Z)
Inducing Faithfulness in Structured Reasoning via Counterfactual Sensitivity [6.908972852063454]
Large language models often generate a correct answer while relying on a flawed or irrelevant reasoning trace.<n>This paper introduces textbfCounterfactual Sensitivity Regularization (CSR), a novel training objective.<n>CSR improves faithfulness over standard fine-tuning and process supervision by up to 70 percentage points.
arXiv Detail & Related papers (2025-09-01T15:18:46Z)
Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty [59.97939500426759]
This paper describes RLCR, an approach to training reasoning models that jointly improves accuracy and confidence estimation.<n>We show that across diverse datasets, RLCR substantially improves calibration with no loss in accuracy.<n>We also demonstrate that verbalized confidence can be leveraged at test time to improve accuracy and calibration.
arXiv Detail & Related papers (2025-07-22T17:56:01Z)
Towards Evaluting Fake Reasoning Bias in Language Models [47.482898076525494]
We show that models favor the surface structure of reasoning even when the logic is flawed.<n>We introduce THEATER, a benchmark that systematically investigates Fake Reasoning Bias (FRB)<n>We evaluate 17 advanced Large Language Models (LRMs) on both subjective DPO and factual datasets.
arXiv Detail & Related papers (2025-07-18T09:06:10Z)
Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models [0.6091702876917281]
Large Language Models (LLMs) show remarkable proficiency in natural language tasks.<n>Overconfidence-misalignment between predicted confidence and true correctness poses significant risks in critical decision-making applications.<n>We present a comprehensive analysis on calibration in LLMs across nine LLMs and three factual Question-Answering datasets.
arXiv Detail & Related papers (2025-02-16T07:46:09Z)
Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales. We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
Are We Really Achieving Better Beyond-Accuracy Performance in Next Basket Recommendation? [57.91114305844153]
Next basket recommendation (NBR) is a special type of sequential recommendation that is increasingly receiving attention. Recent studies into NBR have found a substantial performance difference between recommending repeat items and explore items. We propose a plug-and-play two-step repetition-exploration framework that treats repeat items and explores items separately.
arXiv Detail & Related papers (2024-05-02T09:59:35Z)
Worst Case Matters for Few-Shot Recognition [27.023352955311502]
Few-shot recognition learns a recognition model with very few (e.g., 1 or 5) images per category. Current few-shot learning methods focus on improving the average accuracy over many episodes. We argue that in real-world applications we may often only try one episode instead of many, and hence maximizing the worst-case accuracy is more important than maximizing the average accuracy.
arXiv Detail & Related papers (2022-03-13T05:39:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.