Related papers: When Small Models Are Right for Wrong Reasons: Process Verification for Trustworthy Agents

When Small Models Are Right for Wrong Reasons: Process Verification for Trustworthy Agents

URL: http://arxiv.org/abs/2601.00513v1
Date: Thu, 01 Jan 2026 23:54:15 GMT
Title: When Small Models Are Right for Wrong Reasons: Process Verification for Trustworthy Agents
Authors: Laksh Advani,
Abstract summary: We reveal a critical reliability crisis: 50-69% of correct answers from small language models contain fundamentally flawed reasoning.<n>We introduce the Reasoning Integrity Score (RIS), a process-based metric validated with substantial inter-rater agreement.<n>We show RAG succeeds by grounding calculations in external evidence, reducing errors by 7.6%, while meta-cognition amplifies confusion without sufficient model capacity.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deploying small language models (7-9B parameters) as autonomous agents requires trust in their reasoning, not just their outputs. We reveal a critical reliability crisis: 50-69\% of correct answers from these models contain fundamentally flawed reasoning -- a ``Right-for-Wrong-Reasons'' phenomenon invisible to standard accuracy metrics. Through analysis of 10,734 reasoning traces across three models and diverse tasks, we introduce the Reasoning Integrity Score (RIS), a process-based metric validated with substantial inter-rater agreement ($κ=0.657$). Conventional practices are challenged by our findings: while retrieval-augmented generation (RAG) significantly improves reasoning integrity (Cohen's $d=0.23$--$0.93$), meta-cognitive interventions like self-critique often harm performance ($d=-0.14$ to $-0.33$) in small models on the evaluated tasks. Mechanistic analysis reveals RAG succeeds by grounding calculations in external evidence, reducing errors by 7.6\%, while meta-cognition amplifies confusion without sufficient model capacity. To enable deployment, verification capabilities are distilled into a neural classifier achieving 0.86 F1-score with 100$\times$ speedup. These results underscore the necessity of process-based verification for trustworthy agents: accuracy alone is dangerously insufficient when models can be right for entirely wrong reasons.

Related papers

RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models [5.733004743054914]
Large Reasoning Models (LRMs) exhibit strong performance, yet often produce rationales that sound plausible but fail to reflect their true decision process.<n>We introduce a formal framework for reasoning faithfulness, defined by two testable conditions.<n>We present RFEval, a benchmark of 7,186 instances that probes faithfulness via controlled, output-level counterfactual interventions.
arXiv Detail & Related papers (2026-02-19T03:49:37Z)
Information Fidelity in Tool-Using LLM Agents: A Martingale Analysis of the Model Context Protocol [69.11739400975445]
We introduce the first theoretical framework for analyzing error accumulation in Model Context Protocol (MCP) agents.<n>We show that cumulative distortion exhibits linear growth and high-probability deviations bounded by $O(sqrtT)$.<n>Key findings include: semantic weighting reduces distortion by 80%, and periodic re-grounding approximately every 9 steps suffices for error control.
arXiv Detail & Related papers (2026-02-10T21:08:53Z)
Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models [108.26461635308796]
We introduce Rationale Consistency, a fine-grained metric that quantifies the alignment between the model's reasoning process and human judgment.<n>Our evaluation of frontier models reveals that rationale consistency effectively discriminates among state-of-the-art models.<n>We introduce a hybrid signal that combines rationale consistency with outcome accuracy for GenRM training.
arXiv Detail & Related papers (2026-02-04T15:24:52Z)
VIRO: Robust and Efficient Neuro-Symbolic Reasoning with Verification for Referring Expression Comprehension [51.76841625486355]
Referring Expression (REC) aims to localize the image region corresponding to a natural-language query.<n>Recent neuro-symbolic REC approaches leverage large language models (LLMs) and vision-language models (VLMs) to perform compositional reasoning.<n>We introduce VIRO, a neuro-symbolic framework that embeds lightweight operator-level verifiers within reasoning steps.
arXiv Detail & Related papers (2026-01-19T07:21:19Z)
The Drill-Down and Fabricate Test (DDFT): A Protocol for Measuring Epistemic Robustness in Language Models [0.0]
Current language model evaluations measure what models know under ideal conditions but not how robustly they know it under realistic stress.<n>We introduce the Drill-Down Fabricate Test (DDFT), a protocol that measures robustness.<n>We find flagship models exhibit brittleness despite their scale, while smaller models can achieve robust performance.
arXiv Detail & Related papers (2025-12-29T20:29:09Z)
Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning [32.32593439144886]
Behavior-calibrated reinforcement learning allows smaller models to surpass frontier models in uncertainty quantification.<n>Our model's log-scale Accuracy-to-Hallucination Ratio gain (0.806) exceeds GPT-5's (0.207) in a challenging in-domain evaluation.
arXiv Detail & Related papers (2025-12-22T22:51:48Z)
Reasoning's Razor: Reasoning Improves Accuracy but Can Hurt Recall at Critical Operating Points in Safety and Hallucination Detection [21.190105743961798]
Reasoning has become a central paradigm for large language models (LLMs)<n>We present the first systematic study of reasoning for classification tasks under strict low false positive rate regimes.<n>Our results reveal a clear trade-off: Think On (reasoning-augmented) generation improves overall accuracy, but underperforms at the low-FPR thresholds essential for practical use.
arXiv Detail & Related papers (2025-10-23T23:23:36Z)
Certainty-Guided Reasoning in Large Language Models: A Dynamic Thinking Budget Approach [0.15749416770494704]
We show that Certainty-Guided Reasoning (CGR) improves baseline accuracy while reducing token usage.<n>CGR can eliminate millions of tokens in aggregate, with tunable trade-offs between certainty thresholds and efficiency.<n>By integrating confidence into the reasoning process, CGR makes large reasoning language models more adaptive, trustworthy, and resource efficient.
arXiv Detail & Related papers (2025-09-09T14:57:15Z)
Towards Evaluting Fake Reasoning Bias in Language Models [47.482898076525494]
We show that models favor the surface structure of reasoning even when the logic is flawed.<n>We introduce THEATER, a benchmark that systematically investigates Fake Reasoning Bias (FRB)<n>We evaluate 17 advanced Large Language Models (LRMs) on both subjective DPO and factual datasets.
arXiv Detail & Related papers (2025-07-18T09:06:10Z)
Reasoning Models Are More Easily Gaslighted Than You Think [85.84943447589511]
We evaluate three state-of-the-art reasoning models, including OpenAI's o4-mini, Claude-3.7-Sonnet and Gemini-2.5-Flash.<n>Our evaluation reveals significant accuracy drops following gaslighting negation prompts.<n>We introduce GaslightingBench-R, a new diagnostic benchmark designed to evaluate reasoning models' susceptibility to defend their belief.
arXiv Detail & Related papers (2025-06-11T12:52:25Z)
Rationales Are Not Silver Bullets: Measuring the Impact of Rationales on Model Performance and Reliability [70.4107059502882]
Training language models with rationales augmentation has been shown to be beneficial in many existing works.<n>We conduct comprehensive investigations to thoroughly inspect the impact of rationales on model performance.
arXiv Detail & Related papers (2025-05-30T02:39:37Z)
Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling [41.19330514054401]
Large language models (LLMs) are prone to hallucination stemming from misaligned self-awareness.<n>We propose the Explicit Knowledge Boundary Modeling framework to integrate fast and slow reasoning systems to harmonize reliability and usability.
arXiv Detail & Related papers (2025-03-04T03:16:02Z)
Bridging Internal Probability and Self-Consistency for Effective and Efficient LLM Reasoning [53.25336975467293]
We present the first theoretical error decomposition analysis of methods such as perplexity and self-consistency.<n>Our analysis reveals a fundamental trade-off: perplexity methods suffer from substantial model error due to the absence of a proper consistency function.<n>We propose Reasoning-Pruning Perplexity Consistency (RPC), which integrates perplexity with self-consistency, and Reasoning Pruning, which eliminates low-probability reasoning paths.
arXiv Detail & Related papers (2025-02-01T18:09:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.