Reasoning or Fluency? Dissecting Probabilistic Confidence in Best-of-N Selection
- URL: http://arxiv.org/abs/2601.13735v1
- Date: Tue, 20 Jan 2026 08:46:33 GMT
- Title: Reasoning or Fluency? Dissecting Probabilistic Confidence in Best-of-N Selection
- Authors: Hojin Kim, Jaehyung Kim
- Abstract summary: We introduce three classes of inter-step causality perturbations that systematically disrupt dependencies between reasoning steps. We find that selection accuracy degrades only marginally under these disruptions. We propose a contrastive causality metric that explicitly isolates inter-step causal dependencies.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Probabilistic confidence metrics are increasingly adopted as proxies for reasoning quality in Best-of-N selection, under the assumption that higher confidence reflects higher reasoning fidelity. In this work, we challenge this assumption by investigating whether these metrics truly capture inter-step causal dependencies necessary for valid reasoning. We introduce three classes of inter-step causality perturbations that systematically disrupt dependencies between reasoning steps while preserving local fluency. Surprisingly, across diverse model families and reasoning benchmarks, we find that selection accuracy degrades only marginally under these disruptions. Even severe interventions, such as applying hard attention masks that directly prevent the model from attending to prior reasoning steps, do not substantially reduce selection performance. These findings provide strong evidence that current probabilistic metrics are largely insensitive to logical structure, and primarily capture surface-level fluency or in-distribution priors instead. Motivated by this gap, we propose a contrastive causality metric that explicitly isolates inter-step causal dependencies, and demonstrate that it yields more faithful output selection than existing probability-based approaches.
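The "probabilistic confidence" baseline the abstract critiques is commonly instantiated as mean token log-probability over a sampled reasoning trace. The sketch below is an illustrative minimal version of that standard scoring rule, not the paper's own implementation; the candidate names and log-probability values are invented for the toy example, which shows how a fluent trace can outscore a causally sound but less fluent one.

```python
def mean_logprob(token_logprobs):
    """Average per-token log-probability: a common 'confidence' proxy."""
    return sum(token_logprobs) / len(token_logprobs)

def best_of_n(candidates):
    """Best-of-N selection: return the candidate whose trace has the
    highest mean token log-probability. `candidates` maps an answer
    string to the token log-probabilities assigned while generating it."""
    return max(candidates, key=lambda ans: mean_logprob(candidates[ans]))

# Toy illustration of the paper's concern: the fluent trace wins on
# confidence even if its reasoning steps are causally disconnected.
candidates = {
    "fluent-but-wrong": [-0.10, -0.20, -0.10, -0.15],
    "sound-but-awkward": [-0.90, -1.10, -0.80, -1.00],
}
print(best_of_n(candidates))  # -> fluent-but-wrong
```

Any monotone transform of sequence likelihood (sum, mean, or length-normalized log-probability) behaves the same way here, which is consistent with the abstract's claim that such metrics track surface fluency rather than inter-step causal structure.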
Related papers
- Instrumental and Proximal Causal Inference with Gaussian Processes [24.834836610250765]
We propose a framework for uncertainty-aware causal learning. Our formulation recovers popular kernel estimators as the posterior mean, ensuring predictive precision. Empirical results demonstrate strong predictive performance alongside informative EU quantification.
arXiv Detail & Related papers (2026-03-02T18:23:26Z) - Towards Anytime-Valid Statistical Watermarking [63.02116925616554]
We develop the first e-value-based watermarking framework, Anchored E-Watermarking, that unifies optimal sampling with anytime-valid inference. Our framework can significantly enhance sample efficiency, reducing the average token budget required for detection by 13-15% relative to state-of-the-art baselines.
arXiv Detail & Related papers (2026-02-19T18:32:26Z) - Effect-Level Validation for Causal Discovery [1.8192444294441061]
Causal discovery is increasingly applied to large-scale telemetry data to estimate the effects of user-facing interventions. But its reliability for decision-making in feedback-driven systems with strong self-selection remains unclear. We propose an effect-centric, admissibility-first framework that treats discovered graphs as structural hypotheses.
arXiv Detail & Related papers (2026-02-09T07:26:55Z) - The Silent Scholar Problem: A Probabilistic Framework for Breaking Epistemic Asymmetry in LLM Agents [0.6117371161379209]
We propose a formal probabilistic framework that provides agents with a non-altruistic motive for bidirectional knowledge exchange. We show how these accumulated belief states serve as verifiable reward signals for Reinforcement Learning from Human Feedback (RLHF) and high-quality data filters for Supervised Fine-Tuning (SFT). Simulation results validate that this uncertainty-driven strategy significantly outperforms random baselines in heterogeneous environments.
arXiv Detail & Related papers (2025-12-24T02:02:25Z) - ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning [2.1461777157838724]
We introduce ReasonBENCH, the first benchmark designed to quantify the underlying instability in large language model (LLM) reasoning. Across tasks from different domains, we find that the vast majority of reasoning strategies and models exhibit high instability. We further analyze the impact of prompts, model families, and scale on the trade-off between solve rate and stability.
arXiv Detail & Related papers (2025-12-08T18:26:58Z) - How Reliable are Causal Probing Interventions? [5.599792629509229]
Causal probing aims to analyze foundation models by examining how intervening on their representations impacts their outputs. Recent works have cast doubt on the theoretical basis of several leading causal probing methods.
arXiv Detail & Related papers (2024-08-28T03:45:49Z) - When Does Confidence-Based Cascade Deferral Suffice? [69.28314307469381]
Cascades are a classical strategy to enable inference cost to vary adaptively across samples.
A deferral rule determines whether to invoke the next classifier in the sequence, or to terminate prediction.
Despite being oblivious to the structure of the cascade, confidence-based deferral often works remarkably well in practice.
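The deferral rule described above can be sketched concretely: run the cheapest model first and invoke the next one only when the top-class confidence falls below a threshold. This is a generic illustration of confidence-based deferral, not the paper's experimental setup; the models, class names, and threshold are invented stand-ins.

```python
def cascade_predict(models, x, thresholds):
    """Run models cheapest-first. Stop at the first model whose top-class
    confidence clears its threshold; otherwise fall through to the last
    model. Each model maps an input to a dict of class -> probability."""
    for model, tau in zip(models[:-1], thresholds):
        probs = model(x)
        label, conf = max(probs.items(), key=lambda kv: kv[1])
        if conf >= tau:          # confident enough: terminate prediction here
            return label
    probs = models[-1](x)        # every deferral rule fired: use final model
    return max(probs, key=probs.get)

# Illustrative stand-ins for a cheap and a strong classifier.
cheap = lambda x: {"cat": 0.55, "dog": 0.45}
strong = lambda x: {"cat": 0.10, "dog": 0.90}
print(cascade_predict([cheap, strong], "img.png", thresholds=[0.8]))  # -> dog
```

Note that the rule never inspects the downstream model, which is exactly the "oblivious to the structure of the cascade" property the abstract highlights.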
arXiv Detail & Related papers (2023-07-06T04:13:57Z) - Advancing Counterfactual Inference through Nonlinear Quantile Regression [77.28323341329461]
We propose a framework for efficient and effective counterfactual inference implemented with neural networks.
The proposed approach enhances the capacity to generalize estimated counterfactual outcomes to unseen data.
Empirical results conducted on multiple datasets offer compelling support for our theoretical assertions.
arXiv Detail & Related papers (2023-06-09T08:30:51Z) - Fairness and robustness in anti-causal prediction [73.693135253335]
Robustness to distribution shift and fairness have independently emerged as two important desiderata required of machine learning models.
While these two desiderata seem related, the connection between them is often unclear in practice.
By taking this perspective, we draw explicit connections between a common fairness criterion - separation - and a common notion of robustness.
arXiv Detail & Related papers (2022-09-20T02:41:17Z) - Multi-label Chaining with Imprecise Probabilities [0.0]
We present two different strategies to extend the classical multi-label chaining approach to handle imprecise probability estimates.
The main reasons for using such estimates are (1) to make cautious predictions when high uncertainty is detected in the chaining, and (2) to make better precise predictions by avoiding biases introduced by early decisions in the chain.
Our experimental results on missing labels, which investigate how reliable these predictions are in both approaches, indicate that our approaches produce relevant cautiousness on those hard-to-predict instances where the precise models fail.
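The cautious-prediction idea above can be illustrated with per-label probability intervals rather than point estimates: commit to a label only when the whole interval sits on one side of the decision threshold, and abstain otherwise. This is a minimal sketch of the general imprecise-probability decision rule, not the chaining algorithm from the paper; the label names and intervals are invented.

```python
def cautious_predict(intervals, threshold=0.5):
    """Given per-label probability intervals (lo, hi), output 1 if the whole
    interval is at or above the threshold, 0 if it is entirely below, and
    None (abstain) when the interval straddles the threshold."""
    out = {}
    for label, (lo, hi) in intervals.items():
        if lo >= threshold:
            out[label] = 1
        elif hi < threshold:
            out[label] = 0
        else:
            out[label] = None  # interval straddles 0.5: cautiously abstain
    return out

print(cautious_predict({"rain": (0.7, 0.9), "wind": (0.3, 0.6), "fog": (0.1, 0.2)}))
```

With precise probabilities every interval collapses to a point and the rule reduces to ordinary thresholding; the abstentions appear exactly on the hard-to-predict labels the abstract refers to.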
arXiv Detail & Related papers (2021-07-15T16:43:31Z) - Deconfounding Scores: Feature Representations for Causal Effect Estimation with Weak Overlap [140.98628848491146]
We introduce deconfounding scores, which induce better overlap without biasing the target of estimation.
We show that deconfounding scores satisfy a zero-covariance condition that is identifiable in observed data.
In particular, we show that this technique could be an attractive alternative to standard regularizations.
arXiv Detail & Related papers (2021-04-12T18:50:11Z) - Preferential Structures for Comparative Probabilistic Reasoning [2.0646127669654826]
We show that a natural modification of the preferential approach yields exactly the same logical system as a probabilistic approach.
The same preferential structures used in the study of non-monotonic logics and belief revision may be used in the study of comparative probabilistic reasoning.
arXiv Detail & Related papers (2021-04-06T05:00:20Z) - Latent Causal Invariant Model [128.7508609492542]
Current supervised learning can learn spurious correlation during the data-fitting process.
We propose a Latent Causal Invariance Model (LaCIM) which pursues causal prediction.
arXiv Detail & Related papers (2020-11-04T10:00:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.