Self-Evaluating LLMs for Multi-Step Tasks: Stepwise Confidence Estimation for Failure Detection
- URL: http://arxiv.org/abs/2511.07364v1
- Date: Mon, 10 Nov 2025 18:19:51 GMT
- Title: Self-Evaluating LLMs for Multi-Step Tasks: Stepwise Confidence Estimation for Failure Detection
- Authors: Vaibhav Mavi, Shubh Jaroria, Weiqi Sun
- Abstract summary: Self-evaluating large language models (LLMs) provide meaningful confidence estimates in complex reasoning. Stepwise evaluation generally outperforms holistic scoring in detecting potential errors.
- Score: 1.1087735229999818
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reliability and failure detection of large language models (LLMs) are critical for their deployment in high-stakes, multi-step reasoning tasks. Prior work explores confidence estimation for self-evaluating LLM-scorer systems, with confidence scorers estimating the likelihood of errors in LLM responses. However, most methods focus on single-step outputs and overlook the challenges of multi-step reasoning. In this work, we extend self-evaluation techniques to multi-step tasks, testing two intuitive approaches: holistic scoring and step-by-step scoring. Using two multi-step benchmark datasets, we show that stepwise evaluation generally outperforms holistic scoring in detecting potential errors, with up to a 15% relative increase in AUC-ROC. Our findings demonstrate that self-evaluating LLM systems provide meaningful confidence estimates in complex reasoning, improving their trustworthiness and providing a practical framework for failure detection.
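The comparison described in the abstract (holistic vs. stepwise self-evaluation, ranked by AUC-ROC for failure detection) can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the authors' code: the `call_llm` helper, the prompt wording, and the min-score aggregation of step confidences are hypothetical choices; only the AUC-ROC computation via scikit-learn's `roc_auc_score` is standard.

```python
# Minimal sketch (not the paper's implementation): holistic vs. stepwise
# self-evaluation of a multi-step solution, compared by AUC-ROC on gold labels.
from sklearn.metrics import roc_auc_score

def call_llm(prompt: str) -> str:
    """Hypothetical judge-model call; plug in your own LLM client here."""
    raise NotImplementedError("replace with an actual LLM call")

def holistic_confidence(question: str, steps: list[str]) -> float:
    """One confidence score for the entire reasoning chain."""
    prompt = (
        f"Question: {question}\nSolution:\n" + "\n".join(steps) +
        "\nOn a scale of 0 to 1, how likely is this solution correct? Reply with a number."
    )
    return float(call_llm(prompt))

def stepwise_confidence(question: str, steps: list[str]) -> float:
    """Score each step separately, then aggregate (here: the weakest step dominates)."""
    scores = []
    for i in range(1, len(steps) + 1):
        prompt = (
            f"Question: {question}\nSteps so far:\n" + "\n".join(steps[:i]) +
            f"\nOn a scale of 0 to 1, how likely is step {i} correct? Reply with a number."
        )
        scores.append(float(call_llm(prompt)))
    return min(scores)

def auc_for_scorer(examples, scorer) -> float:
    """`examples` holds (question, steps, is_correct) triples with gold correctness labels."""
    y_true = [int(correct) for _, _, correct in examples]
    y_score = [scorer(q, steps) for q, steps, _ in examples]
    return roc_auc_score(y_true, y_score)
```

Taking the minimum step score is just one plausible aggregation; averaging or multiplying step confidences is equally reasonable, and the paper may use a different scheme.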
Related papers
- Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads [104.9566359759396]
We propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores. Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification.
arXiv Detail & Related papers (2025-11-09T03:38:29Z) - BrowseConf: Confidence-Guided Test-Time Scaling for Web Agents [58.05949210993854]
We investigate whether search agents have the ability to communicate their own confidence through verbalized confidence scores after long sequences of actions. We propose Test-Time Scaling (TTS) methods that use confidence scores to determine answer quality, encouraging the model to try again until it reaches a satisfactory confidence level (a generic sketch of this retry pattern appears after the last entry in this list).
arXiv Detail & Related papers (2025-10-27T15:58:51Z) - Enhancing Uncertainty Estimation in LLMs with Expectation of Aggregated Internal Belief [6.1929548590367505]
Large Language Models (LLMs) have achieved remarkable success across a wide range of natural language tasks, but often exhibit overconfidence and generate plausible yet incorrect answers. This overconfidence poses significant challenges for reliable uncertainty estimation and safe deployment. We propose a novel self-evaluation-based calibration method that leverages the internal hidden states of LLMs to derive more accurate confidence scores.
arXiv Detail & Related papers (2025-09-01T15:50:10Z) - Mind the Generation Process: Fine-Grained Confidence Estimation During LLM Generation [63.49409574310576]
Large language models (LLMs) exhibit overconfidence, assigning high confidence scores to incorrect predictions. We introduce FineCE, a novel confidence estimation method that delivers accurate, fine-grained confidence scores during text generation. Our code and all baselines used in the paper are available on GitHub.
arXiv Detail & Related papers (2025-08-16T13:29:35Z) - MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration [2.1824579248418017]
We present MMBoundary, a novel framework that advances the knowledge boundary awareness of MLLMs through reasoning step confidence calibration. In addition to supervised fine-tuning, we introduce a reinforcement learning stage with multiple reward functions for further aligning model knowledge. Empirical results show that MMBoundary significantly outperforms existing methods across diverse domain datasets and metrics.
arXiv Detail & Related papers (2025-05-29T08:14:40Z) - PredictaBoard: Benchmarking LLM Score Predictability [50.47497036981544]
Large Language Models (LLMs) often fail unpredictably. This poses a significant challenge to ensuring their safe deployment. We present PredictaBoard, a novel collaborative benchmarking framework.
arXiv Detail & Related papers (2025-02-20T10:52:38Z) - Do LLMs estimate uncertainty well in instruction-following? [9.081508933326644]
Large language models (LLMs) could be valuable personal AI agents across various domains, provided they can precisely follow user instructions. We present the first systematic evaluation of the uncertainty estimation abilities of LLMs in the context of instruction-following. Our findings show that existing uncertainty methods struggle, particularly when models make subtle errors in instruction following.
arXiv Detail & Related papers (2024-10-18T16:32:10Z) - Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge [84.34545223897578]
Despite the excellence of LLM judges in many domains, potential issues remain under-explored, undermining their reliability and the scope of their utility.
We identify 12 key potential biases and propose a new automated bias quantification framework, CALM, which quantifies and analyzes each type of bias in LLM-as-a-Judge.
Our work highlights the need for stakeholders to address these issues and reminds users to exercise caution in LLM-as-a-Judge applications.
arXiv Detail & Related papers (2024-10-03T17:53:30Z) - Confidence Estimation for LLM-Based Dialogue State Tracking [9.305763502526833]
Estimation of a model's confidence on its outputs is critical for Conversational AI systems based on large language models (LLMs).
We provide an exhaustive exploration of methods, including approaches proposed for open- and closed-weight LLMs.
Our findings suggest that fine-tuning open-weight LLMs can result in enhanced AUC performance, indicating better confidence score calibration.
arXiv Detail & Related papers (2024-09-15T06:44:26Z) - Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models. It addresses two key challenges: the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z) - Evaluating Uncertainty-based Failure Detection for Closed-Loop LLM Planners [10.746821861109176]
Large Language Models (LLMs) have demonstrated remarkable performance as zero-shot task planners for robotic tasks. However, the open-loop nature of previous works makes LLM-based planning error-prone and fragile. In this work, we introduce a framework for closed-loop LLM-based planning called KnowLoop, backed by an uncertainty-based MLLM failure detector.
arXiv Detail & Related papers (2024-06-01T12:52:06Z) - Think Twice Before Trusting: Self-Detection for Large Language Models through Comprehensive Answer Reflection [90.71323430635593]
We propose a novel self-detection paradigm that considers the comprehensive answer space beyond LLM-generated answers.
Building upon this paradigm, we introduce a two-step framework, which first instructs the LLM to reflect and provide justifications for each candidate answer.
This framework can be seamlessly integrated with existing approaches for superior self-detection.
arXiv Detail & Related papers (2024-03-15T02:38:26Z)
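The BrowseConf entry above describes confidence-guided test-time scaling only at a high level; the sketch below shows the generic retry-until-confident pattern it implies. The function name, threshold, and `ask_llm` interface are assumptions made for illustration, not BrowseConf's actual method.

```python
# Minimal sketch (an assumption, not BrowseConf's implementation) of
# confidence-guided test-time scaling: re-attempt a task until the model's
# verbalized confidence clears a threshold or the retry budget runs out.
def answer_with_retries(task: str, ask_llm, max_attempts: int = 3,
                        threshold: float = 0.8):
    """`ask_llm(task)` is a hypothetical callable returning (answer, confidence in [0, 1])."""
    best_answer, best_conf = None, -1.0
    for _ in range(max_attempts):
        answer, conf = ask_llm(task)
        if conf > best_conf:               # keep the most confident attempt seen so far
            best_answer, best_conf = answer, conf
        if conf >= threshold:              # satisfactory confidence: stop early
            break
    return best_answer, best_conf
```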