Self-Evaluating LLMs for Multi-Step Tasks: Stepwise Confidence Estimation for Failure Detection
- URL: http://arxiv.org/abs/2511.07364v1
- Date: Mon, 10 Nov 2025 18:19:51 GMT
- Title: Self-Evaluating LLMs for Multi-Step Tasks: Stepwise Confidence Estimation for Failure Detection
- Authors: Vaibhav Mavi, Shubh Jaroria, Weiqi Sun
- Abstract summary: Self-evaluating large language models (LLMs) provide meaningful confidence estimates in complex reasoning. Stepwise evaluation generally outperforms holistic scoring in detecting potential errors.
- Score: 1.1087735229999818
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reliability and failure detection of large language models (LLMs) are critical for their deployment in high-stakes, multi-step reasoning tasks. Prior work explores confidence estimation for self-evaluating LLM-scorer systems, with confidence scorers estimating the likelihood of errors in LLM responses. However, most methods focus on single-step outputs and overlook the challenges of multi-step reasoning. In this work, we extend self-evaluation techniques to multi-step tasks, testing two intuitive approaches: holistic scoring and step-by-step scoring. Using two multi-step benchmark datasets, we show that stepwise evaluation generally outperforms holistic scoring in detecting potential errors, with up to a 15% relative increase in AUC-ROC. Our findings demonstrate that self-evaluating LLM systems provide meaningful confidence estimates in complex reasoning, improving their trustworthiness and providing a practical framework for failure detection.
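The comparison described in the abstract (holistic vs. stepwise self-evaluation, ranked by AUC-ROC for failure detection) can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the authors' code: the `call_llm` helper, the prompt wording, and the min-score aggregation of step confidences are hypothetical choices; only the AUC-ROC computation via scikit-learn's `roc_auc_score` is standard.

```python
# Minimal sketch (not the paper's implementation): holistic vs. stepwise
# self-evaluation of a multi-step solution, compared by AUC-ROC on gold labels.
from sklearn.metrics import roc_auc_score

def call_llm(prompt: str) -> str:
    """Hypothetical judge-model call; plug in your own LLM client here."""
    raise NotImplementedError("replace with an actual LLM call")

def holistic_confidence(question: str, steps: list[str]) -> float:
    """One confidence score for the entire reasoning chain."""
    prompt = (
        f"Question: {question}\nSolution:\n" + "\n".join(steps) +
        "\nOn a scale of 0 to 1, how likely is this solution correct? Reply with a number."
    )
    return float(call_llm(prompt))

def stepwise_confidence(question: str, steps: list[str]) -> float:
    """Score each step separately, then aggregate (here: the weakest step dominates)."""
    scores = []
    for i in range(1, len(steps) + 1):
        prompt = (
            f"Question: {question}\nSteps so far:\n" + "\n".join(steps[:i]) +
            f"\nOn a scale of 0 to 1, how likely is step {i} correct? Reply with a number."
        )
        scores.append(float(call_llm(prompt)))
    return min(scores)

def auc_for_scorer(examples, scorer) -> float:
    """`examples` holds (question, steps, is_correct) triples with gold correctness labels."""
    y_true = [int(correct) for _, _, correct in examples]
    y_score = [scorer(q, steps) for q, steps, _ in examples]
    return roc_auc_score(y_true, y_score)
```

Taking the minimum step score is just one plausible aggregation; averaging or multiplying step confidences is equally reasonable, and the paper may use a different scheme.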
Related papers
- Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads [104.9566359759396]
We propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores. Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification.
arXiv Detail & Related papers (2025-11-09T03:38:29Z) - BrowseConf: Confidence-Guided Test-Time Scaling for Web Agents [58.05949210993854]
We investigate whether search agents have the ability to communicate their own confidence through verbalized confidence scores after long sequences of actions. We propose Test-Time Scaling (TTS) methods that use confidence scores to determine answer quality, encouraging the model to try again until it reaches a satisfactory confidence level (a generic sketch of this retry pattern appears after the last entry in this list).
arXiv Detail & Related papers (2025-10-27T15:58:51Z) - Enhancing Uncertainty Estimation in LLMs with Expectation of Aggregated Internal Belief [6.1929548590367505]
Large Language Models (LLMs) have achieved remarkable success across a wide range of natural language tasks, but often exhibit overconfidence and generate plausible yet incorrect answers. This overconfidence poses significant challenges for reliable uncertainty estimation and safe deployment. We propose a novel self-evaluation-based calibration method that leverages the internal hidden states of LLMs to derive more accurate confidence scores.
arXiv Detail & Related papers (2025-09-01T15:50:10Z) - Mind the Generation Process: Fine-Grained Confidence Estimation During LLM Generation [63.49409574310576]
Large language models (LLMs) exhibit overconfidence, assigning high confidence scores to incorrect predictions. We introduce FineCE, a novel confidence estimation method that delivers accurate, fine-grained confidence scores during text generation. Our code and all baselines used in the paper are available on GitHub.
arXiv Detail & Related papers (2025-08-16T13:29:35Z) - MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration [2.1824579248418017]
We present MMBoundary, a novel framework that advances the knowledge boundary awareness of MLLMs through reasoning step confidence calibration. In addition to supervised fine-tuning, we introduce a reinforcement learning stage with multiple reward functions for further aligning model knowledge. Empirical results show that MMBoundary significantly outperforms existing methods across diverse domain datasets and metrics.
arXiv Detail & Related papers (2025-05-29T08:14:40Z) - PredictaBoard: Benchmarking LLM Score Predictability [50.47497036981544]
Large Language Models (LLMs) often fail unpredictably. This poses a significant challenge to ensuring their safe deployment. We present PredictaBoard, a novel collaborative benchmarking framework.
arXiv Detail & Related papers (2025-02-20T10:52:38Z) - Do LLMs estimate uncertainty well in instruction-following? [9.081508933326644]
Large language models (LLMs) could be valuable personal AI agents across various domains, provided they can precisely follow user instructions. We present the first systematic evaluation of the uncertainty estimation abilities of LLMs in the context of instruction-following. Our findings show that existing uncertainty methods struggle, particularly when models make subtle errors in instruction following.
arXiv Detail & Related papers (2024-10-18T16:32:10Z) - Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge [84.34545223897578]
Despite the excellence of LLM judges in many domains, potential issues remain under-explored, undermining their reliability and the scope of their utility.
We identify 12 key potential biases and propose a new automated bias quantification framework, CALM, which quantifies and analyzes each type of bias in LLM-as-a-Judge.
Our work highlights the need for stakeholders to address these issues and reminds users to exercise caution in LLM-as-a-Judge applications.
arXiv Detail & Related papers (2024-10-03T17:53:30Z) - Confidence Estimation for LLM-Based Dialogue State Tracking [9.305763502526833]
Estimation of a model's confidence on its outputs is critical for Conversational AI systems based on large language models (LLMs).
We provide an exhaustive exploration of methods, including approaches proposed for open- and closed-weight LLMs.
Our findings suggest that fine-tuning open-weight LLMs can result in enhanced AUC performance, indicating better confidence score calibration.
arXiv Detail & Related papers (2024-09-15T06:44:26Z) - Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models. It addresses two key challenges: the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z) - Evaluating Uncertainty-based Failure Detection for Closed-Loop LLM Planners [10.746821861109176]
Large Language Models (LLMs) have demonstrated remarkable performance as zero-shot task planners for robotic tasks. However, the open-loop nature of previous works makes LLM-based planning error-prone and fragile. In this work, we introduce a framework for closed-loop LLM-based planning called KnowLoop, backed by an uncertainty-based MLLM failure detector.
arXiv Detail & Related papers (2024-06-01T12:52:06Z) - Think Twice Before Trusting: Self-Detection for Large Language Models through Comprehensive Answer Reflection [90.71323430635593]
We propose a novel self-detection paradigm that considers the comprehensive answer space beyond LLM-generated answers.
Building upon this paradigm, we introduce a two-step framework, which first instructs the LLM to reflect and provide justifications for each candidate answer.
This framework can be seamlessly integrated with existing approaches for superior self-detection.
arXiv Detail & Related papers (2024-03-15T02:38:26Z)
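The BrowseConf entry above describes confidence-guided test-time scaling only at a high level; the sketch below shows the generic retry-until-confident pattern it implies. The function name, threshold, and `ask_llm` interface are assumptions made for illustration, not BrowseConf's actual method.

```python
# Minimal sketch (an assumption, not BrowseConf's implementation) of
# confidence-guided test-time scaling: re-attempt a task until the model's
# verbalized confidence clears a threshold or the retry budget runs out.
def answer_with_retries(task: str, ask_llm, max_attempts: int = 3,
                        threshold: float = 0.8):
    """`ask_llm(task)` is a hypothetical callable returning (answer, confidence in [0, 1])."""
    best_answer, best_conf = None, -1.0
    for _ in range(max_attempts):
        answer, conf = ask_llm(task)
        if conf > best_conf:               # keep the most confident attempt seen so far
            best_answer, best_conf = answer, conf
        if conf >= threshold:              # satisfactory confidence: stop early
            break
    return best_answer, best_conf
```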