Towards Reducible Uncertainty Modeling for Reliable Large Language Model Agents
- URL: http://arxiv.org/abs/2602.05073v1
- Date: Wed, 04 Feb 2026 21:47:40 GMT
- Title: Towards Reducible Uncertainty Modeling for Reliable Large Language Model Agents
- Authors: Changdae Oh, Seongheon Park, To Eun Kim, Jiatong Li, Wendi Li, Samuel Yeh, Xuefeng Du, Hamed Hassani, Paul Bogdan, Dawn Song, Sharon Li
- Abstract summary: Uncertainty quantification (UQ) for large language models (LLMs) is a key building block for safety guardrails of daily LLM applications. This paper presents the first general formulation of agent UQ that subsumes broad classes of existing UQ setups. We propose a novel perspective, a conditional uncertainty reduction process, that explicitly models reducible uncertainty over an agent's trajectory.
- Score: 72.26774492844167
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Uncertainty quantification (UQ) for large language models (LLMs) is a key building block for safety guardrails of daily LLM applications. Yet, even as LLM agents are increasingly deployed in highly complex tasks, most UQ research still centers on single-turn question-answering. We argue that UQ research must shift to realistic settings with interactive agents, and that a new principled framework for agent UQ is needed. This paper presents the first general formulation of agent UQ that subsumes broad classes of existing UQ setups. Under this formulation, we show that prior works implicitly treat LLM UQ as an uncertainty accumulation process, a viewpoint that breaks down for interactive agents in an open world. In contrast, we propose a novel perspective, a conditional uncertainty reduction process, that explicitly models reducible uncertainty over an agent's trajectory by highlighting the "interactivity" of actions. From this perspective, we outline a conceptual framework that provides actionable guidance for designing UQ in LLM agent setups. Finally, we conclude with practical implications of agent UQ for frontier LLM development and domain-specific applications, as well as remaining open problems.
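The contrast the abstract draws between uncertainty "accumulation" and "conditional reduction" can be illustrated with a minimal toy sketch. The function names and numbers below are invented for illustration and are not part of the paper's formulation: one view simply sums per-step uncertainties over a trajectory, while the other treats each interactive action as yielding information gain that shrinks the remaining reducible uncertainty.

```python
def accumulated_uncertainty(step_entropies):
    # "Accumulation" view (toy): per-step uncertainties add up over the
    # trajectory, so longer trajectories can only become more uncertain.
    return sum(step_entropies)

def reduced_uncertainty(prior_entropy, info_gains):
    # "Conditional reduction" view (toy): each interactive action yields
    # information gain that lowers the remaining reducible uncertainty,
    # floored at zero.
    u = prior_entropy
    for gain in info_gains:
        u = max(0.0, u - gain)
    return u

# Toy trajectory of three steps. Under accumulation, three steps of
# 0.5 nats each yield 1.5 nats; under reduction, three observations
# each worth 0.4 nats of gain shrink a 1.5-nat prior to 0.3 nats.
acc = accumulated_uncertainty([0.5, 0.5, 0.5])
red = reduced_uncertainty(1.5, [0.4, 0.4, 0.4])
```

The toy makes the qualitative difference visible: under accumulation, interaction can never help, whereas under conditional reduction an informative action strictly decreases the agent's remaining uncertainty.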
Related papers
- LLM Agents Beyond Utility: An Open-Ended Perspective [50.809163251551894]
We augment a pretrained LLM agent with the ability to generate its own tasks, accumulate knowledge, and interact extensively with its environment. It can reliably follow complex multi-step instructions, store and reuse information across runs, and propose and solve its own tasks. It remains sensitive to prompt design, prone to repetitive task generation, and unable to form self-representations.
arXiv Detail & Related papers (2025-10-16T10:46:54Z)
- SIMBA UQ: Similarity-Based Aggregation for Uncertainty Quantification in Large Language Models [17.805673311465295]
Uncertainty quantification (UQ) provides measures of uncertainty. Black-box UQ methods do not require access to internal model information. We propose a high-level non-verbalized similarity-based aggregation framework.
arXiv Detail & Related papers (2025-10-10T17:22:53Z)
- AgentAsk: Multi-Agent Systems Need to Ask [26.13279490836716]
Multi-agent systems built on large language models (LLMs) promise enhanced problem-solving capabilities through collaborative division of labor. We propose AgentAsk, a lightweight and plug-and-play clarification module that treats every inter-agent message as a potential failure point and inserts minimally necessary questions to arrest error propagation. AgentAsk consistently improves accuracy and robustness over public multi-agent implementations while keeping overhead minimal, with latency and extra cost all less than 5%.
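The idea of treating every inter-agent message as a potential failure point can be sketched as a simple gate. This is a hypothetical toy, not AgentAsk's actual module: the function name, ambiguity markers, and threshold below are all invented for illustration.

```python
def needs_clarification(message, confidence, threshold=0.7):
    # Toy gate (not the paper's method): flag a message for a clarifying
    # question when the sender's confidence is low or the message
    # contains simple ambiguity markers.
    markers = ("maybe", "unclear", "?")
    ambiguous = any(m in message.lower() for m in markers)
    return confidence < threshold or ambiguous
```

In a real system the gate would sit between agents, so that only messages it flags trigger a minimal clarifying question, keeping added latency small.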
arXiv Detail & Related papers (2025-10-08T22:36:05Z)
- A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA [65.38186593873313]
Multi-Hop Question Answering (MHQA) requires integrating dispersed, interdependent evidence through sequential reasoning under noise. We introduce a proof-of-concept multi-call framework for MHQA, InfoQA. We construct a stringent and noise-rich benchmark to validate our theory and framework.
arXiv Detail & Related papers (2025-09-25T14:11:57Z)
- AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios [51.46347732659174]
Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications. AgentIF is the first benchmark for systematically evaluating LLM instruction following ability in agentic scenarios.
arXiv Detail & Related papers (2025-05-22T17:31:10Z)
- Reinforcing Question Answering Agents with Minimalist Policy Gradient Optimization [80.09112808413133]
Mujica couples a planner that decomposes questions into an acyclic graph of subquestions with a worker that resolves them via retrieval and reasoning. MyGO is a novel reinforcement learning method that replaces traditional policy updates with Maximum Likelihood Estimation. Empirical results across multiple datasets demonstrate the effectiveness of Mujica-MyGO in enhancing multi-hop QA performance.
arXiv Detail & Related papers (2025-05-20T18:33:03Z)
- A Survey of Large Language Model Agents for Question Answering [0.7416846035207727]
This paper surveys the development of large language model (LLM)-based agents for question answering (QA). Traditional agents face significant limitations, including substantial data requirements and difficulty in generalizing to new environments. LLM-based agents address these challenges by leveraging LLMs as their core reasoning engine.
arXiv Detail & Related papers (2025-03-24T23:39:44Z)
- CoT-UQ: Improving Response-wise Uncertainty Quantification in LLMs with Chain-of-Thought [10.166370877826486]
Large language models (LLMs) excel in many tasks but struggle to accurately quantify uncertainty in their generated responses. Existing uncertainty quantification (UQ) methods for LLMs are primarily prompt-wise rather than response-wise. We propose CoT-UQ, a response-wise UQ framework that integrates LLMs' inherent reasoning capabilities through Chain-of-Thought.
arXiv Detail & Related papers (2025-02-24T14:48:06Z)
- Uncertainty Quantification for LLMs through Minimum Bayes Risk: Bridging Confidence and Consistency [66.96286531087549]
Uncertainty quantification (UQ) methods for Large Language Models (LLMs) encompass a variety of approaches. We propose a novel approach to integrating model confidence with output consistency, resulting in a family of efficient and robust UQ methods. We evaluate our approach across various tasks such as question answering, abstractive summarization, and machine translation.
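The general idea of combining model confidence with output consistency can be illustrated with a toy sketch. This is not the paper's actual Minimum Bayes Risk formulation; the function, the exact-match similarity, and the example values are assumptions made for illustration: uncertainty is taken as one minus the confidence-weighted average pairwise agreement among sampled answers.

```python
def confidence_consistency_uncertainty(answers, confidences):
    # Toy sketch: weight each pair of sampled answers by the product of
    # their confidences, score agreement by exact match, and report
    # uncertainty as one minus the weighted agreement.
    n = len(answers)
    num, den = 0.0, 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            w = confidences[i] * confidences[j]
            num += w * (answers[i] == answers[j])
            den += w
    agreement = num / den if den else 0.0
    return 1.0 - agreement
```

Under this toy score, a set of identical high-confidence answers yields zero uncertainty, while disagreement among confident samples pushes the score toward one.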
arXiv Detail & Related papers (2025-02-07T14:30:12Z)
- Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [83.90988015005934]
Uncertainty quantification is a key element of machine learning applications. We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines. We conduct a large-scale empirical investigation of UQ and normalization techniques across eleven tasks, identifying the most effective approaches.
arXiv Detail & Related papers (2024-06-21T20:06:31Z)
- Formally Specifying the High-Level Behavior of LLM-Based Agents [24.645319505305316]
LLMs have emerged as promising tools for solving challenging problems without the need for task-specific finetuned models.
Currently, the design and implementation of such agents is ad hoc, as the wide variety of tasks that LLM-based agents may be applied to naturally means there can be no one-size-fits-all approach to agent design.
We propose a minimalistic generation framework that simplifies the process of building agents.
arXiv Detail & Related papers (2023-10-12T17:24:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.