Task-Awareness Improves LLM Generations and Uncertainty
- URL: http://arxiv.org/abs/2601.21500v1
- Date: Thu, 29 Jan 2026 10:16:23 GMT
- Title: Task-Awareness Improves LLM Generations and Uncertainty
- Authors: Tim Tomov, Dominik Fuchsgruber, Stephan Günnemann
- Abstract summary: Bayes-optimal responses consistently outperform standard decoding methods like beam search. Our decision-theoretic framework is applicable to any problem that admits a latent response structure.
- Score: 48.857040212979484
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In many applications of LLMs, natural language responses often have an underlying structure such as representing discrete labels, numerical values, or graphs. Yet, existing decoding and uncertainty estimation methods operate only in language space and largely disregard structural information. We address this by modeling LLM outputs directly in a task-dependent latent structure. By equipping this structure with a dissimilarity measure, we can compute Bayes-optimal responses. These are not selected from sampled generations but are newly synthesized by combining individual responses in the latent space. Across different tasks, Bayes-optimal responses consistently outperform standard decoding methods like beam search. Moreover, quantifying uncertainty via the induced Bayesian risk captures variations in terms of the latent structure and improves alignment with output quality and correctness. Our decision-theoretic framework is applicable to any problem that admits a latent response structure and enables reliable task-aware LLM predictions.
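To make the idea of a Bayes-optimal response and its induced risk concrete, here is a minimal sketch for two of the latent structures the abstract mentions, discrete labels and numerical values. It is an illustration under stated assumptions, not the authors' implementation: it assumes the sampled generations have already been parsed into labels or numbers, it uses the empirical distribution of those samples as a stand-in for the model's distribution, and the function names `bayes_optimal_label` and `bayes_optimal_number` are hypothetical.

```python
from collections import Counter
from statistics import mean


def bayes_optimal_label(samples):
    """Under 0-1 loss, the risk minimizer is the most frequent label.

    Returns the label and its estimated Bayesian risk, i.e. the
    fraction of samples that disagree with it.
    """
    counts = Counter(samples)
    label, count = counts.most_common(1)[0]
    risk = 1.0 - count / len(samples)
    return label, risk


def bayes_optimal_number(samples):
    """Under squared error, the risk minimizer is the sample mean.

    The estimated Bayesian risk is the mean squared deviation from
    that mean (the population variance of the samples).
    """
    center = mean(samples)
    risk = mean((x - center) ** 2 for x in samples)
    return center, risk


if __name__ == "__main__":
    # Labels parsed from four hypothetical generations.
    print(bayes_optimal_label(["A", "B", "A", "A"]))
    # Numbers parsed from two hypothetical generations; the result
    # need not coincide with any individual sample.
    print(bayes_optimal_number([2.0, 4.0]))
```

In the numerical case the returned value can differ from every sampled answer, which mirrors the abstract's point that Bayes-optimal responses are synthesized in the latent space rather than selected from the samples. The risk value serves as a simple task-aware uncertainty score: it is low when the samples agree in the latent structure and high when they do not.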
Related papers
- Uncertainty Quantification for Multimodal Large Language Models with Incoherence-adjusted Semantic Volume [45.38219855706969]
We introduce UMPIRE, a training-free uncertainty quantification framework for Multimodal Large Language Models (MLLMs). UMPIRE computes the incoherence-adjusted semantic volume of sampled MLLM responses for a given task instance. We show that UMPIRE consistently outperforms baseline metrics in error detection and uncertainty calibration across image, audio, and video-text benchmarks.
arXiv Detail & Related papers (2026-02-27T17:18:42Z)
- Operational Robustness of LLMs on Code Generation [2.9232837969697965]
It is now common practice in software development for large language models (LLMs) to be used to generate program code. This paper is concerned in particular with how sensitive LLMs are to variations in descriptions of the coding tasks. Existing techniques for evaluating this robustness are unsuitable for code generation because the input data space of natural language descriptions is discrete.
arXiv Detail & Related papers (2026-02-21T11:21:13Z)
- FORESTLLM: Large Language Models Make Random Forest Great on Few-shot Tabular Learning [20.27406245916013]
We propose a novel framework that unifies the structural inductive biases of decision forests with the semantic reasoning capabilities of large language models (LLMs). Our method is two-fold. First, we introduce a semantic splitting criterion in which the LLM evaluates candidate partitions based on their coherence over both labeled and unlabeled data, enabling the induction of more robust and generalizable tree structures under few-shot supervision. Second, we propose a one-time in-context inference mechanism for leaf node stabilization, where the LLM distills the decision path and its supporting examples into a concise, deterministic prediction, replacing noisy empirical estimates with semantically informed outputs.
arXiv Detail & Related papers (2026-01-16T14:08:51Z)
- Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads [104.9566359759396]
We propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores. Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification.
arXiv Detail & Related papers (2025-11-09T03:38:29Z)
- Automatic Posology Structuration : What role for LLMs? [1.0445560141983634]
We explore the use of Large Language Models (LLMs) to convert free-text posologies into structured formats. Our results show that while prompting improves performance, only fine-tuned LLMs match the accuracy of the baseline. Based on this, we propose a hybrid pipeline that routes low-confidence cases from NERL to the LLM, selecting outputs based on confidence scores.
arXiv Detail & Related papers (2025-06-24T11:25:21Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- Kernel Language Entropy: Fine-grained Uncertainty Quantification for LLMs from Semantic Similarities [79.9629927171974]
Uncertainty in Large Language Models (LLMs) is crucial for applications where safety and reliability are important.
We propose Kernel Language Entropy (KLE), a novel method for uncertainty estimation in white- and black-box LLMs.
arXiv Detail & Related papers (2024-05-30T12:42:05Z)
- Enhanced Language Model Truthfulness with Learnable Intervention and Uncertainty Expression [19.69104070561701]
Large language models (LLMs) can generate long-form and coherent text, yet they often hallucinate facts.
We propose LITO, a Learnable Intervention method for Truthfulness Optimization.
Experiments on multiple LLMs and question-answering datasets demonstrate that LITO improves truthfulness while preserving task accuracy.
arXiv Detail & Related papers (2024-05-01T03:50:09Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
- Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling [69.83976050879318]
In large language models (LLMs), identifying sources of uncertainty is an important step toward improving reliability, trustworthiness, and interpretability.
In this paper, we introduce an uncertainty decomposition framework for LLMs, called input clarification ensembling.
Our approach generates a set of clarifications for the input, feeds them into an LLM, and ensembles the corresponding predictions.
arXiv Detail & Related papers (2023-11-15T05:58:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences.