Interpretable Probability Estimation with LLMs via Shapley Reconstruction
- URL: http://arxiv.org/abs/2601.09151v1
- Date: Wed, 14 Jan 2026 04:45:36 GMT
- Title: Interpretable Probability Estimation with LLMs via Shapley Reconstruction
- Authors: Yang Nan, Qihao Wen, Jiahao Wang, Pengfei He, Ravi Tandon, Yong Ge, Han Xu
- Abstract summary: PRISM: Probability Reconstruction via Shapley Measures is a framework that brings transparency and precision to probability estimation. In our experiments, we demonstrate that PRISM improves predictive accuracy over direct prompting. Our case studies visualize how individual factors shape the final estimate, helping build trust in LLM-based decision support systems.
- Score: 21.224475598322538
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) demonstrate potential to estimate the probability of uncertain events by leveraging their extensive knowledge and reasoning capabilities. This ability can be applied to support intelligent decision-making across diverse fields, such as financial forecasting and preventive healthcare. However, directly prompting LLMs for probability estimation faces significant challenges: their outputs are often noisy, and the underlying prediction process is opaque. In this paper, we propose PRISM: Probability Reconstruction via Shapley Measures, a framework that brings transparency and precision to LLM-based probability estimation. PRISM decomposes an LLM's prediction by quantifying the marginal contribution of each input factor using Shapley values. These factor-level contributions are then aggregated to reconstruct a calibrated final estimate. In our experiments, we demonstrate that PRISM improves predictive accuracy over direct prompting and other baselines, across multiple domains including finance, healthcare, and agriculture. Beyond performance, PRISM provides a transparent prediction pipeline: our case studies visualize how individual factors shape the final estimate, helping build trust in LLM-based decision support systems.
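The abstract describes the decomposition only at a high level. As a minimal sketch (not PRISM's actual pipeline), the snippet below computes exact Shapley values over a small factor set, treating an LLM query as the set function $v$: for factor $i$, $\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n-|S|-1)!}{n!}\big(v(S \cup \{i\}) - v(S)\big)$. The `estimate_prob` callable is a hypothetical placeholder for prompting the LLM with a subset of factors, and exact enumeration only scales to a handful of factors.

```python
from itertools import combinations
from math import factorial

def shapley_contributions(factors, estimate_prob):
    """Exact Shapley values over a small set of input factors.

    `estimate_prob(subset)` is a hypothetical stand-in for querying an
    LLM with only that subset of factors present; PRISM's actual
    prompting and calibration steps are not reproduced here.
    """
    n = len(factors)
    phi = {f: 0.0 for f in factors}
    for f in factors:
        others = [g for g in factors if g != f]
        for k in range(n):  # coalition sizes 0 .. n-1
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                phi[f] += weight * (estimate_prob(s | {f}) - estimate_prob(s))
    return phi

# Toy check: a synthetic "LLM" whose estimate is additive in the factors,
# so each factor's Shapley value recovers its additive effect exactly.
effects = {"rainfall": 0.15, "soil_quality": 0.10, "pest_report": -0.20}
baseline = 0.50

def estimate(subset):
    return baseline + sum(effects[f] for f in subset)

phi = shapley_contributions(list(effects), estimate)
reconstructed = baseline + sum(phi.values())  # efficiency: equals estimate over all factors
print(phi, reconstructed)
```

By the efficiency property of Shapley values, the baseline plus the summed contributions equals the full-information estimate, which is what permits the additive, factor-by-factor visualizations the case studies describe.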
Related papers
- On Calibration of Large Language Models: From Response To Capability [66.59139960234326]
Large language models (LLMs) are widely deployed as general-purpose problem solvers. We introduce capability calibration, which targets the model's expected accuracy on a query. Our results demonstrate that capability-calibrated confidence improves pass@$k$ prediction and inference budget allocation.
arXiv Detail & Related papers (2026-02-14T01:07:45Z)
- LLM Performance Predictors: Learning When to Escalate in Hybrid Human-AI Moderation Systems [5.7001352660257005]
We propose a framework for supervised uncertainty quantification in content moderation systems. We show that our method enables cost-aware selective classification in real-world human-AI moderation. This work establishes a principled framework for uncertainty-aware, scalable and responsible human-AI moderation.
arXiv Detail & Related papers (2026-01-11T17:46:49Z)
- The Forecast Critic: Leveraging Large Language Models for Poor Forecast Identification [74.64864354503204]
We propose The Forecast Critic, a system that leverages Large Language Models (LLMs) for automated forecast monitoring. We evaluate the ability of LLMs to assess time series forecast quality. We present three experiments on both synthetic and real-world forecasting data.
arXiv Detail & Related papers (2025-12-12T21:59:53Z)
- Conformal P-Value in Multiple-Choice Question Answering Tasks with Provable Risk Control [0.0]
This study introduces a significance testing-enhanced conformal prediction (CP) framework to improve the trustworthiness of large language models (LLMs) in multiple-choice question answering (MCQA). CP provides statistically rigorous marginal coverage guarantees for prediction sets, and significance testing offers established statistical rigor. This work establishes a statistical framework for trustworthy LLM deployment in high-stakes QA applications.
arXiv Detail & Related papers (2025-08-07T16:46:47Z)
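The entry above is only a summary; as a hedged illustration of the split-conformal recipe such work builds on (not the paper's specific significance-testing construction), the sketch below forms MCQA prediction sets from conformal p-values, assuming a nonconformity score of one minus the model's probability for an answer:

```python
import numpy as np

def conformal_pvalue(cal_scores, test_score):
    """Conformal p-value: rank of the test nonconformity score among
    the calibration scores (higher score = less conforming)."""
    n = len(cal_scores)
    return (1 + np.sum(cal_scores >= test_score)) / (n + 1)

def mcqa_prediction_set(cal_scores, option_probs, alpha=0.1):
    """Keep every answer option whose conformal p-value exceeds alpha.

    Under exchangeability of calibration and test data, the true option
    is retained with probability >= 1 - alpha (marginal coverage).
    """
    cal_scores = np.asarray(cal_scores)
    return [
        i for i, p in enumerate(option_probs)
        if conformal_pvalue(cal_scores, 1.0 - p) > alpha
    ]

# Toy usage with made-up calibration scores (1 - model prob of the true
# answer on held-out questions) and one test question's option probabilities.
cal = np.random.default_rng(0).uniform(size=500)
print(mcqa_prediction_set(cal, option_probs=[0.70, 0.20, 0.06, 0.04]))
```

With 500 uniform calibration scores and the option probabilities shown, the set keeps the two plausible options and drops the two whose nonconformity scores fall above roughly the calibration 90th percentile.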
- PredictaBoard: Benchmarking LLM Score Predictability [50.47497036981544]
Large Language Models (LLMs) often fail unpredictably. This poses a significant challenge to ensuring their safe deployment. We present PredictaBoard, a novel collaborative benchmarking framework.
arXiv Detail & Related papers (2025-02-20T10:52:38Z)
- Quantifying Prediction Consistency Under Fine-Tuning Multiplicity in Tabular LLMs [10.494477811252034]
Fine-tuning multiplicity can arise in Tabular LLMs on classification tasks. Our work formalizes this unique challenge of fine-tuning multiplicity in Tabular LLMs. We propose a novel measure to quantify consistency of individual predictions without expensive model retraining.
arXiv Detail & Related papers (2024-07-04T22:22:09Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- BIRD: A Trustworthy Bayesian Inference Framework for Large Language Models [52.46248487458641]
Predictive models often need to work with incomplete information in real-world tasks. Current large language models (LLMs) are insufficient for accurate estimations. We propose BIRD, a novel probabilistic inference framework.
arXiv Detail & Related papers (2024-04-18T20:17:23Z)
- Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models [24.445829787297658]
Large Language Models (LLMs) have demonstrated remarkable capabilities across various applications.
This study aims to scrutinize the validity of such probability-based evaluation methods within the context of using LLMs for Multiple Choice Questions (MCQs).
Our empirical investigation reveals that the prevalent probability-based evaluation method inadequately aligns with generation-based prediction.
arXiv Detail & Related papers (2024-02-21T15:58:37Z)
- Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling [69.83976050879318]
In large language models (LLMs), identifying sources of uncertainty is an important step toward improving reliability, trustworthiness, and interpretability.
In this paper, we introduce an uncertainty decomposition framework for LLMs, called input clarification ensembling.
Our approach generates a set of clarifications for the input, feeds them into an LLM, and ensembles the corresponding predictions.
arXiv Detail & Related papers (2023-11-15T05:58:35Z)
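The last entry's mechanism is terse; the snippet below is a rough sketch of the ensembling idea (the paper's exact uncertainty decomposition may differ): average predictions across clarified rewrites of an ambiguous input, and split total entropy into a within-clarification term and a disagreement term attributable to input ambiguity. `generate_clarifications` and `predict_probs` are hypothetical stand-ins for LLM calls.

```python
import numpy as np

def clarification_ensemble(question, generate_clarifications, predict_probs):
    """Ensemble predictions over clarified rewrites of an ambiguous input.

    `generate_clarifications(question)` -> list of disambiguated rewrites;
    `predict_probs(text)` -> probability vector over answers.
    Both are hypothetical stand-ins for LLM calls.
    """
    preds = np.array([predict_probs(c) for c in generate_clarifications(question)])
    mean = preds.mean(axis=0)  # ensembled prediction

    def entropy(p):
        p = np.clip(p, 1e-12, 1.0)
        return float(-(p * np.log(p)).sum())

    total = entropy(mean)                                 # total uncertainty
    within = float(np.mean([entropy(p) for p in preds]))  # per-clarification uncertainty
    disagreement = total - within                         # uncertainty from input ambiguity
    return mean, total, within, disagreement

# Toy usage: two canned readings of an ambiguous question, fake predictor.
def clarify(q):
    return [q + " (reading A)", q + " (reading B)"]

def fake_probs(text):
    return np.array([0.9, 0.1]) if "(reading A)" in text else np.array([0.2, 0.8])

print(clarification_ensemble("Is the bank open?", clarify, fake_probs))
```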