Related papers: API Is Enough: Conformal Prediction for Large Language Models Without Logit-Access

API Is Enough: Conformal Prediction for Large Language Models Without Logit-Access

URL: http://arxiv.org/abs/2403.01216v2
Date: Thu, 4 Apr 2024 02:15:39 GMT
Title: API Is Enough: Conformal Prediction for Large Language Models Without Logit-Access
Authors: Jiayuan Su, Jing Luo, Hongwei Wang, Lu Cheng,
Abstract summary: This study aims to address the pervasive challenge of quantifying uncertainty in large language models (LLMs) without logit-access. Existing Conformal Prediction (CP) methods for LLMs typically assume access to the logits, which are unavailable for some API-only LLMs. We introduce a novel CP method that is tailored for API-only LLMs without logit-access; (2) minimizes the size of prediction sets; and (3) ensures a statistical guarantee of the user-defined coverage.
Score: 5.922444371605447
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This study aims to address the pervasive challenge of quantifying uncertainty in large language models (LLMs) without logit-access. Conformal Prediction (CP), known for its model-agnostic and distribution-free features, is a desired approach for various LLMs and data distributions. However, existing CP methods for LLMs typically assume access to the logits, which are unavailable for some API-only LLMs. In addition, logits are known to be miscalibrated, potentially leading to degraded CP performance. To tackle these challenges, we introduce a novel CP method that (1) is tailored for API-only LLMs without logit-access; (2) minimizes the size of prediction sets; and (3) ensures a statistical guarantee of the user-defined coverage. The core idea of this approach is to formulate nonconformity measures using both coarse-grained (i.e., sample frequency) and fine-grained uncertainty notions (e.g., semantic similarity). Experimental results on both close-ended and open-ended Question Answering tasks show our approach can mostly outperform the logit-based CP baselines.

Related papers

Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads [104.9566359759396]
We propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores.<n>Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification.
arXiv Detail & Related papers (2025-11-09T03:38:29Z)
Clotho: Measuring Task-Specific Pre-Generation Test Adequacy for LLM Inputs [6.862079218077768]
Testing Large Language Models on specific tasks is difficult and costly.<n>A key challenge is to assess input adequacy in a way that reflects the demands of the task.<n>We introduce CLOTHO, a task-specific, pre-generation adequacy measure.
arXiv Detail & Related papers (2025-09-22T02:34:09Z)
Conformal Prediction Beyond the Seen: A Missing Mass Perspective for Uncertainty Quantification in Generative Models [20.810300785340072]
Conformal Prediction with Query Oracle (CPQ) is a framework characterizing the optimal interplay between these objectives.<n>Our algorithm is built on two core principles: one governs the optimal query policy, and the other defines the optimal mapping from queried samples to prediction sets.
arXiv Detail & Related papers (2025-06-05T18:26:14Z)
Estimating Item Difficulty Using Large Language Models and Tree-Based Machine Learning Algorithms [0.0]
Estimating item difficulty through field-testing is often resource-intensive and time-consuming. The present research examines the feasibility of using an Large Language Models (LLMs) to predict item difficulty for K-5 mathematics and reading assessment items.
arXiv Detail & Related papers (2025-04-09T00:04:07Z)
Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
evaluating constraint on every token can be prohibitively expensive. LCD can distort the global distribution over strings, sampling tokens based only on local information. We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z)
Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs [60.881609323604685]
Large Language Models (LLMs) accessed via black-box APIs introduce a trust challenge. Users pay for services based on advertised model capabilities. providers may covertly substitute the specified model with a cheaper, lower-quality alternative to reduce operational costs. This lack of transparency undermines fairness, erodes trust, and complicates reliable benchmarking.
arXiv Detail & Related papers (2025-04-07T03:57:41Z)
Forking Paths in Neural Text Generation [14.75166317633176]
We develop a novel approach to representing uncertainty dynamics across individual tokens of text generation. We use our method to analyze LLM responses on 7 different tasks across 4 domains. We find many examples of forking tokens, including surprising ones such as punctuation marks.
arXiv Detail & Related papers (2024-12-10T22:57:57Z)
Predicting Emergent Capabilities by Finetuning [98.9684114851891]
We find that finetuning language models can shift the point in scaling at which emergence occurs towards less capable models. We validate this approach using four standard NLP benchmarks. We find that, in some cases, we can accurately predict whether models trained with up to 4x more compute have emerged.
arXiv Detail & Related papers (2024-11-25T01:48:09Z)
Few-shot Open Relation Extraction with Gaussian Prototype and Adaptive Margin [15.118656235473921]
Few-shot relation extraction with none-of-the-above (FsRE with NOTA) aims at predicting labels in few-shot scenarios with unknown classes. We propose a novel framework based on Gaussian prototype and adaptive margin named GPAM for FsRE with NOTA.
arXiv Detail & Related papers (2024-10-27T03:16:09Z)
Horizon-Length Prediction: Advancing Fill-in-the-Middle Capabilities for Code Generation with Lookahead Planning [17.01133761213624]
We propose Horizon-Length Prediction (HLP), a novel training objective that teaches models to predict the number of remaining middle tokens at each step. HLP significantly improves FIM performance by up to 24% relatively on diverse benchmarks, across file-level and repository-level, and without resorting to unrealistic post-processing methods.
arXiv Detail & Related papers (2024-10-04T02:53:52Z)
Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method [108.56493934296687]
We introduce a divergence-based calibration method, inspired by the divergence-from-randomness concept, to calibrate token probabilities for pretraining data detection. We have developed a Chinese-language benchmark, PatentMIA, to assess the performance of detection approaches for LLMs on Chinese text.
arXiv Detail & Related papers (2024-09-23T07:55:35Z)
Unconditional Truthfulness: Learning Unconditional Uncertainty of Large Language Models [104.55763564037831]
We train a regression model that leverages attention maps, probabilities on the current generation step, and recurrently computed uncertainty scores from previously generated tokens.<n>Our evaluation shows that the proposed method is highly effective for selective generation, achieving substantial improvements over rivaling unsupervised and supervised approaches.
arXiv Detail & Related papers (2024-08-20T09:42:26Z)
Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode. We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
Semantic Density: Uncertainty Quantification for Large Language Models through Confidence Measurement in Semantic Space [14.715989394285238]
Existing Large Language Models (LLMs) do not have an inherent functionality to provide the users with an uncertainty/confidence metric for each response it generates. A new framework is proposed in this paper to address these issues. Semantic density extracts uncertainty/confidence information for each response from a probability distribution perspective in semantic space.
arXiv Detail & Related papers (2024-05-22T17:13:49Z)
One-bit Submission for Locally Private Quasi-MLE: Its Asymptotic Normality and Limitation [3.050919759387985]
Local differential privacy(LDP) is an information-theoretic privacy definition suitable for statistical surveys that involve an untrusted data curator. The existing method to build LDP QMLE is difficult to implement for a large-scale survey system in the real world due to long waiting time, expensive communication cost, and the boundedness assumption of derivative of a log-likelihood function. We provided an alternative LDP protocol without those issues, which is potentially much easily deployable to a large-scale survey.
arXiv Detail & Related papers (2022-02-15T05:04:59Z)
Learning, compression, and leakage: Minimising classification error via meta-universal compression principles [87.054014983402]
A promising group of compression techniques for learning scenarios is normalised maximum likelihood (NML) coding. Here we consider a NML-based decision strategy for supervised classification problems, and show that it attains PAC learning when applied to a wide variety of models. We show that the misclassification rate of our method is upper bounded by the maximal leakage, a recently proposed metric to quantify the potential of data leakage in privacy-sensitive scenarios.
arXiv Detail & Related papers (2020-10-14T20:03:58Z)
Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model [50.38446482252857]
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator) We first consider $gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $mathcalS$ and action space $mathcalA$. We prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level.
arXiv Detail & Related papers (2020-05-26T17:53:18Z)
Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches, is to update the prototype of each class with the mean of the most confident query examples. We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries. We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.