Multi-LLM Adaptive Conformal Inference for Reliable LLM Responses
- URL: http://arxiv.org/abs/2602.01285v1
- Date: Sun, 01 Feb 2026 15:34:45 GMT
- Title: Multi-LLM Adaptive Conformal Inference for Reliable LLM Responses
- Authors: Kangjun Noh, Seongchan Lee, Ilmun Kim, Kyungwoo Song
- Abstract summary: We reformulate conformal inference in a multiplicative filtering setting, modeling factuality as a product of claim-level scores. Our method, Multi-LLM Adaptive Conformal Inference (MACI), leverages ensembles to produce more accurate factuality scores. Experiments show that MACI consistently achieves user-specified coverage with substantially higher retention and lower time cost than baselines.
- Score: 18.60553322553765
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ensuring factuality is essential for the safe use of Large Language Models (LLMs) in high-stakes domains such as medicine and law. Conformal inference provides distribution-free guarantees, but existing approaches are either overly conservative, discarding many true claims, or rely on adaptive error rates and simple linear models that fail to capture complex group structures. To address these challenges, we reformulate conformal inference in a multiplicative filtering setting, modeling factuality as a product of claim-level scores. Our method, Multi-LLM Adaptive Conformal Inference (MACI), leverages ensembles to produce more accurate factuality scores, which in our experiments led to higher retention, while validity is preserved through group-conditional calibration. Experiments show that MACI consistently achieves user-specified coverage with substantially higher retention and lower time cost than baselines. Our repository is available at https://github.com/MLAI-Yonsei/MACI
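As a rough illustration of the multiplicative filtering setting the abstract describes, the sketch below scores each claim with an ensemble, models response-level factuality as the product of claim-level scores, and sweeps a per-claim threshold on a labeled calibration set. The function names and the threshold sweep are assumptions for exposition; the paper's group-conditional calibration is not reproduced here.

```python
import numpy as np

def ensemble_claim_score(claim, scorers):
    """Average claim-level factuality scores across an ensemble of LLM scorers."""
    return float(np.mean([score(claim) for score in scorers]))

def response_factuality(claim_scores):
    """Multiplicative model from the abstract: response-level factuality is
    the product of claim-level scores."""
    return float(np.prod(claim_scores))

def calibrate_threshold(cal_set, alpha):
    """Split-conformal-style sweep (illustrative, not the authors' exact MACI
    procedure): choose the smallest per-claim threshold tau such that every
    retained claim is factual in at least a (1 - alpha) fraction of
    calibration responses."""
    for tau in np.linspace(0.0, 1.0, 201):
        ok = 0
        for claims in cal_set:  # claims: list of (score, is_factual) pairs
            kept = [factual for score, factual in claims if score >= tau]
            ok += all(kept)     # a single retained false claim breaks coverage
        if ok / len(cal_set) >= 1 - alpha:
            return float(tau)
    return 1.0  # degenerate fallback: retain nothing

def filter_claims(scored_claims, tau):
    """Keep only the claims whose ensemble factuality score clears tau."""
    return [claim for claim, score in scored_claims if score >= tau]
```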
Related papers
- Claim Automation using Large Language Model [0.0]
Large Language Models (LLMs) have achieved strong performance on general-purpose language tasks, but their deployment in regulated and data-sensitive domains, including insurance, remains limited. We propose a locally deployed, governance-aware language modeling component that generates structured corrective-action recommendations from unstructured claim narratives. We fine-tune pretrained LLMs using Low-Rank Adaptation (LoRA), scoping the model to an initial decision module within the claim processing pipeline to speed up claim adjusters' decisions.
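For readers unfamiliar with the technique, a minimal LoRA fine-tuning sketch using the Hugging Face peft library is shown below; the base model, target modules, and hyperparameters are placeholder assumptions, not choices from the paper.

```python
# Minimal LoRA sketch with Hugging Face peft (illustrative; the base model,
# target modules, and hyperparameters are assumptions, not the paper's).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"          # hypothetical base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    r=8,                                   # low-rank adapter dimension
    lora_alpha=16,                         # adapter scaling factor
    target_modules=["q_proj", "v_proj"],   # common attention-projection targets
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()         # only adapter weights are trainable
# Training on claim narratives would then proceed with a standard Trainer loop.
```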
arXiv Detail & Related papers (2026-02-18T20:01:12Z)
- Towards Comprehensive Stage-wise Benchmarking of Large Language Models in Fact-Checking [64.97768177044355]
Large Language Models (LLMs) are increasingly deployed in real-world fact-checking systems. We present FactArena, a fully automated arena-style evaluation framework. Our analyses reveal significant discrepancies between static claim-verification accuracy and end-to-end fact-checking competence.
arXiv Detail & Related papers (2026-01-06T02:51:56Z)
- MMDCP: A Distribution-free Approach to Outlier Detection and Classification with Coverage Guarantees and SCW-FDR Control [6.429952624399788]
We propose a unified framework for multi-class classification and outlier detection under label shift. The Modified Mahalanobis Distance Conformal Prediction (MMDCP) combines class-specific distance measures with full conformal prediction to construct a score function. We provide the first theoretical characterization of the gap between oracle and empirical conformal $p$-values, which ensures valid coverage and effective control of the class-wise false discovery rate (CW-FDR).
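A minimal sketch of the distance-to-conformal-p-value pipeline this summary describes, using a split-conformal shortcut rather than the paper's full conformal construction; the regularization constant and helper names are assumptions.

```python
import numpy as np

def fit_class_params(X_by_class):
    """Per-class mean and regularized precision matrix for Mahalanobis distances."""
    params = {}
    for c, X in X_by_class.items():
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])  # regularized
        params[c] = (mu, np.linalg.inv(cov))
    return params

def mahalanobis(x, mu, prec):
    d = x - mu
    return float(np.sqrt(d @ prec @ d))

def conformal_pvalue(x, X_cal_c, mu, prec):
    """Empirical conformal p-value for class c: rank the test point's distance
    against the class's calibration distances (larger distance -> smaller p)."""
    cal = np.array([mahalanobis(xi, mu, prec) for xi in X_cal_c])
    return (1 + np.sum(cal >= mahalanobis(x, mu, prec))) / (len(cal) + 1)

# Prediction set: classes whose p-value exceeds alpha; an empty set flags an outlier.
```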
arXiv Detail & Related papers (2025-11-15T03:48:44Z)
- Robust Uncertainty Quantification for Self-Evolving Large Language Models via Continual Domain Pretraining [7.344577590113121]
Conformal Prediction (CP) has shown promise in offering correctness guarantees for large language models. We introduce an adaptive rejection and non-exchangeable CP framework. Our framework enhances both the effectiveness and reliability of CP under continual domain pretraining (CDP) scenarios.
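Non-exchangeable CP is commonly implemented with weighted calibration quantiles; the sketch below shows that generic construction as a point of reference, with weights that down-weight stale, pre-shift examples. It is an assumption-laden stand-in, not the paper's adaptive-rejection framework.

```python
import numpy as np

def weighted_conformal_threshold(cal_scores, weights, alpha, test_weight=1.0):
    """Generic non-exchangeable conformal threshold: the (1 - alpha) quantile
    of calibration scores under normalized weights, reserving mass for the
    test point. Illustrative only; not the paper's exact rule."""
    order = np.argsort(cal_scores)
    scores = np.asarray(cal_scores, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    p = w / (w.sum() + test_weight)      # reserve mass for the test point
    cum = np.cumsum(p)
    hit = np.nonzero(cum >= 1 - alpha)[0]
    return scores[hit[0]] if hit.size else np.inf  # inf -> abstain / widest set
```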
arXiv Detail & Related papers (2025-10-27T02:15:51Z)
- Unsupervised Conformal Inference: Bootstrapping and Alignment to Control LLM Uncertainty [49.19257648205146]
We propose an unsupervised conformal inference framework for generation. Our gates achieve close-to-nominal coverage and provide tighter, more stable thresholds than split UCP. The result is a label-free, API-compatible gate for test-time filtering.
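One plausible reading of the bootstrapping step is averaging bootstrap quantiles of an uncertainty-score sample to stabilize a label-free gate; the sketch below illustrates that reading, and everything in it (score source, quantile level, averaging) is an assumption rather than the paper's recipe.

```python
import numpy as np

def bootstrap_threshold(scores, alpha, n_boot=1000, seed=0):
    """Average the (1 - alpha) quantile over bootstrap resamples of an
    unlabeled uncertainty-score sample, which stabilizes the gate relative
    to a single empirical quantile."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    qs = [np.quantile(rng.choice(scores, size=scores.size, replace=True), 1 - alpha)
          for _ in range(n_boot)]
    return float(np.mean(qs))

# Test-time gate: release a generation only if its uncertainty score falls
# below the calibrated threshold.
```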
arXiv Detail & Related papers (2025-09-26T23:40:47Z)
- CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
- COIN: Uncertainty-Guarding Selective Question Answering for Foundation Models with Provable Risk Guarantees [51.5976496056012]
COIN is an uncertainty-guarding selection framework that calibrates statistically valid thresholds to filter a single generated answer per question. COIN estimates the empirical error rate on a calibration set and applies confidence interval methods to establish a high-probability upper bound on the true error rate. We demonstrate COIN's robustness in risk control, strong test-time power in retaining admissible answers, and predictive efficiency under limited calibration data.
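The confidence-interval step described here can be realized with an exact binomial (Clopper-Pearson) upper bound, as in the sketch below; whether COIN uses this particular interval is an assumption.

```python
from scipy.stats import beta

def clopper_pearson_upper(n_errors, n, delta):
    """Exact binomial upper confidence bound: with probability >= 1 - delta,
    the true error rate lies below the returned value. One standard way to
    implement the confidence-interval step; COIN's construction may differ."""
    if n_errors >= n:
        return 1.0
    return float(beta.ppf(1 - delta, n_errors + 1, n - n_errors))

# Example: 12 errors among 400 calibration answers at delta = 0.05; accept
# the answer-retention rule only if this bound is within the risk budget.
upper = clopper_pearson_upper(12, 400, 0.05)
```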
arXiv Detail & Related papers (2025-06-25T07:04:49Z)
- Noise-Adaptive Conformal Classification with Marginal Coverage [53.74125453366155]
We introduce an adaptive conformal inference method capable of efficiently handling deviations from exchangeability caused by random label noise. We validate our method through extensive numerical experiments demonstrating its effectiveness on synthetic and real data sets.
arXiv Detail & Related papers (2025-01-29T23:55:23Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
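A toy version of this idea is to sample several explanation-then-answer chains and use agreement among the induced answers as a confidence score, as sketched below; `llm_sample` is an assumed callable, and the paper's clustering of explanations is not shown.

```python
from collections import Counter

def explanation_confidence(llm_sample, prompt, k=10):
    """Sample k explanation-then-answer chains and score confidence as the
    empirical stability of the most frequent answer. A simplified stand-in
    for the paper's explanation-distribution approach."""
    answers = [llm_sample(prompt)[1] for _ in range(k)]  # keep answers only
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / k
```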
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- Multicalibration for Confidence Scoring in LLMs [6.948522445499497]
This paper proposes the use of "multicalibration" to yield interpretable and reliable confidence scores for outputs generated by large language models (LLMs).
We show how to form groupings for prompt/completion pairs that are correlated with the probability of correctness via two techniques: clustering within an embedding space, and "self-annotation".
We show how our techniques can yield confidence scores that provide substantial improvements in fine-grained measures of both calibration and accuracy compared to existing methods.
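As a simplified illustration of the clustering-based grouping, the sketch below clusters embeddings and applies a single group-wise patching pass so each group's mean confidence matches its empirical accuracy; the full iterative multicalibration algorithm and the "self-annotation" groupings are not shown, and all names here are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def groupwise_calibrate(embeddings, raw_conf, correct, n_groups=10):
    """One patching pass over embedding-derived groups: shift each group's
    confidence scores so their mean matches the group's empirical accuracy.
    A simplification of multicalibration, which iterates such patches."""
    groups = KMeans(n_clusters=n_groups, n_init=10, random_state=0).fit_predict(embeddings)
    conf = np.asarray(raw_conf, dtype=float).copy()
    acc = np.asarray(correct, dtype=float)
    for g in range(n_groups):
        mask = groups == g
        if mask.any():
            conf[mask] += acc[mask].mean() - conf[mask].mean()
    return np.clip(conf, 0.0, 1.0), groups
```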
arXiv Detail & Related papers (2024-04-06T17:33:37Z)