Related papers: Quantifying Prediction Consistency Under Fine-Tuning Multiplicity in Tabular LLMs

Quantifying Prediction Consistency Under Fine-Tuning Multiplicity in Tabular LLMs

URL: http://arxiv.org/abs/2407.04173v2
Date: Wed, 04 Jun 2025 07:12:58 GMT
Title: Quantifying Prediction Consistency Under Fine-Tuning Multiplicity in Tabular LLMs
Authors: Faisal Hamman, Pasan Dissanayake, Saumitra Mishra, Freddy Lecue, Sanghamitra Dutta,
Abstract summary: Fine-tuning multiplicity can arise in Tabular LLMs on classification tasks.<n>Our work formalizes this unique challenge of fine-tuning multiplicity in Tabular LLMs.<n>We propose a novel measure to quantify consistency of individual predictions without expensive model retraining.
Score: 10.494477811252034
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Fine-tuning LLMs on tabular classification tasks can lead to the phenomenon of fine-tuning multiplicity where equally well-performing models make conflicting predictions on the same input. Fine-tuning multiplicity can arise due to variations in the training process, e.g., seed, weight initialization, minor changes to training data, etc., raising concerns about the reliability of Tabular LLMs in high-stakes applications such as finance, hiring, education, healthcare. Our work formalizes this unique challenge of fine-tuning multiplicity in Tabular LLMs and proposes a novel measure to quantify the consistency of individual predictions without expensive model retraining. Our measure quantifies a prediction's consistency by analyzing (sampling) the model's local behavior around that input in the embedding space. Interestingly, we show that sampling in the local neighborhood can be leveraged to provide probabilistic guarantees on prediction consistency under a broad class of fine-tuned models, i.e., inputs with sufficiently high local stability (as defined by our measure) also remain consistent across several fine-tuned models with high probability. We perform experiments on multiple real-world datasets to show that our local stability measure preemptively captures consistency under actual multiplicity across several fine-tuned models, outperforming competing measures.

Related papers

An Information-Theoretic Perspective on Multi-LLM Uncertainty Estimation [7.018119896897734]
Large language models (LLMs) often behave inconsistently across inputs, indicating uncertainty and motivating the need for its quantification in high-stakes settings.<n>We propose MUSE (Multi-LLM Uncertainty via Subset Ensembles), a simple information-theoretic method that uses Jensen-Shannon Divergence to identify and aggregate well-calibrated subsets of LLMs.<n>Experiments on binary prediction tasks demonstrate improved calibration and predictive performance compared to single-model and naive ensemble baselines.
arXiv Detail & Related papers (2025-07-09T19:13:25Z)
Can Large Language Models Trigger a Paradigm Shift in Travel Behavior Modeling? Experiences with Modeling Travel Satisfaction [2.2974830861901414]
This study uses data on travel satisfaction from a household survey in shanghai to identify the existence and source of misalignment between Large Language Models and human behavior.<n>We find that the zero-shot LLM exhibits behavioral misalignment, resulting in relatively low prediction accuracy.<n>We propose an LLM-based modeling approach that can be applied to model travel behavior using samples of small sizes.
arXiv Detail & Related papers (2025-05-29T09:11:58Z)
Data-Driven Calibration of Prediction Sets in Large Vision-Language Models Based on Inductive Conformal Prediction [0.0]
We propose a model-agnostic uncertainty quantification method that integrates dynamic threshold calibration and cross-modal consistency verification. We show that the framework achieves stable performance across varying calibration-to-test split ratios, underscoring its robustness for real-world deployment in healthcare, autonomous systems, and other safety-sensitive domains. This work bridges the gap between theoretical reliability and practical applicability in multi-modal AI systems, offering a scalable solution for hallucination detection and uncertainty-aware decision-making.
arXiv Detail & Related papers (2025-04-24T15:39:46Z)
Quantifying the Prediction Uncertainty of Machine Learning Models for Individual Data [2.1248439796866228]
This study investigates pNML's learnability for linear regression and neural networks.<n>It demonstrates that pNML can improve the performance and robustness of these models on various tasks.
arXiv Detail & Related papers (2024-12-10T13:58:19Z)
Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios. We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples. Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode. We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
Semantic Density: Uncertainty Quantification for Large Language Models through Confidence Measurement in Semantic Space [14.715989394285238]
Existing Large Language Models (LLMs) do not have an inherent functionality to provide the users with an uncertainty/confidence metric for each response it generates. A new framework is proposed in this paper to address these issues. Semantic density extracts uncertainty/confidence information for each response from a probability distribution perspective in semantic space.
arXiv Detail & Related papers (2024-05-22T17:13:49Z)
BIRD: A Trustworthy Bayesian Inference Framework for Large Language Models [52.46248487458641]
Predictive models often need to work with incomplete information in real-world tasks. Current large language models (LLM) are insufficient for such accurate estimations. We propose BIRD, a novel probabilistic inference framework.
arXiv Detail & Related papers (2024-04-18T20:17:23Z)
Language Model Cascades: Token-level uncertainty and beyond [65.38515344964647]
Recent advances in language models (LMs) have led to significant improvements in quality on complex NLP tasks. Cascading offers a simple strategy to achieve more favorable cost-quality tradeoffs. We show that incorporating token-level uncertainty through learned post-hoc deferral rules can significantly outperform simple aggregation strategies.
arXiv Detail & Related papers (2024-04-15T21:02:48Z)
Calibrating Large Language Models with Sample Consistency [76.23956851098598]
We explore the potential of deriving confidence from the distribution of multiple randomly sampled model generations, via three measures of consistency. Results show that consistency-based calibration methods outperform existing post-hoc approaches. We offer practical guidance on choosing suitable consistency metrics for calibration, tailored to the characteristics of various LMs.
arXiv Detail & Related papers (2024-02-21T16:15:20Z)
Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling [69.83976050879318]
In large language models (LLMs), identifying sources of uncertainty is an important step toward improving reliability, trustworthiness, and interpretability. In this paper, we introduce an uncertainty decomposition framework for LLMs, called input clarification ensembling. Our approach generates a set of clarifications for the input, feeds them into an LLM, and ensembles the corresponding predictions.
arXiv Detail & Related papers (2023-11-15T05:58:35Z)
Tailoring Language Generation Models under Total Variation Distance [55.89964205594829]
The standard paradigm of neural language generation adopts maximum likelihood estimation (MLE) as the optimizing method. We develop practical bounds to apply it to language generation. We introduce the TaiLr objective that balances the tradeoff of estimating TVD.
arXiv Detail & Related papers (2023-02-26T16:32:52Z)
Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches, is to update the prototype of each class with the mean of the most confident query examples. We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries. We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.