Related papers: A Probabilistic Perspective on Unlearning and Alignment for Large Language Models

A Probabilistic Perspective on Unlearning and Alignment for Large Language Models

URL: http://arxiv.org/abs/2410.03523v3
Date: Wed, 6 Nov 2024 17:19:39 GMT
Title: A Probabilistic Perspective on Unlearning and Alignment for Large Language Models
Authors: Yan Scholten, Stephan Günnemann, Leo Schwinn,
Abstract summary: We introduce the first formal probabilistic evaluation framework in Large Language Models (LLMs) We derive novel metrics with high-probability guarantees concerning the output distribution of a model. Our metrics are application-independent and allow practitioners to make more reliable estimates about model capabilities before deployment.
Score: 48.96686419141881
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Comprehensive evaluation of Large Language Models (LLMs) is an open research problem. Existing evaluations rely on deterministic point estimates generated via greedy decoding. However, we find that deterministic evaluations fail to capture the whole output distribution of a model, yielding inaccurate estimations of model capabilities. This is particularly problematic in critical contexts such as unlearning and alignment, where precise model evaluations are crucial. To remedy this, we introduce the first formal probabilistic evaluation framework in LLMs. Namely, we derive novel metrics with high-probability guarantees concerning the output distribution of a model. Our metrics are application-independent and allow practitioners to make more reliable estimates about model capabilities before deployment. Through a case study focused on unlearning, we reveal that deterministic evaluations falsely indicate successful unlearning, whereas our probabilistic evaluations demonstrate that most if not all of the supposedly unlearned information remains accessible in these models. Additionally, we propose a novel unlearning loss based on entropy optimization and adaptive temperature scaling, which significantly improves unlearning in probabilistic settings on recent benchmarks. Our proposed shift from point estimates to probabilistic evaluations of output distributions represents an important step toward comprehensive evaluations of LLMs. Code available at https://github.com/yascho/probabilistic-unlearning.

Related papers

Learning Ordinal Probabilistic Reward from Preferences [25.069054134899744]
We introduce a novel reward modeling paradigm: Probabilistic Reward Model (PRM)<n>Instead of modeling reward as a deterministic scalar, our approach treats it as a random variable, learning a full probability distribution for the quality of each response.<n>Building on OPRM, we propose a data-efficient training strategy called Region Flooding Tuning (RgFT)
arXiv Detail & Related papers (2026-02-13T06:43:02Z)
D-Models and E-Models: Diversity-Stability Trade-offs in the Sampling Behavior of Large Language Models [91.21455683212224]
In large language models (LLMs), the probability of relevance for the next piece of information is linked to the probability of relevance for the next product.<n>But whether fine-grained sampling probabilities faithfully align with task requirements remains an open question.<n>We identify two model types: D-models, whose P_token exhibits large step-to-step variability and poor alignment with P_task; and E-models, whose P_token is more stable and better aligned with P_task.
arXiv Detail & Related papers (2026-01-25T14:59:09Z)
Reference-Specific Unlearning Metrics Can Hide the Truth: A Reality Check [60.77691669644931]
We propose Functional Alignment for Distributional Equivalence (FADE), a novel metric that measures distributional similarity between unlearned and reference models.<n>We show that FADE captures functional alignment across the entire output distribution, providing a principled assessment of genuine unlearning.<n>These findings expose fundamental gaps in current evaluation practices and demonstrate that FADE provides a more robust foundation for developing and assessing truly effective unlearning methods.
arXiv Detail & Related papers (2025-10-14T20:50:30Z)
Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs [7.197702136906138]
We propose an uncertainty-aware fairness metric, UCerF, to enable a fine-grained evaluation of model fairness.<n> observing data size, diversity, and clarity issues in current datasets, we introduce a new gender-occupation fairness evaluation dataset.<n>We establish a benchmark, using our metric and dataset, and apply it to evaluate the behavior of ten open-source AI systems.
arXiv Detail & Related papers (2025-05-29T20:45:18Z)
Always Tell Me The Odds: Fine-grained Conditional Probability Estimation [37.950889606305836]
We present a state-of-the-art model for fine-grained probability estimation of propositions conditioned on context.<n>We show that our approach consistently outperforms existing fine-tuned and prompting-based methods by a large margin.
arXiv Detail & Related papers (2025-05-02T21:33:18Z)
Confidence in Large Language Model Evaluation: A Bayesian Approach to Limited-Sample Challenges [13.526258635654882]
This study introduces a Bayesian approach for large language models (LLMs) capability assessment. We treat model capabilities as latent variables and leverage a curated query set to induce discriminative responses. Experimental evaluations with GPT-series models demonstrate that the proposed method achieves superior discrimination compared to conventional evaluation methods.
arXiv Detail & Related papers (2025-04-30T04:24:50Z)
Are We Truly Forgetting? A Critical Re-examination of Machine Unlearning Evaluation Protocols [14.961054239793356]
We introduce a rigorous unlearning evaluation setup, in which forgetting classes exhibit semantic similarity to downstream task classes.<n>We hope our benchmark serves as a standardized protocol for evaluating unlearning algorithms under realistic conditions.
arXiv Detail & Related papers (2025-03-10T07:11:34Z)
Model-free Methods for Event History Analysis and Efficient Adjustment (PhD Thesis) [55.2480439325792]
This thesis is a series of independent contributions to statistics unified by a model-free perspective. The first chapter elaborates on how a model-free perspective can be used to formulate flexible methods that leverage prediction techniques from machine learning. The second chapter studies the concept of local independence, which describes whether the evolution of one process is directly influenced by another.
arXiv Detail & Related papers (2025-02-11T19:24:09Z)
On Discriminative Probabilistic Modeling for Self-Supervised Representation Learning [85.75164588939185]
We study the discriminative probabilistic modeling on a continuous domain for the data prediction task of (multimodal) self-supervised representation learning. We conduct generalization error analysis to reveal the limitation of current InfoNCE-based contrastive loss for self-supervised representation learning. We propose a novel non-parametric method for approximating the sum of conditional probability densities required by MIS.
arXiv Detail & Related papers (2024-10-11T18:02:46Z)
Deep Probability Segmentation: Are segmentation models probability estimators? [0.7646713951724011]
We apply Calibrated Probability Estimation to segmentation tasks to evaluate its impact on model calibration. Results indicate that while CaPE improves calibration, its effect is less pronounced compared to classification tasks. We also investigated the influence of dataset size and bin optimization on the effectiveness of calibration.
arXiv Detail & Related papers (2024-09-19T07:52:19Z)
Bounding-Box Inference for Error-Aware Model-Based Reinforcement Learning [4.185571779339683]
In model-based reinforcement learning, simulated experiences are often treated as equivalent to experience from the real environment. We show that best results require distribution insensitive inference to estimate the uncertainty over model-based updates. We find that bounding-box inference can reliably support effective selective planning.
arXiv Detail & Related papers (2024-06-23T04:23:15Z)
Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode. We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
Rethinking Evaluation Metric for Probability Estimation Models Using Esports Data [8.10304644344495]
We propose a novel metric called Balance score which is a simple yet effective metric in terms of six good properties that probability estimation metric should have. Under the general condition, we also found that the Balance score can be an effective approximation of the true expected calibration error.
arXiv Detail & Related papers (2023-09-12T14:04:12Z)
Plan To Predict: Learning an Uncertainty-Foreseeing Model for Model-Based Reinforcement Learning [32.24146877835396]
We propose emphPlan To Predict (P2P), a framework that treats the model rollout process as a sequential decision making problem. We show that P2P achieves state-of-the-art performance on several challenging benchmark tasks.
arXiv Detail & Related papers (2023-01-20T10:17:22Z)
Exploring validation metrics for offline model-based optimisation with diffusion models [50.404829846182764]
In model-based optimisation (MBO) we are interested in using machine learning to design candidates that maximise some measure of reward with respect to a black box function called the (ground truth) oracle. While an approximation to the ground oracle can be trained and used in place of it during model validation to measure the mean reward over generated candidates, the evaluation is approximate and vulnerable to adversarial examples. This is encapsulated under our proposed evaluation framework which is also designed to measure extrapolation.
arXiv Detail & Related papers (2022-11-19T16:57:37Z)
The Implicit Delta Method [61.36121543728134]
In this paper, we propose an alternative, the implicit delta method, which works by infinitesimally regularizing the training loss of uncertainty. We show that the change in the evaluation due to regularization is consistent for the variance of the evaluation estimator, even when the infinitesimal change is approximated by a finite difference.
arXiv Detail & Related papers (2022-11-11T19:34:17Z)
Calibration tests beyond classification [30.616624345970973]
Most supervised machine learning tasks are subject to irreducible prediction errors. Probabilistic predictive models address this limitation by providing probability distributions that represent a belief over plausible targets. Calibrated models guarantee that the predictions are neither over- nor under-confident.
arXiv Detail & Related papers (2022-10-21T09:49:57Z)
BRIO: Bringing Order to Abstractive Summarization [107.97378285293507]
We propose a novel training paradigm which assumes a non-deterministic distribution. Our method achieves a new state-of-the-art result on the CNN/DailyMail (47.78 ROUGE-1) and XSum (49.07 ROUGE-1) datasets.
arXiv Detail & Related papers (2022-03-31T05:19:38Z)
Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning [78.83598532168256]
Marginal-likelihood based model-selection is rarely used in deep learning due to estimation difficulties. Our work shows that marginal likelihoods can improve generalization and be useful when validation data is unavailable.
arXiv Detail & Related papers (2021-04-11T09:50:24Z)
Variance based sensitivity analysis for Monte Carlo and importance sampling reliability assessment with Gaussian processes [0.0]
We propose a methodology to quantify the sensitivity of the probability of failure estimator to two uncertainty sources. This analysis also enables to control the whole error associated to the failure probability estimate and thus provides an accuracy criterion on the estimation. The approach is proposed for both a Monte Carlo based method as well as an importance sampling based method, seeking to improve the estimation of rare event probabilities.
arXiv Detail & Related papers (2020-11-30T17:06:28Z)
Bootstrapped model learning and error correction for planning with uncertainty in model-based RL [1.370633147306388]
A natural aim is to learn a model that reflects accurately the dynamics of the environment. This paper explores the problem of model misspecification through uncertainty-aware reinforcement learning agents. We propose a bootstrapped multi-headed neural network that learns the distribution of future states and rewards.
arXiv Detail & Related papers (2020-04-15T15:41:21Z)
Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches, is to update the prototype of each class with the mean of the most confident query examples. We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries. We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.