LLM Unlearning with LLM Beliefs
- URL: http://arxiv.org/abs/2510.19422v1
- Date: Wed, 22 Oct 2025 09:44:36 GMT
- Title: LLM Unlearning with LLM Beliefs
- Authors: Kemou Li, Qizhou Wang, Yue Wang, Fengpeng Li, Jun Liu, Bo Han, Jiantao Zhou
- Abstract summary: Large language models trained on vast corpora inherently risk memorizing sensitive or harmful content, which may later resurface in their outputs. We propose a bootstrapping framework that explicitly links the squeezing effect with the model's own high-confidence generations. By jointly suppressing both target responses and model beliefs, BS-T (token) attenuates high-probability tokens, whereas BS-S (sequence) removes entire high-confidence generations.
- Score: 39.271253385135644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models trained on vast corpora inherently risk memorizing sensitive or harmful content, which may later resurface in their outputs. Prevailing unlearning methods generally rely on gradient ascent and its variants to lower the probability of specific target responses. However, we find that this strategy induces a critical side effect: probability mass is redistributed into high-likelihood regions, often corresponding to semantically related rephrasings of the targets. We refer to this as the squeezing effect, which explains why many methods yield merely spurious unlearning, a problem further obscured by automated metrics (e.g., ROUGE, truth ratio) that misreport actual success. To address this, we propose a bootstrapping (BS) framework that explicitly links the squeezing effect with the model's own high-confidence generations, namely its model beliefs. Since model beliefs inherently capture the very high-likelihood regions where probability mass is squeezed, incorporating them into the unlearning objective directly counters the squeezing effect. By jointly suppressing both target responses and model beliefs, BS-T (token) attenuates high-probability tokens, whereas BS-S (sequence) removes entire high-confidence generations, together achieving more thorough forgetting while preserving utility. Extensive experiments across diverse benchmarks with various model families confirm the effectiveness of our approach.
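To make the objective concrete, here is a minimal PyTorch-style sketch of the token-level variant (BS-T) as the abstract describes it: a gradient-ascent term on the forget-set targets plus a term that also pushes down the model's own highest-confidence (belief) tokens. The function name, the Hugging Face-style `model(input_ids).logits` interface, and the weighting `beta` are illustrative assumptions, not the authors' implementation.

```python
import torch.nn.functional as F

def bs_t_loss(model, input_ids, labels, beta=1.0):
    """Token-level bootstrapped unlearning loss (sketch). Padding masks and
    prompt/response separation are omitted for brevity."""
    logits = model(input_ids).logits[:, :-1, :]          # next-token logits
    log_probs = F.log_softmax(logits, dim=-1)
    targets = labels[:, 1:]                              # shifted targets

    # (1) Gradient-ascent term: minimizing log p(target) lowers the
    #     probability of the memorized response itself.
    target_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # (2) Belief term: also lower the probability of the model's own
    #     top-1 tokens, where squeezed mass would otherwise pool.
    belief_ids = log_probs.argmax(dim=-1)                # current "beliefs"
    belief_logp = log_probs.gather(-1, belief_ids.unsqueeze(-1)).squeeze(-1)

    return target_logp.mean() + beta * belief_logp.mean()
```

Under such a loss, probability pushed out of the target string cannot simply settle on the model's current best paraphrase, since those tokens are penalized in the same step; per the abstract, BS-S applies the same idea at the level of whole sampled sequences.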
Related papers
- Amortized Reasoning Tree Search: Decoupling Proposal and Decision in Large Language Models [2.5170433424424874]
Reinforcement Learning with Verifiable Rewards has established itself as the dominant paradigm for instilling rigorous reasoning capabilities in Large Language Models. We identify a critical pathology in this alignment process: the systematic suppression of valid but rare (low-likelihood under the base model distribution) reasoning paths. We propose Amortized Reasoning Tree Search (ARTS) to counteract this collapse without discarding the base model's latent diversity.
arXiv Detail & Related papers (2026-02-13T11:52:50Z)
- Reference-Specific Unlearning Metrics Can Hide the Truth: A Reality Check [60.77691669644931]
We propose Functional Alignment for Distributional Equivalence (FADE), a novel metric that measures distributional similarity between unlearned and reference models. We show that FADE captures functional alignment across the entire output distribution, providing a principled assessment of genuine unlearning. These findings expose fundamental gaps in current evaluation practices and demonstrate that FADE provides a more robust foundation for developing and assessing truly effective unlearning methods.
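The summary does not give FADE's exact formulation, but the core idea it describes (scoring unlearning by distributional similarity to a reference model rather than by string overlap) can be sketched as follows; the function name and the mean per-token KL are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def distributional_gap(unlearned, reference, input_ids):
    """Mean per-token KL(reference || unlearned) over a sequence.
    Low values mean the unlearned model's full output distribution
    matches the reference, not merely its greedy strings."""
    log_ref = F.log_softmax(reference(input_ids).logits, dim=-1)
    log_unl = F.log_softmax(unlearned(input_ids).logits, dim=-1)
    # kl_div(input, target, log_target=True) computes KL(target || input)
    kl = F.kl_div(log_unl, log_ref, log_target=True, reduction="none").sum(-1)
    return kl.mean().item()
```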
arXiv Detail & Related papers (2025-10-14T20:50:30Z)
- Efficient Uncertainty in LLMs through Evidential Knowledge Distillation [3.864321514889099]
We introduce a novel approach enabling efficient and effective uncertainty estimation in LLMs without sacrificing performance. We distill uncertainty-aware teacher models into compact student models sharing the same architecture but fine-tuned using Low-Rank Adaptation (LoRA). Empirical evaluations on classification datasets demonstrate that such students can achieve comparable or superior predictive and uncertainty quantification performance.
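The summary leaves the training objective unspecified; a generic uncertainty-aware distillation step consistent with its description might look like the sketch below, where a LoRA-adapted student matches the teacher's full predictive distribution (so calibrated uncertainty transfers, not just the argmax). The names, temperature, and frozen-base assumption are illustrative.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, input_ids, temperature=2.0):
    """One distillation step: match the teacher's softened distribution.
    Assumes the student's base weights are frozen so that only the LoRA
    adapter parameters receive gradients."""
    with torch.no_grad():
        t_log = F.log_softmax(teacher(input_ids).logits / temperature, dim=-1)
    s_log = F.log_softmax(student(input_ids).logits / temperature, dim=-1)
    loss = F.kl_div(s_log, t_log, log_target=True, reduction="batchmean")
    (loss * temperature**2).backward()   # standard KD gradient scaling
    return loss.item()
```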
arXiv Detail & Related papers (2025-07-24T12:46:40Z)
- Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge [1.1666234644810893]
Small models outperform larger ones in safety, suggesting that training and architecture may matter more than scale. No model is fully robust to adversarial elicitation, with jailbreak attacks using low-resource languages or refusal suppression proving effective.
arXiv Detail & Related papers (2025-04-10T16:00:59Z)
- Transferable Adversarial Attacks on SAM and Its Downstream Models [87.23908485521439]
This paper explores the feasibility of adversarially attacking various downstream models fine-tuned from the Segment Anything Model (SAM). To enhance the effectiveness of the adversarial attack towards models fine-tuned on unknown datasets, we propose a universal meta-initialization (UMI) algorithm.
arXiv Detail & Related papers (2024-10-26T15:04:04Z)
- Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models [79.76293901420146]
Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial.
Our research investigates the fragility of uncertainty estimation and explores potential attacks.
We demonstrate that an attacker can embed a backdoor in LLMs, which, when activated by a specific trigger in the input, manipulates the model's uncertainty without affecting the final output.
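A toy numerical example makes the threat model concrete: two output distributions can share the same top answer while reporting very different entropy, which is exactly the gap such a backdoor exploits. The numbers below are invented for illustration.

```python
import torch

def entropy(p):
    return -(p * p.log()).sum().item()   # Shannon entropy in nats

# Same argmax (answer 0), very different reported uncertainty: a trigger
# that moves the model between these regimes leaves accuracy untouched
# while corrupting any abstention or routing rule built on uncertainty.
clean     = torch.tensor([0.90, 0.05, 0.05])
triggered = torch.tensor([0.40, 0.30, 0.30])
assert clean.argmax() == triggered.argmax()
print(entropy(clean), entropy(triggered))  # ~0.39 vs ~1.09 nats
```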
arXiv Detail & Related papers (2024-07-15T23:41:11Z)
- Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models. It addresses two key challenges: the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
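The summary does not specify the measure itself; in its simplest form, the sample-and-agree idea behind explanation-based confidence can be sketched as below, where `generate` is an assumed callable that samples one (explanation, answer) pair per call. The paper's "stable explanations" criterion is more refined than raw answer agreement.

```python
from collections import Counter

def explanation_confidence(generate, prompt, k=10):
    """Sample k explanations and score confidence as the empirical
    frequency of the modal final answer across them."""
    answers = [generate(prompt)[1] for _ in range(k)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / k
```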
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration [32.15773300068426]
Membership Inference Attacks aim to infer whether a target data record has been utilized for model training.
We propose a Membership Inference Attack based on Self-calibrated Probabilistic Variation (SPV-MIA).
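SPV-MIA's self-prompt calibration and probabilistic-variation signal are more elaborate than this, but the underlying calibrated-score idea can be sketched as follows: flag a record as a likely training member when the target model fits it unusually well relative to a calibration model. All names and the plain NLL difference are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sequence_nll(model, input_ids):
    """Average next-token negative log-likelihood of a sequence."""
    logits = model(input_ids).logits[:, :-1, :]
    targets = input_ids[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1)).item()

@torch.no_grad()
def membership_score(target_model, calib_model, input_ids):
    """Positive scores: the target model fits the record much better
    than the calibration model does, suggesting membership."""
    return sequence_nll(calib_model, input_ids) - sequence_nll(target_model, input_ids)
```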
arXiv Detail & Related papers (2023-11-10T13:55:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.