CATNIP: LLM Unlearning via Calibrated and Tokenized Negative Preference Alignment
- URL: http://arxiv.org/abs/2602.02824v1
- Date: Mon, 02 Feb 2026 21:23:54 GMT
- Title: CATNIP: LLM Unlearning via Calibrated and Tokenized Negative Preference Alignment
- Authors: Zhengbang Yang, Yisheng Zhong, Junyuan Hong, Zhuangdi Zhu,
- Abstract summary: Existing approaches, rooted in Gradient Ascent (GA), often degrade general domain knowledge while relying on retention data or curated contrastive pairs. We develop a principled method that rescales unlearning effects in proportion to the model's token-level confidence. Our work enables effective unlearning without requiring retention data or contrastive unlearning response pairs.
- Score: 14.853204323785334
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pretrained knowledge memorized in LLMs raises critical concerns over safety and privacy, which has motivated LLM Unlearning as a technique for selectively removing the influences of undesirable knowledge. Existing approaches, rooted in Gradient Ascent (GA), often degrade general domain knowledge while relying on retention data or curated contrastive pairs, which can be either impractical or data and computationally prohibitive. Negative Preference Alignment has been explored for unlearning to tackle the limitations of GA, which, however, remains confined by its choice of reference model and shows undermined performance in realistic data settings. These limitations raise two key questions: i) Can we achieve effective unlearning that quantifies model confidence in undesirable knowledge and uses it to calibrate gradient updates more precisely, thus reducing catastrophic forgetting? ii) Can we make unlearning robust to data scarcity and length variation? We answer both questions affirmatively with CATNIP (Calibrated and Tokenized Negative Preference Alignment), a principled method that rescales unlearning effects in proportion to the model's token-level confidence, thus ensuring fine-grained control over forgetting. Extensive evaluations on MUSE and WMDP benchmarks demonstrated that our work enables effective unlearning without requiring retention data or contrastive unlearning response pairs, with stronger knowledge forgetting and preservation tradeoffs than state-of-the-art methods.
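The abstract describes rescaling unlearning effects in proportion to the model's token-level confidence but does not give the loss itself. The sketch below is therefore not the CATNIP objective; it only illustrates the general idea under stated assumptions. The function names `token_confidences` and `confidence_weighted_forget_loss`, the choice of the target-token probability as the "confidence", and the use of that probability as a constant per-token weight are all hypothetical choices made for this example.

```python
import numpy as np


def token_confidences(logits, targets):
    """Return the probability and log-probability of each target token.

    logits:  (T, V) array of next-token logits for a forget sequence.
    targets: (T,) integer array with the token observed at each position.
    """
    # Log-softmax with the usual max-subtraction for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_log_probs = log_probs[np.arange(len(targets)), targets]
    return np.exp(token_log_probs), token_log_probs


def confidence_weighted_forget_loss(logits, targets):
    """Hypothetical forget loss: log-likelihood weighted by token confidence.

    Minimizing this value lowers the likelihood of the forget tokens.
    Tokens the model already predicts with high probability get a larger
    weight; low-confidence tokens contribute little. The weights are treated
    as constants (in an autograd framework they would be detached).
    """
    confidence, token_log_probs = token_confidences(logits, targets)
    weights = confidence  # assumption: weight equals target-token probability
    loss = float((weights * token_log_probs).mean())
    return loss, weights


# Toy example: position 0 is predicted confidently, position 1 is uniform.
toy_logits = np.array([[5.0, 0.0, 0.0],
                       [0.0, 0.0, 0.0]])
toy_targets = np.array([0, 1])
toy_loss, toy_weights = confidence_weighted_forget_loss(toy_logits, toy_targets)
print("per-token weights:", toy_weights)
print("weighted forget loss:", toy_loss)
```

With uniform weights this reduces to plain Gradient Ascent on the forget sequence; the per-token weighting is what concentrates the update on tokens the model is confident about.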
Related papers
- MeGU: Machine-Guided Unlearning with Target Feature Disentanglement [73.49657372882082]
We propose a novel framework that guides unlearning through concept-aware re-alignment.
MeGU enables controlled and selective forgetting, effectively mitigating both under-unlearning and over-unlearning.
arXiv Detail & Related papers (2026-02-19T05:20:31Z)
- Rendering Data Unlearnable by Exploiting LLM Alignment Mechanisms [3.648393062009244]
Large language models (LLMs) are increasingly trained on massive, heterogeneous text corpora.
This raises serious concerns about the unauthorised use of proprietary or personal data during model training.
We propose Disclaimer Injection, a novel data-level defence that renders text unlearnable to LLMs.
arXiv Detail & Related papers (2026-01-06T20:34:15Z)
- Forgetting-MarI: LLM Unlearning via Marginal Information Regularization [6.979586479353831]
Existing unlearning methods often degrade model performance by removing more information than necessary when attempting to "forget" specific data.
We introduce Forgetting-MarI, an LLM unlearning framework that provably removes only the additional (marginal) information contributed by the data to be unlearned.
By penalizing marginal information, our method yields an explicit upper bound on the unlearn dataset's residual influence in the trained models, providing provable undetectability.
arXiv Detail & Related papers (2025-11-14T22:48:39Z)
- Unlearning That Lasts: Utility-Preserving, Robust, and Almost Irreversible Forgetting in LLMs [31.768387661474904]
Unlearning in large language models (LLMs) involves precisely removing specific information from a pre-trained model.
This is crucial to ensure safety of LLMs by deleting private data or harmful knowledge acquired during pre-training.
We introduce JensUn, where we leverage the Jensen-Shannon Divergence as the training objective for both forget and retain sets.
In extensive experiments, JensUn achieves better forget-utility trade-off than competing methods, and even demonstrates strong resilience to benign relearning.
arXiv Detail & Related papers (2025-09-02T20:38:53Z)
- GUARD: Guided Unlearning and Retention via Data Attribution for Large Language Models [17.83305806604326]
GUARD is a framework for guided unlearning and retention via data attribution.
It assigns adaptive, nonuniform unlearning weights to samples, inversely proportional to their proxy attribution scores.
We provide rigorous theoretical guarantees that GUARD significantly improves retention while maintaining forgetting metrics comparable to prior methods.
arXiv Detail & Related papers (2025-06-12T17:49:09Z)
- UniErase: Towards Balanced and Precise Unlearning in Language Models [69.04923022755547]
Large language models (LLMs) require iterative updates to address the outdated information problem.
UniErase is a novel unlearning framework that demonstrates precision and balanced performances between knowledge unlearning and ability retaining.
arXiv Detail & Related papers (2025-05-21T15:53:28Z)
- UPCORE: Utility-Preserving Coreset Selection for Balanced Unlearning [57.081646768835704]
User specifications or legal frameworks often require information to be removed from pretrained models, including large language models (LLMs).
This requires deleting or "forgetting" a set of data points from an already-trained model, which typically degrades its performance on other data points.
We propose UPCORE, a method-agnostic data selection framework for mitigating collateral damage during unlearning.
arXiv Detail & Related papers (2025-02-20T22:51:10Z)
- Towards Robust and Parameter-Efficient Knowledge Unlearning for LLMs [25.91643745340183]
Large Language Models (LLMs) have demonstrated strong reasoning and memorization capabilities via pretraining on massive textual corpora.
This poses risk of privacy and copyright violations, highlighting the need for efficient machine unlearning methods.
We propose Low-rank Knowledge Unlearning (LoKU), a novel framework that enables robust and efficient unlearning for LLMs.
arXiv Detail & Related papers (2024-08-13T04:18:32Z)
- Learn while Unlearn: An Iterative Unlearning Framework for Generative Language Models [52.40798352740857]
We introduce the Iterative Contrastive Unlearning (ICU) framework, which consists of three core components.
A Knowledge Unlearning Induction module targets specific knowledge for removal using an unlearning loss.
A Contrastive Learning Enhancement module preserves the model's expressive capabilities against the pure unlearning goal.
An Iterative Unlearning Refinement module dynamically adjusts the unlearning process through ongoing evaluation and updates.
arXiv Detail & Related papers (2024-07-25T07:09:35Z)
- Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models [79.28821338925947]
Domain-Class Incremental Learning is a realistic but challenging continual learning scenario.
To handle these diverse tasks, pre-trained Vision-Language Models (VLMs) are introduced for their strong generalizability.
This incurs a new problem: the knowledge encoded in the pre-trained VLMs may be disturbed when adapting to new tasks, compromising their inherent zero-shot ability.
Existing methods tackle it by tuning VLMs with knowledge distillation on extra datasets, which demands heavy overhead.
We propose the Distribution-aware Interference-free Knowledge Integration (DIKI) framework, retaining pre-trained knowledge of
arXiv Detail & Related papers (2024-07-07T12:19:37Z)
- Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models.
It addresses two key challenges: the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z)
- Unlearn What You Want to Forget: Efficient Unlearning for LLMs [92.51670143929056]
Large language models (LLMs) have achieved significant progress from pre-training on and memorizing a wide range of textual data.
This process might suffer from privacy issues and violations of data protection regulations.
We propose an efficient unlearning framework that could efficiently update LLMs without having to retrain the whole model after data removals.
arXiv Detail & Related papers (2023-10-31T03:35:59Z)