Related papers: Leverage Unlearning to Sanitize LLMs

URL: http://arxiv.org/abs/2510.21322v1
Date: Fri, 24 Oct 2025 10:28:40 GMT
Title: Leverage Unlearning to Sanitize LLMs
Authors: Antoine Boutet, Lucas Magnana
Abstract summary: We present SANI, an unlearning approach to sanitize language models. It relies on an erasure phase and a repair phase that 1) reset certain neurons in the last layers of the model to disrupt memorization of fine-grained information, and then 2) fine-tune the model while avoiding memorizing sensitive information. Results show that with only a few additional epochs of unlearning, the model is sanitized and the number of regurgitations is drastically reduced.
Score: 0.3867363075280543
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Pre-trained large language models (LLMs) are becoming useful for various tasks. To improve their performance on certain tasks, it is necessary to fine-tune them on specific data corpora (e.g., medical reports, business data). These specialized data corpora may contain sensitive data (e.g., personal or confidential data) that the model will memorize and is likely to regurgitate during subsequent use. This memorization of sensitive information poses a significant privacy or confidentiality issue. To remove this memorization and sanitize the model without requiring costly additional fine-tuning on a secured data corpus, we propose SANI. SANI is an unlearning approach to sanitize language models. It relies on an erasure phase and a repair phase that 1) reset certain neurons in the last layers of the model to disrupt the memorization of fine-grained information, and then 2) fine-tune the model while avoiding memorizing sensitive information. We comprehensively evaluate SANI in two settings: sanitizing a model fine-tuned and specialized on medical data by removing direct and indirect identifiers from what the model has memorized, and sanitizing a standard pre-trained model by removing specific terms defined as confidential information. Results show that with only a few additional epochs of unlearning, the model is sanitized and the number of regurgitations is drastically reduced. This approach can be particularly useful for hospitals or other industries that have already spent significant resources training models on large datasets and wish to sanitize them before sharing.
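The erasure step and the regurgitation count described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how neurons are chosen or how they are reset, so the random selection, the Gaussian re-initialization, and the names `reset_neurons`, `count_regurgitations`, `last_k`, `fraction`, and `scale` are all assumptions made for illustration. The repair phase (fine-tuning while avoiding sensitive data) is omitted.

```python
import random


def reset_neurons(layers, last_k, fraction, seed=0, scale=0.02):
    """Re-initialize a random fraction of neurons in the last `last_k` layers.

    `layers` is a list of weight matrices; each matrix is a list of rows,
    and each row holds the incoming weights of one neuron.
    Random selection is an assumption; the paper may use another criterion.
    Returns the new layers and the (layer, neuron) indices that were reset.
    """
    rng = random.Random(seed)
    # Copy every row so the original model is left untouched.
    new_layers = [[row[:] for row in layer] for layer in layers]
    reset = []
    first = max(0, len(layers) - last_k)
    for li in range(first, len(layers)):
        n = len(new_layers[li])
        k = int(n * fraction)
        for ni in rng.sample(range(n), k):
            width = len(new_layers[li][ni])
            # Overwrite the neuron's weights with small random values.
            new_layers[li][ni] = [rng.gauss(0.0, scale) for _ in range(width)]
            reset.append((li, ni))
    return new_layers, reset


def count_regurgitations(outputs, sensitive_terms):
    """Count generated outputs that contain at least one sensitive term."""
    terms = [t.lower() for t in sensitive_terms]
    return sum(any(t in out.lower() for t in terms) for out in outputs)
```

In a hypothetical evaluation loop, `count_regurgitations` would be run on model generations before and after sanitization to compare how often sensitive terms still appear.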

Related papers

Reveal and Release: Iterative LLM Unlearning with Self-generated Data [5.932877449308903]
We propose a "Reveal-and-Release" method to unlearn with self-generated data. We make incremental adjustments to the model's weight space with parameter-efficient modules trained on the forget data.
arXiv Detail & Related papers (2025-09-18T05:07:27Z)
Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs [54.167494079321465]
Current unlearning methods for LLMs optimize on the private information they seek to remove by incorporating it into their fine-tuning data. We propose a novel unlearning method, Partial Model Collapse (PMC), which does not require unlearning targets in the unlearning objective.
arXiv Detail & Related papers (2025-07-06T03:08:49Z)
FUNU: Boosting Machine Unlearning Efficiency by Filtering Unnecessary Unlearning [9.472692023087223]
We propose FUNU, a method to identify data points that lead to unnecessary unlearning. We provide a theoretical analysis of FUNU and conduct extensive experiments to validate its efficacy.
arXiv Detail & Related papers (2025-01-28T01:19:07Z)
AutoElicit: Using Large Language Models for Expert Prior Elicitation in Predictive Modelling [53.54623137152208]
We introduce AutoElicit to extract knowledge from large language models and construct priors for predictive models. We show these priors are informative and can be refined using natural language. We find that AutoElicit yields priors that can substantially reduce error over uninformative priors, using fewer labels, and consistently outperform in-context learning.
arXiv Detail & Related papers (2024-11-26T10:13:39Z)
REVS: Unlearning Sensitive Information in Language Models via Rank Editing in the Vocabulary Space [40.25037054636284]
Language models (LMs) risk inadvertently memorizing and divulging sensitive or personally identifiable information (PII) seen in training data, causing privacy concerns. We propose REVS, a novel non-gradient-based method for unlearning sensitive information from LMs.
arXiv Detail & Related papers (2024-06-13T17:02:32Z)
Quantifying and Analyzing Entity-level Memorization in Large Language Models [4.59914731734176]
Large language models (LLMs) have been proven capable of memorizing their training data. Privacy risks arising from memorization have attracted increasing attention. We propose a fine-grained, entity-level definition to quantify memorization with conditions and metrics closer to real-world scenarios.
arXiv Detail & Related papers (2023-08-30T03:06:47Z)
Emergent and Predictable Memorization in Large Language Models [23.567027014457775]
Memorization, or the tendency of large language models to output entire sequences from their training data verbatim, is a key concern for safely deploying language models. We seek to predict which sequences will be memorized before a large model's full train-time by extrapolating the memorization behavior of lower-compute trial runs. We provide further novel discoveries on the distribution of memorization scores across models and data.
arXiv Detail & Related papers (2023-04-21T17:58:31Z)
AI Model Disgorgement: Methods and Choices [127.54319351058167]
We introduce a taxonomy of possible disgorgement methods that are applicable to modern machine learning systems. We investigate the meaning of "removing the effects" of data in the trained model in a way that does not require retraining from scratch.
arXiv Detail & Related papers (2023-04-07T08:50:18Z)
Synthetic Model Combination: An Instance-wise Approach to Unsupervised Ensemble Learning [92.89846887298852]
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data. Instead, we are given access to a set of expert models and their predictions, alongside some limited information about the dataset used to train them.
arXiv Detail & Related papers (2022-10-11T10:20:31Z)
Quantifying Memorization Across Neural Language Models [61.58529162310382]
Large language models (LMs) have been shown to memorize parts of their training data, and when prompted appropriately, they will emit the memorized data verbatim. This is undesirable because memorization violates privacy (exposing user data), degrades utility (repeated easy-to-memorize text is often low quality), and hurts fairness (some texts are memorized over others). We describe three log-linear relationships that quantify the degree to which LMs emit memorized training data.
arXiv Detail & Related papers (2022-02-15T18:48:31Z)
SSSE: Efficiently Erasing Samples from Trained Machine Learning Models [103.43466657962242]
We propose an efficient and effective algorithm, SSSE, for sample erasure. In certain cases SSSE can erase samples almost as well as the optimal, yet impractical, gold standard of training a new model from scratch with only the permitted data.
arXiv Detail & Related papers (2021-07-08T14:17:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.