Related papers: REVS: Unlearning Sensitive Information in Language Models via Rank Editing in the Vocabulary Space

REVS: Unlearning Sensitive Information in Language Models via Rank Editing in the Vocabulary Space

URL: http://arxiv.org/abs/2406.09325v1
Date: Thu, 13 Jun 2024 17:02:32 GMT
Title: REVS: Unlearning Sensitive Information in Language Models via Rank Editing in the Vocabulary Space
Authors: Tomer Ashuach, Martin Tutek, Yonatan Belinkov,
Abstract summary: Large language models (LLMs) risk inadvertently memorizing and divulging sensitive or personally identifiable information (PII) seen in training data. We propose REVS, a novel model editing method for unlearning sensitive information from LLMs.
Score: 35.61862064581971
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) risk inadvertently memorizing and divulging sensitive or personally identifiable information (PII) seen in training data, causing privacy concerns. Current approaches to address this issue involve costly dataset scrubbing, or model filtering through unlearning and model editing, which can be bypassed through extraction attacks. We propose REVS, a novel model editing method for unlearning sensitive information from LLMs. REVS identifies and modifies a small subset of neurons relevant for each piece of sensitive information. By projecting these neurons to the vocabulary space (unembedding), we pinpoint the components driving its generation. We then compute a model edit based on the pseudo-inverse of the unembedding matrix, and apply it to de-promote generation of the targeted sensitive data. To adequately evaluate our method on truly sensitive information, we curate two datasets: an email dataset inherently memorized by GPT-J, and a synthetic social security number dataset that we tune the model to memorize. Compared to other state-of-the-art model editing methods, REVS demonstrates superior performance in both eliminating sensitive information and robustness to extraction attacks, while retaining integrity of the underlying model. The code and a demo notebook are available at https://technion-cs-nlp.github.io/REVS.

Related papers

Model Inversion Attacks on Llama 3: Extracting PII from Large Language Models [0.0]
Large language models (LLMs) have transformed natural language processing, but their ability to memorize training data poses significant privacy risks.<n>This paper investigates model inversion attacks on the Llama 3.2 model, a multilingual LLM developed by Meta.
arXiv Detail & Related papers (2025-07-06T17:24:17Z)
Breaking the Gold Standard: Extracting Forgotten Data under Exact Unlearning in Large Language Models [26.5039481643457]
We introduce a novel data extraction attack that compromises even exact unlearning.<n>We demonstrate our attack's effectiveness on a simulated medical diagnosis dataset.
arXiv Detail & Related papers (2025-05-30T09:09:33Z)
ASIDE: Architectural Separation of Instructions and Data in Language Models [87.16417239344285]
ASIDE allows language models to clearly separate instructions and data at the level of embeddings.<n>We demonstrate experimentally across a range of models, instruction-tuning LLMs with ASIDE leads to highly increased instruction-data separation without a loss in model utility.<n>We provide insights into the mechanism underlying our method through an analysis of the model representations.
arXiv Detail & Related papers (2025-03-13T17:17:17Z)
Improved Localized Machine Unlearning Through the Lens of Memorization [23.30800397324838]
We study localized unlearning, where the unlearning algorithm operates on a small subset of parameters. We propose an improved localization strategy that yields strong results when paired with existing unlearning algorithms. We also propose a new unlearning algorithm, Deletion by Example localization (DEL), that resets the parameters deemed-to-be most critical.
arXiv Detail & Related papers (2024-12-03T12:57:08Z)
Extracting Unlearned Information from LLMs with Activation Steering [46.16882599881247]
Unlearning has emerged as a solution to remove sensitive knowledge from models after training. We propose activation steering as a method for exact information retrieval from unlearned models. Our results demonstrate that exact information retrieval from unlearned models is possible, highlighting a severe vulnerability of current unlearning techniques.
arXiv Detail & Related papers (2024-11-04T21:42:56Z)
CodeUnlearn: Amortized Zero-Shot Machine Unlearning in Language Models Using Discrete Concept [5.345828824625758]
We propose a novel amortized unlearning approach using codebook features and Sparse Autoencoders (SAEs) By leveraging a bottleneck to decompose the activation space and regulate information flow, our method efficiently unlearns targeted information while preserving the model's performance on unrelated data.
arXiv Detail & Related papers (2024-10-08T10:26:22Z)
Learn while Unlearn: An Iterative Unlearning Framework for Generative Language Models [52.03511469562013]
We introduce the Iterative Contrastive Unlearning (ICU) framework, which consists of three core components. A Knowledge Unlearning Induction module targets specific knowledge for removal using an unlearning loss. A Contrastive Learning Enhancement module preserves the model's expressive capabilities against the pure unlearning goal. An Iterative Unlearning Refinement module dynamically adjusts the unlearning process through ongoing evaluation and updates.
arXiv Detail & Related papers (2024-07-25T07:09:35Z)
Robust Utility-Preserving Text Anonymization Based on Large Language Models [80.5266278002083]
Text anonymization is crucial for sharing sensitive data while maintaining privacy. Existing techniques face the emerging challenges of re-identification attack ability of Large Language Models. This paper proposes a framework composed of three LLM-based components -- a privacy evaluator, a utility evaluator, and an optimization component.
arXiv Detail & Related papers (2024-07-16T14:28:56Z)
Machine Unlearning in Large Language Models [0.7864304771129751]
This paper introduces a methodology to align large language models (LLMs) with ethical, privacy, and safety standards. Our approach aims to selectively erase or modify learned information in LLMs, targeting harmful responses and copyrighted content.
arXiv Detail & Related papers (2024-05-24T02:12:51Z)
The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements. LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information. Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z)
Learn to Unlearn for Deep Neural Networks: Minimizing Unlearning Interference with Gradient Projection [56.292071534857946]
Recent data-privacy laws have sparked interest in machine unlearning. Challenge is to discard information about the forget'' data without altering knowledge about remaining dataset. We adopt a projected-gradient based learning method, named as Projected-Gradient Unlearning (PGU) We provide empirically evidence to demonstrate that our unlearning method can produce models that behave similar to models retrained from scratch across various metrics even when the training dataset is no longer accessible.
arXiv Detail & Related papers (2023-12-07T07:17:24Z)
Unlearn What You Want to Forget: Efficient Unlearning for LLMs [92.51670143929056]
Large language models (LLMs) have achieved significant progress from pre-training on and memorizing a wide range of textual data. This process might suffer from privacy issues and violations of data protection regulations. We propose an efficient unlearning framework that could efficiently update LLMs without having to retrain the whole model after data removals.
arXiv Detail & Related papers (2023-10-31T03:35:59Z)
From Zero to Hero: Detecting Leaked Data through Synthetic Data Injection and Model Querying [10.919336198760808]
We introduce a novel methodology to detect leaked data that are used to train classification models. textscLDSS involves injecting a small volume of synthetic data--characterized by local shifts in class distribution--into the owner's dataset. This enables the effective identification of models trained on leaked data through model querying alone.
arXiv Detail & Related papers (2023-10-06T10:36:28Z)
AI Model Disgorgement: Methods and Choices [127.54319351058167]
We introduce a taxonomy of possible disgorgement methods that are applicable to modern machine learning systems. We investigate the meaning of "removing the effects" of data in the trained model in a way that does not require retraining from scratch.
arXiv Detail & Related papers (2023-04-07T08:50:18Z)
Reconstructing Training Data from Model Gradient, Provably [68.21082086264555]
We reconstruct the training samples from a single gradient query at a randomly chosen parameter value. As a provable attack that reveals sensitive training data, our findings suggest potential severe threats to privacy.
arXiv Detail & Related papers (2022-12-07T15:32:22Z)
Deep Regression Unlearning [6.884272840652062]
We introduce deep regression unlearning methods that generalize well and are robust to privacy attacks. We conduct regression unlearning experiments for computer vision, natural language processing and forecasting applications.
arXiv Detail & Related papers (2022-10-15T05:00:20Z)
Machine Unlearning of Features and Labels [72.81914952849334]
We propose first scenarios for unlearning and labels in machine learning models. Our approach builds on the concept of influence functions and realizes unlearning through closed-form updates of model parameters.
arXiv Detail & Related papers (2021-08-26T04:42:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.