Who's Harry Potter? Approximate Unlearning in LLMs
- URL: http://arxiv.org/abs/2310.02238v2
- Date: Wed, 4 Oct 2023 05:20:19 GMT
- Title: Who's Harry Potter? Approximate Unlearning in LLMs
- Authors: Ronen Eldan and Mark Russinovich
- Abstract summary: Large language models (LLMs) are trained on massive internet corpora that often contain copyrighted content.
This poses legal and ethical challenges for the developers and users of these models, as well as the original authors and publishers.
We propose a novel technique for unlearning a subset of the training data from an LLM, without having to retrain it from scratch.
- Score: 4.821438899378393
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are trained on massive internet corpora that
often contain copyrighted content. This poses legal and ethical challenges for
the developers and users of these models, as well as the original authors and
publishers. In this paper, we propose a novel technique for unlearning a subset
of the training data from an LLM, without having to retrain it from scratch.
We evaluate our technique on the task of unlearning the Harry Potter books
from the Llama2-7b model (a generative language model recently open-sourced by
Meta). While the model took over 184K GPU-hours to pretrain, we show that in
about 1 GPU hour of finetuning, we effectively erase the model's ability to
generate or recall Harry Potter-related content, while its performance on
common benchmarks (such as WinoGrande, HellaSwag, ARC, BoolQ, and PIQA) remains
almost unaffected. We make our fine-tuned model publicly available on
HuggingFace for community evaluation. To the best of our knowledge, this is the
first paper to present an effective technique for unlearning in generative
language models.
Our technique consists of three main components: First, we use a reinforced
model that is further trained on the target data to identify the tokens that
are most related to the unlearning target, by comparing its logits with those
of a baseline model. Second, we replace idiosyncratic expressions in the target
data with generic counterparts, and leverage the model's own predictions to
generate alternative labels for every token. These labels aim to approximate
the next-token predictions of a model that has not been trained on the target
data. Third, we finetune the model on these alternative labels, which
effectively erases the original text from the model's memory whenever it is
prompted with its context.
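As a concrete illustration of the first and third components, the sketch below combines the logits of a baseline model and a reinforced model into "generic" next-token labels and fine-tunes toward them. The abstract does not specify the combination rule, so the suppression formula, the alpha coefficient, and the toy tensors standing in for real model outputs are assumptions, not the paper's published method; the second component (swapping idiosyncratic expressions for generic counterparts) would be applied to the target text before these labels are computed.

```python
# Minimal sketch (PyTorch) of the label construction and fine-tuning loss.
# ASSUMPTIONS: the suppression rule, alpha=5.0, and the random toy tensors are
# illustrative choices, not the formula or hyperparameters from the paper.
import torch
import torch.nn.functional as F


def generic_labels(baseline_logits: torch.Tensor,
                   reinforced_logits: torch.Tensor,
                   alpha: float = 5.0) -> torch.Tensor:
    """Approximate the next-token distribution of a model never trained on the target.

    Tokens whose logits the reinforced model boosts relative to the baseline are
    treated as most related to the unlearning target and are suppressed.
    """
    combined = baseline_logits - alpha * torch.relu(reinforced_logits - baseline_logits)
    return F.softmax(combined, dim=-1)


def unlearning_loss(model_logits: torch.Tensor,
                    baseline_logits: torch.Tensor,
                    reinforced_logits: torch.Tensor,
                    alpha: float = 5.0) -> torch.Tensor:
    """Cross-entropy between the model being unlearned and the generic labels."""
    targets = generic_labels(baseline_logits, reinforced_logits, alpha)
    return -(targets * F.log_softmax(model_logits, dim=-1)).sum(dim=-1).mean()


if __name__ == "__main__":
    # Toy shapes: batch of 2 sequences, 8 positions, 32,000-token vocabulary.
    torch.manual_seed(0)
    baseline, reinforced = torch.randn(2, 8, 32000), torch.randn(2, 8, 32000)
    model_out = torch.randn(2, 8, 32000, requires_grad=True)
    loss = unlearning_loss(model_out, baseline, reinforced)
    loss.backward()  # gradients would drive a fine-tuning step on the real model
    print(f"toy unlearning loss: {loss.item():.4f}")
```

The intuition is that tokens whose probability the reinforced model boosts relative to the baseline are the ones most tied to the unlearning target, so pushing the fine-tuned distribution away from them approximates a model that never saw the target data.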
Related papers
- How Many Van Goghs Does It Take to Van Gogh? Finding the Imitation Threshold [50.33428591760124]
We study the relationship between a concept's frequency in the training dataset and the ability of a model to imitate it.
We propose an efficient approach that estimates the imitation threshold without incurring the colossal cost of training multiple models from scratch.
arXiv Detail & Related papers (2024-10-19T06:28:14Z) - FOCUS: Forging Originality through Contrastive Use in Self-Plagiarism for Language Models [38.76912842622624]
Pre-trained Language Models (PLMs) have shown impressive results in various Natural Language Generation (NLG) tasks.
This study introduces a unique "self-plagiarism" contrastive decoding strategy, aimed at boosting the originality of text produced by PLMs.
arXiv Detail & Related papers (2024-06-02T19:17:00Z) - Pandora's White-Box: Precise Training Data Detection and Extraction in Large Language Models [4.081098869497239]
We develop state-of-the-art privacy attacks against Large Language Models (LLMs).
New membership inference attacks (MIAs) against pretrained LLMs perform hundreds of times better than baseline attacks.
In fine-tuning, we find that a simple attack based on the ratio of the loss between the base and fine-tuned models is able to achieve near-perfect MIA performance (a toy sketch of this loss-ratio signal appears after the related papers list).
arXiv Detail & Related papers (2024-02-26T20:41:50Z) - CodeArt: Better Code Models by Attention Regularization When Symbols Are
Lacking [12.458135956476639]
Transformer-based code models have impressive performance in many software engineering tasks.
However, their effectiveness degrades when symbols are missing or not informative.
We propose a new method to pre-train general code models when symbols are lacking.
arXiv Detail & Related papers (2024-02-19T05:13:22Z) - Scalable Extraction of Training Data from (Production) Language Models [93.7746567808049]
This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset.
We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT.
arXiv Detail & Related papers (2023-11-28T18:47:03Z) - Generate to Understand for Representation [3.5325087487696463]
GUR is a pretraining framework that combines language modeling and contrastive learning objectives in a single training step.
GUR achieves impressive results without any labeled training data, outperforming all other pretrained baselines as a retriever on the recall benchmark in a zero-shot setting.
arXiv Detail & Related papers (2023-06-14T06:00:18Z) - Robust Preference Learning for Storytelling via Contrastive Reinforcement Learning [53.92465205531759]
Controlled automated story generation seeks to generate natural language stories satisfying constraints from natural language critiques or preferences.
We train a contrastive bi-encoder model to align stories with human critiques, building a general purpose preference model.
We further fine-tune the contrastive reward model using a prompt-learning technique to increase story generation robustness.
arXiv Detail & Related papers (2022-10-14T13:21:33Z) - Synthetic Model Combination: An Instance-wise Approach to Unsupervised
Ensemble Learning [92.89846887298852]
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data.
We are given access to a set of expert models and their predictions, alongside some limited information about the dataset used to train them.
arXiv Detail & Related papers (2022-10-11T10:20:31Z) - Paraphrastic Representations at Scale [134.41025103489224]
We release trained models for English, Arabic, German, French, Spanish, Russian, Turkish, and Chinese languages.
We train these models on large amounts of data, achieving significantly improved performance from the original papers.
arXiv Detail & Related papers (2021-04-30T16:55:28Z) - Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z) - To Transfer or Not to Transfer: Misclassification Attacks Against Transfer Learned Text Classifiers [10.762008415887195]
We present novel attack techniques that utilize unintended features learnt in the teacher (public) model to generate adversarial examples for student (downstream) models.
First, we propose a novel word-score-based attack algorithm for generating adversarial examples against student models trained using a context-free word-level embedding model.
Next, we present length-based and sentence-based misclassification attacks for the Fake News Detection task trained using a context-aware BERT model.
arXiv Detail & Related papers (2020-01-08T10:26:55Z)
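As noted in the Pandora's White-Box entry above, a loss-ratio comparison between a base and a fine-tuned model can serve as a membership-inference signal. The sketch below is a toy illustration only: the HuggingFace-style model interface, the per-sequence mean loss, and the 0.8 threshold are assumptions, not the attack as implemented in that paper.

```python
# Toy sketch (PyTorch, HuggingFace-style causal LM interface) of a loss-ratio
# membership signal. ASSUMPTIONS: the model call signature, per-sequence mean
# loss, and the 0.8 threshold are placeholders, not the published attack.
import torch
import torch.nn.functional as F


def sequence_loss(model, input_ids: torch.Tensor) -> torch.Tensor:
    """Mean next-token cross-entropy of a causal LM on one tokenized sequence."""
    with torch.no_grad():
        logits = model(input_ids.unsqueeze(0)).logits[0, :-1]
    return F.cross_entropy(logits, input_ids[1:])


def is_probable_member(base_model, finetuned_model, input_ids: torch.Tensor,
                       threshold: float = 0.8) -> bool:
    """Flag a sequence whose loss drops sharply after fine-tuning as a likely member."""
    ratio = sequence_loss(finetuned_model, input_ids) / sequence_loss(base_model, input_ids)
    return ratio.item() < threshold
```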