Towards Safer Large Language Models through Machine Unlearning
- URL: http://arxiv.org/abs/2402.10058v2
- Date: Wed, 5 Jun 2024 04:57:09 GMT
- Title: Towards Safer Large Language Models through Machine Unlearning
- Authors: Zheyuan Liu, Guangyao Dou, Zhaoxuan Tan, Yijun Tian, Meng Jiang
- Abstract summary: Selective Knowledge negation Unlearning (SKU) is designed to eliminate harmful knowledge while preserving utility on normal prompts.
The first stage identifies and acquires harmful knowledge within the model, whereas the second removes this knowledge.
Our experiments demonstrate that SKU identifies a good balance point between removing harmful information and preserving utility.
- Score: 19.698620794387338
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The rapid advancement of Large Language Models (LLMs) has demonstrated their vast potential across various domains, attributed to their extensive pretraining knowledge and exceptional generalizability. However, LLMs risk generating harmful content when faced with problematic prompts. To address this problem, existing work has attempted gradient-ascent-based approaches to prevent LLMs from producing harmful output. While these methods can be effective, they frequently degrade model utility on normal prompts. To address this gap, we introduce Selective Knowledge negation Unlearning (SKU), a novel unlearning framework for LLMs designed to eliminate harmful knowledge while preserving utility on normal prompts. Specifically, SKU consists of two stages: a harmful knowledge acquisition stage and a knowledge negation stage. The first stage aims to identify and acquire harmful knowledge within the model, whereas the second is dedicated to removing this knowledge. SKU selectively isolates and removes harmful knowledge in model parameters, ensuring the model's performance remains robust on normal prompts. Our experiments conducted across various LLM architectures demonstrate that SKU identifies a good balance point between removing harmful information and preserving utility.
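To make the two-stage idea concrete, below is a minimal PyTorch sketch of acquisition followed by negation. This is an illustration rather than the authors' released implementation: `harmful_loader` is a hypothetical DataLoader yielding Hugging-Face-style batches whose forward pass returns a `.loss`, and `alpha` is an assumed negation strength.

```python
import copy
import itertools
import torch

def sku_style_unlearn(model, harmful_loader, lr=1e-5, steps=100, alpha=1.0):
    # Snapshot the original weights before touching anything.
    base_state = copy.deepcopy(model.state_dict())

    # Stage 1: harmful knowledge acquisition. Fine-tune *toward* the
    # harmful data so the weight delta isolates a "harmful" task vector.
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    batches = itertools.cycle(harmful_loader)
    model.train()
    for _ in range(steps):
        loss = model(**next(batches)).loss  # Hugging-Face-style LM loss
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Stage 2: knowledge negation. Subtract the harmful delta from the
    # original weights instead of keeping the fine-tuned model.
    harmful_state = model.state_dict()
    negated = {k: base_state[k] - alpha * (harmful_state[k] - base_state[k])
               for k in base_state}
    model.load_state_dict(negated)
    return model
```

Negating a task vector in this way removes the harmful direction while leaving the rest of the parameter space, and hence utility on normal prompts, largely intact.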
Related papers
- SNAP: Unlearning Selective Knowledge in Large Language Models with Negative Instructions [37.172662930947446]
Instruction-following large language models (LLMs) can inadvertently disclose personal or copyrighted information.
We propose SNAP, an innovative framework designed to selectively unlearn information.
We evaluate our framework on various NLP benchmarks and demonstrate that our approach retains the original LLM capabilities.
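As a rough illustration of the negative-instruction idea (not the SNAP implementation itself), one could fine-tune with the standard causal-LM loss so that queries about the information to forget map to a non-informative response, while a retain set preserves general capability. The model name and data pairs below are placeholders.

```python
from torch.utils.data import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical data: forget pairs carry negative-instruction targets.
forget_pairs = [("Who is Jane Doe's employer?", "I'm not able to share that.")]
retain_pairs = [("What is the capital of France?", "Paris.")]

class PairDataset(Dataset):
    def __init__(self, pairs, tokenizer, max_len=64):
        self.items = [tokenizer(q + " " + a, truncation=True,
                                max_length=max_len, padding="max_length",
                                return_tensors="pt") for q, a in pairs]
    def __len__(self):
        return len(self.items)
    def __getitem__(self, i):
        item = {k: v.squeeze(0) for k, v in self.items[i].items()}
        item["labels"] = item["input_ids"].clone()
        item["labels"][item["attention_mask"] == 0] = -100  # mask padding
        return item

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="snap_sketch", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=PairDataset(forget_pairs + retain_pairs, tok),
)
trainer.train()
```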
arXiv Detail & Related papers (2024-06-18T06:54:05Z)
- Split, Unlearn, Merge: Leveraging Data Attributes for More Effective Unlearning in LLMs [18.629717934007513]
"SPlit, UNlearn, MerGE" (SPUNGE) is a framework that can be used with any unlearning method to amplify its effectiveness.
We empirically demonstrate that SPUNGE significantly improves the performance of two recent unlearning methods on state-of-the-art LLMs.
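A minimal sketch of the split-unlearn-merge pattern, assuming some base unlearning routine `unlearn_fn` (e.g. gradient ascent) and a simple weight-averaging merge; the paper may use a different merging scheme, so treat this as the shape of the idea only.

```python
import copy
import torch

def spunge(model, forget_data, attribute_of, unlearn_fn):
    """unlearn_fn(model, subset) -> unlearned model (any base method)."""
    # Split: group forget examples by a data attribute (e.g. topic, source).
    subsets = {}
    for ex in forget_data:
        subsets.setdefault(attribute_of(ex), []).append(ex)

    # Unlearn: apply the base method to a fresh copy per subset.
    states = [unlearn_fn(copy.deepcopy(model), s).state_dict()
              for s in subsets.values()]

    # Merge: naive parameter averaging of the unlearned copies.
    merged = {k: torch.stack([s[k].float() for s in states]).mean(dim=0)
              for k in states[0]}
    model.load_state_dict(merged)
    return model
```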
arXiv Detail & Related papers (2024-06-17T17:35:52Z)
- Large Language Model Unlearning via Embedding-Corrupted Prompts [10.889859281637406]
We present Embedding-COrrupted (ECO) Prompts, a lightweight unlearning framework for large language models.
We enforce an unlearned state during inference by employing a prompt classifier to identify prompts that fall within the forget scope and safeguard them.
We find that these embedding-corrupted prompts not only lead to desirable outputs that satisfy the unlearning objective but also closely approximate the output from a model that has never been trained on the data intended for forgetting.
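The mechanism could be sketched roughly as below, assuming a Hugging Face causal LM: `should_forget` stands in for the paper's prompt classifier, and plain Gaussian noise stands in for whatever corruption the method actually learns.

```python
import torch

@torch.no_grad()
def eco_style_generate(model, tokenizer, prompt, should_forget,
                       sigma=1.0, **gen_kwargs):
    inputs = tokenizer(prompt, return_tensors="pt")
    embeds = model.get_input_embeddings()(inputs["input_ids"])
    if should_forget(prompt):  # hypothetical prompt classifier
        # Corrupt the prompt embeddings so generation cannot draw on the
        # targeted knowledge; the base weights are never modified.
        embeds = embeds + sigma * torch.randn_like(embeds)
    return model.generate(inputs_embeds=embeds,
                          attention_mask=inputs["attention_mask"],
                          **gen_kwargs)
```

Because nothing is fine-tuned, the cost of "unlearning" here is paid once per flagged prompt at inference time, which is what makes the approach lightweight.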
arXiv Detail & Related papers (2024-06-12T06:56:20Z)
- Understanding Privacy Risks of Embeddings Induced by Large Language Models [75.96257812857554]
Large language models show early signs of artificial general intelligence but struggle with hallucinations.
One promising solution is to store external knowledge as embeddings, aiding LLMs in retrieval-augmented generation.
Recent studies experimentally showed that the original text can be partially reconstructed from text embeddings by pre-trained language models.
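As a far weaker stand-in for those trained reconstruction models, even a plain similarity search shows how much an embedding reveals; the encoder name and sentences below are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

enc = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder
stored = enc.encode("Patient John Smith was diagnosed with diabetes.")

# An attacker who obtains the stored vector can score candidate texts.
candidates = [
    "Patient John Smith was diagnosed with diabetes.",
    "The weather in Paris is mild in spring.",
    "Quarterly revenue grew by twelve percent.",
]
scores = util.cos_sim(stored, enc.encode(candidates))
print(candidates[int(scores.argmax())])  # recovers the sensitive sentence
```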
arXiv Detail & Related papers (2024-04-25T13:10:48Z)
- Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in this belief.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z)
- FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [57.747888532651]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z)
- A Comprehensive Study of Knowledge Editing for Large Language Models [82.65729336401027]
Large Language Models (LLMs) have shown extraordinary capabilities in understanding and generating text that closely mirrors human communication.
This paper defines the knowledge editing problem and provides a comprehensive review of cutting-edge approaches.
We introduce a new benchmark, KnowEdit, for a comprehensive empirical evaluation of representative knowledge editing approaches.
arXiv Detail & Related papers (2024-01-02T16:54:58Z)
- Supervised Knowledge Makes Large Language Models Better In-context Learners [94.89301696512776]
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering.
The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored.
We propose a framework that enhances the reliability of LLMs by: 1) generalizing to out-of-distribution data, 2) elucidating how LLMs benefit from discriminative models, and 3) minimizing hallucinations in generative tasks.
arXiv Detail & Related papers (2023-12-26T07:24:46Z)
- Knowledge Unlearning for LLMs: Tasks, Methods, and Challenges [11.228131492745842]
Large language models (LLMs) have spurred a new research paradigm in natural language processing.
Despite their excellent capability in knowledge-based question answering and reasoning, their potential to retain faulty or even harmful knowledge poses risks of malicious application.
Knowledge unlearning, derived from analogous studies on machine unlearning, presents a promising avenue to address this concern.
arXiv Detail & Related papers (2023-11-27T12:37:51Z)
- Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake Analysis [127.85293480405082]
The rapid development of large language models (LLMs) has not only provided numerous opportunities but also presented significant challenges.
Existing alignment methods usually steer LLMs toward favorable outcomes using human-annotated, flawless instruction-response pairs.
This study proposes a novel alignment technique based on mistake analysis, which deliberately exposes LLMs to erroneous content to learn the reasons for mistakes and how to avoid them.
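One plausible shape for such mistake-analysis training data (illustrative only; the paper's actual template may differ):

```python
# A single hypothetical example pairing a flawed response with an
# explanation of the mistake and a corrected response.
example = {
    "instruction": "How should I treat a severe allergic reaction at home?",
    "flawed_response": "Just wait it out; it usually passes.",
    "analysis": ("Harmful: severe allergic reactions can be "
                 "life-threatening and require epinephrine and emergency "
                 "care, not waiting."),
    "revised_response": ("Use an epinephrine auto-injector if available "
                         "and call emergency services immediately."),
}

def to_training_text(ex):
    # Flatten into one causal-LM training string.
    return (f"Instruction: {ex['instruction']}\n"
            f"Candidate response: {ex['flawed_response']}\n"
            f"Mistake analysis: {ex['analysis']}\n"
            f"Improved response: {ex['revised_response']}")
```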
arXiv Detail & Related papers (2023-10-16T14:59:10Z)
- Knowledge Sanitization of Large Language Models [4.722882736419499]
Large language models (LLMs) trained on a large corpus of Web data can potentially reveal sensitive or confidential information.
Our technique efficiently fine-tunes these models using the Low-Rank Adaptation (LoRA) method.
Experimental results in a closed-book question-answering task show that our straightforward method not only minimizes particular knowledge leakage but also preserves the overall performance of LLMs.
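A minimal sketch of sanitization-style fine-tuning with LoRA via the `peft` library; the target modules, base model, and data are assumptions rather than the paper's exact setup.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"],
                      task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters train

# Fine-tune (e.g. with transformers.Trainer) on pairs that map sensitive
# questions to a harmless refusal, leaving the frozen base weights intact:
sanitize_pairs = [("What is Alice's home address?", "I don't know.")]
```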
arXiv Detail & Related papers (2023-09-21T07:49:55Z)