Related papers: Machine Unlearning in Large Language Models

Machine Unlearning in Large Language Models

URL: http://arxiv.org/abs/2404.16841v1
Date: Sat, 3 Feb 2024 05:14:56 GMT
Title: Machine Unlearning in Large Language Models
Authors: Kongyang Chen, Zixin Wang, Bing Mi, Waixi Liu, Shaowei Wang, Xiaojun Ren, Jiaxing Shen,
Abstract summary: This paper introduces a novel machine unlearning framework into large language models. Our objectives are to make LLMs not produce harmful, hallucinatory, or privacy-compromising responses. Experimental results show that our approach effectively meets unlearning objectives without substantially compromising model performance.
Score: 8.14992136443131
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, large language models (LLMs) have emerged as a notable field, attracting significant attention for its ability to automatically generate intelligent contents for various application domains. However, LLMs still suffer from significant security and privacy issues. For example, LLMs might expose user privacy from hacking attacks or targeted prompts. To address this problem, this paper introduces a novel machine unlearning framework into LLMs. Our objectives are to make LLMs not produce harmful, hallucinatory, or privacy-compromising responses, while retaining their standard output capabilities. To accomplish this, we use an evaluative model to pinpoint dialogues needing unlearning. We also establish a distance loss to function as the model's negative loss, diverting it from previous undesirable outputs. Furthermore, we determine the expected output's cluster mean to formulate a positive loss, directing the model's outputs toward preferable outcomes without compromising its reasoning abilities and performance. Experimental results show that our approach effectively meets unlearning objectives without substantially compromising model performance.

Related papers

Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs [44.8238758047607]
Current unlearning methods for LLMs optimize on the private information they seek to remove by incorporating it into their training objectives.<n>We argue this not only risks reinforcing exposure to sensitive data, it also contradicts the principle of minimizing its use.<n>We propose a novel unlearning method - Partial Model Collapse (PMC), which does not require unlearning targets in the unlearning objective.
arXiv Detail & Related papers (2025-07-06T03:08:49Z)
Large Language Model Unlearning for Source Code [65.42425213605114]
PROD is a novel unlearning approach that enables LLMs to forget undesired code content while preserving their code generation capabilities.<n>Our evaluation demonstrates that PROD achieves superior balance between forget quality and model utility compared to existing unlearning approaches.
arXiv Detail & Related papers (2025-06-20T16:27:59Z)
Rethinking Post-Unlearning Behavior of Large Vision-Language Models [17.951441278605966]
We introduce a new unlearning task for Large Vision-Language Models (LVLMs)<n>This task requires models to provide privacy-preserving yet informative and visually grounded responses.<n>We also propose, a novel unlearning method that explicitly guides post-unlearning behavior toward a desirable output distribution.
arXiv Detail & Related papers (2025-06-03T07:28:22Z)
Multi-Objective Large Language Model Unlearning [3.372396620898397]
Gradient Ascent (GA) is a proactive way to decrease the prediction probability of the model on the target data. We propose Multi-Objective Large Language Model Unlearning (MOLLM) algorithm to overcome gradient explosion and catastrophic forgetting. Our empirical results verify that MoLLM outperforms the SOTA GA-based LLM unlearning methods in terms of unlearning effect and model utility preservation.
arXiv Detail & Related papers (2024-12-29T09:35:56Z)
A Closer Look at Machine Unlearning for Large Language Models [46.245404272612795]
Large language models (LLMs) may memorize sensitive or copyrighted content, raising privacy and legal concerns. We discuss several issues in machine unlearning for LLMs and provide our insights on possible approaches.
arXiv Detail & Related papers (2024-10-10T16:56:05Z)
Detecting AI Flaws: Target-Driven Attacks on Internal Faults in Language Models [27.397408870544453]
Large Language Models (LLMs) have become a focal point in the rapidly evolving field of artificial intelligence. A critical concern is the presence of toxic content within the pre-training corpus of these models, which can lead to the generation of inappropriate outputs. This paper proposes a target-driven attack paradigm that focuses on directly eliciting the target response instead of optimizing the prompts.
arXiv Detail & Related papers (2024-08-27T08:12:08Z)
MEGen: Generative Backdoor in Large Language Models via Model Editing [56.46183024683885]
Large language models (LLMs) have demonstrated remarkable capabilities. Their powerful generative abilities enable flexible responses based on various queries or instructions. This paper proposes an editing-based generative backdoor, named MEGen, aiming to create a customized backdoor for NLP tasks with the least side effects.
arXiv Detail & Related papers (2024-08-20T10:44:29Z)
SNAP: Unlearning Selective Knowledge in Large Language Models with Negative Instructions [37.172662930947446]
Instruction-following large language models (LLMs) inadvertently disclose personal or copyrighted information. We propose SNAP, an innovative framework designed to selectively unlearn information. We evaluate our framework on various NLP benchmarks and demonstrate that our approach retains the original LLM capabilities.
arXiv Detail & Related papers (2024-06-18T06:54:05Z)
Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions. Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z)
The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements. LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information. Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z)
Calibrating Large Language Models Using Their Generations Only [44.26441565763495]
APRICOT is a method to set confidence targets and train an additional model that predicts an LLM's confidence based on its textual input and output alone. It is conceptually simple, does not require access to the target model beyond its output, does not interfere with the language generation, and has a multitude of potential usages. We show how our approach performs competitively in terms of calibration error for white-box and black-box LLMs on closed-book question-answering to detect incorrect LLM answers.
arXiv Detail & Related papers (2024-03-09T17:46:24Z)
Rethinking Machine Unlearning for Large Language Models [85.92660644100582]
We explore machine unlearning in the domain of large language models (LLMs) This initiative aims to eliminate undesirable data influence (e.g., sensitive or illegal information) and the associated model capabilities.
arXiv Detail & Related papers (2024-02-13T20:51:58Z)
ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases. We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets. Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
Small Language Models Improve Giants by Rewriting Their Outputs [18.025736098795296]
We tackle the problem of leveraging training data to improve the performance of large language models (LLMs) without fine-tuning. We create a pool of candidates from the LLM through few-shot prompting and we employ a compact model, the LM-corrector (LMCor), specifically trained to merge these candidates to produce an enhanced output. Experiments on four natural language generation tasks demonstrate that even a small LMCor model (250M) substantially improves the few-shot performance of LLMs (62B), matching and even outperforming standard fine-tuning.
arXiv Detail & Related papers (2023-05-22T22:07:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.