LUNAR: LLM Unlearning via Neural Activation Redirection
- URL: http://arxiv.org/abs/2502.07218v1
- Date: Tue, 11 Feb 2025 03:23:22 GMT
- Title: LUNAR: LLM Unlearning via Neural Activation Redirection
- Authors: William F. Shen, Xinchi Qiu, Meghdad Kurmanji, Alex Iacob, Lorenzo Sani, Yihong Chen, Nicola Cancedda, Nicholas D. Lane,
- Abstract summary: Large Language Models (LLMs) benefit from training on ever larger amounts of textual data, but they increasingly incur the risk of leaking private information.
We propose LUNAR, a novel unlearning methodology grounded in the Linear Representation Hypothesis.
We show that LUNAR achieves state-of-the-art unlearning performance while significantly enhancing the controllability of the unlearned model during inference.
- Score: 20.60687563657169
- License:
- Abstract: Large Language Models (LLMs) benefit from training on ever larger amounts of textual data, but as a result, they increasingly incur the risk of leaking private information. The ability to selectively remove knowledge from LLMs is, therefore, a highly desirable capability. In this paper, we propose LUNAR, a novel unlearning methodology grounded in the Linear Representation Hypothesis. LUNAR operates by redirecting the representations of unlearned data to regions that trigger the model's inherent ability to express its inability to answer. LUNAR achieves state-of-the-art unlearning performance while significantly enhancing the controllability of the unlearned model during inference. Specifically, LUNAR achieves between 2.9x to 11.7x improvements on combined "unlearning efficacy" and "model utility" score ("Deviation Score") on the PISTOL dataset across various base models. We also demonstrate, through quantitative analysis and qualitative examples, LUNAR's superior controllability in generating coherent and contextually aware responses, mitigating undesired side effects of existing methods. Moreover, we demonstrate that LUNAR is robust against white-box adversarial attacks and versatile in handling real-world scenarios, such as processing sequential unlearning requests.
Related papers
- Clear Minds Think Alike: What Makes LLM Fine-tuning Robust? A Study of Token Perplexity [61.48338027901318]
We show that fine-tuning with LLM-generated data improves target task performance and reduces out-of-domain degradation.
This is the first mechanistic explanation for the superior OOD robustness conferred by LLM-generated training data.
arXiv Detail & Related papers (2025-01-24T08:18:56Z) - Exploring Knowledge Boundaries in Large Language Models for Retrieval Judgment [56.87031484108484]
Large Language Models (LLMs) are increasingly recognized for their practical applications.
Retrieval-Augmented Generation (RAG) tackles this challenge and has shown a significant impact on LLMs.
By minimizing retrieval requests that yield neutral or harmful results, we can effectively reduce both time and computational costs.
arXiv Detail & Related papers (2024-11-09T15:12:28Z) - Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse RL [7.988692259455583]
Large language models (LLMs) trained with Reinforcement Learning from Human Feedback have demonstrated remarkable capabilities, but their underlying reward functions and decision-making processes remain opaque.
This paper introduces a novel approach to interpreting LLMs by applying inverse reinforcement learning (IRL) to recover their implicit reward functions.
We conduct experiments on toxicity-aligned LLMs of varying sizes, extracting reward models that achieve up to 80.40% accuracy in predicting human preferences.
arXiv Detail & Related papers (2024-10-16T12:14:25Z) - Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z) - Large Language Models are Biased Reinforcement Learners [0.0]
We show that large language models (LLMs) exhibit behavioral signatures of a relative value bias.
Computational cognitive modeling reveals that LLM behavior is well-described by a simple RL algorithm.
arXiv Detail & Related papers (2024-05-19T01:43:52Z) - LLM In-Context Recall is Prompt Dependent [0.0]
A model's ability to do this significantly influences its practical efficacy and dependability in real-world applications.
This study demonstrates that an LLM's recall capability is not only contingent upon the prompt's content but also may be compromised by biases in its training data.
arXiv Detail & Related papers (2024-04-13T01:13:59Z) - Balancing Exploration and Exploitation in LLM using Soft RLLF for
Enhanced Negation Understanding [4.799288023353623]
Finetuning approaches in NLP often focus on exploitation rather than exploration, which may lead to suboptimal models.
We leverage Reinforcement Learning from Logical Feedback to create an effective balance between exploration and exploitation in language models.
This has implications for the development of more accurate, reliable, and logically consistent language models in high-stakes domains.
arXiv Detail & Related papers (2024-03-02T11:54:55Z) - Learning Objective-Specific Active Learning Strategies with Attentive
Neural Processes [72.75421975804132]
Learning Active Learning (LAL) suggests to learn the active learning strategy itself, allowing it to adapt to the given setting.
We propose a novel LAL method for classification that exploits symmetry and independence properties of the active learning problem.
Our approach is based on learning from a myopic oracle, which gives our model the ability to adapt to non-standard objectives.
arXiv Detail & Related papers (2023-09-11T14:16:37Z) - Improving Open Information Extraction with Large Language Models: A
Study on Demonstration Uncertainty [52.72790059506241]
Open Information Extraction (OIE) task aims at extracting structured facts from unstructured text.
Despite the potential of large language models (LLMs) like ChatGPT as a general task solver, they lag behind state-of-the-art (supervised) methods in OIE tasks.
arXiv Detail & Related papers (2023-09-07T01:35:24Z) - Large Language Models Are Latent Variable Models: Explaining and Finding
Good Demonstrations for In-Context Learning [104.58874584354787]
In recent years, pre-trained large language models (LLMs) have demonstrated remarkable efficiency in achieving an inference-time few-shot learning capability known as in-context learning.
This study aims to examine the in-context learning phenomenon through a Bayesian lens, viewing real-world LLMs as latent variable models.
arXiv Detail & Related papers (2023-01-27T18:59:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.