Don't Forget It! Conditional Sparse Autoencoder Clamping Works for Unlearning
- URL: http://arxiv.org/abs/2503.11127v1
- Date: Fri, 14 Mar 2025 06:43:19 GMT
- Title: Don't Forget It! Conditional Sparse Autoencoder Clamping Works for Unlearning
- Authors: Matthew Khoriaty, Andrii Shportko, Gustavo Mercier, Zach Wood-Doughty
- Abstract summary: Large Language Model (LLM) capabilities have brought great potential but also posed new risks. For example, LLMs with knowledge of bioweapons, advanced chemistry, or cyberattacks could cause violence if placed in the wrong hands or during malfunctions. Because of their nature as near-black boxes, intuitive interpretation of LLM internals remains an open research question.
- Score: 0.306238659426286
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent developments in Large Language Model (LLM) capabilities have brought great potential but also posed new risks. For example, LLMs with knowledge of bioweapons, advanced chemistry, or cyberattacks could cause violence if placed in the wrong hands or during malfunctions. Because of their nature as near-black boxes, intuitive interpretation of LLM internals remains an open research question, preventing developers from easily controlling model behavior and capabilities. The use of Sparse Autoencoders (SAEs) has recently emerged as a potential method of unraveling representations of concepts in LLM internals, and has allowed developers to steer model outputs by directly modifying the hidden activations. In this paper, we use SAEs to identify unwanted concepts from the Weapons of Mass Destruction Proxy (WMDP) dataset within gemma-2-2b internals and use feature steering to reduce the model's ability to answer harmful questions while retaining its performance on harmless queries. Our results renew optimism about the viability of SAE-based explicit knowledge unlearning techniques.
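To make the clamping idea concrete, here is a minimal PyTorch sketch of conditional SAE clamping under stated assumptions: it presumes an already-trained SAE for one residual-stream layer and a list of feature indices associated with WMDP concepts. The names (`SparseAutoencoder`, `conditional_clamp_hook`, `harmful_feature_ids`), the layer index, the threshold, and the clamp value are illustrative placeholders, not the paper's released code or hyperparameters.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Stand-in for a trained SAE over one residual-stream activation."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_features)
        self.dec = nn.Linear(d_features, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.enc(x))  # sparse feature activations

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return self.dec(f)  # reconstruction of the residual stream


def conditional_clamp_hook(sae: SparseAutoencoder,
                           harmful_feature_ids: list,
                           threshold: float = 1.0,
                           clamp_value: float = -5.0):
    """Build a forward hook that clamps the selected SAE features, but only
    on tokens where one of them fires above `threshold` (the conditional
    part); benign prompts pass through unchanged."""

    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output  # [batch, seq, d_model]
        feats = sae.encode(resid)
        sel = feats[..., harmful_feature_ids]                       # [batch, seq, k]
        if (sel > threshold).any():
            clamped = feats.clone()
            clamped[..., harmful_feature_ids] = torch.where(
                sel > threshold, torch.full_like(sel, clamp_value), sel)
            # Swap only the SAE-explained component of the residual stream,
            # leaving the SAE's reconstruction error untouched.
            resid = resid - sae.decode(feats) + sae.decode(clamped)
        return (resid,) + output[1:] if isinstance(output, tuple) else resid

    return hook


# Illustrative usage: attach the hook to one transformer block of gemma-2-2b
# (layer index, feature ids, and hyperparameters are placeholders).
# handle = model.model.layers[12].register_forward_hook(
#     conditional_clamp_hook(sae, harmful_feature_ids=[101, 2048]))
```

Replacing only the SAE-explained component (resid - decode(feats) + decode(clamped)) is one way to limit collateral damage on harmless queries, since the intervention fires only when a flagged feature exceeds the threshold.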
Related papers
- Internal Activation as the Polar Star for Steering Unsafe LLM Behavior [50.463399903987245]
We introduce SafeSwitch, a framework that dynamically regulates unsafe outputs by monitoring and utilizing the model's internal states.
Our empirical results show that SafeSwitch reduces harmful outputs by over 80% on safety benchmarks while maintaining strong utility.
arXiv Detail & Related papers (2025-02-03T04:23:33Z)
- When Machine Unlearning Meets Retrieval-Augmented Generation (RAG): Keep Secret or Forget Knowledge? [15.318301783084681]
Large language models (LLMs) can inadvertently learn and retain sensitive information and harmful content during training.
We propose a lightweight unlearning framework based on Retrieval-Augmented Generation (RAG) technology.
We evaluate our framework through extensive experiments on both open-source and closed-source models, including ChatGPT, Gemini, Llama-2-7b-chat-hf, and PaLM 2.
arXiv Detail & Related papers (2024-10-20T03:51:01Z)
- Answer When Needed, Forget When Not: Language Models Pretend to Forget via In-Context Knowledge Unlearning [26.861562920084264]
Large language models (LLMs) are applied across diverse domains.
We propose a novel method termed "in-context knowledge unlearning".
Our method fine-tunes pre-trained LLMs to enable prompt unlearning of target knowledge within the context.
arXiv Detail & Related papers (2024-10-01T04:13:25Z)
- Atoxia: Red-teaming Large Language Models with Target Toxic Answers [27.397408870544453]
Atoxia can successfully detect safety risks not only in open-source models but also in state-of-the-art black-box models such as GPT-4o.
arXiv Detail & Related papers (2024-08-27T08:12:08Z)
- MEGen: Generative Backdoor in Large Language Models via Model Editing [56.46183024683885]
Large language models (LLMs) have demonstrated remarkable capabilities.
Their powerful generative abilities enable flexible responses based on various queries or instructions.
This paper proposes an editing-based generative backdoor, named MEGen, aiming to create a customized backdoor for NLP tasks with the least side effects.
arXiv Detail & Related papers (2024-08-20T10:44:29Z)
- LLMs can learn self-restraint through iterative self-reflection [57.26854891567574]
Large Language Models (LLMs) must be capable of dynamically adapting their behavior based on their level of knowledge and uncertainty associated with specific topics.
This adaptive behavior, which we refer to as self-restraint, is non-trivial to teach.
We devise a utility function that can encourage the model to produce responses only when it is confident in them.
arXiv Detail & Related papers (2024-05-15T13:35:43Z)
- Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in this belief.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z)
- LLMs' Reading Comprehension Is Affected by Parametric Knowledge and Struggles with Hypothetical Statements [59.71218039095155]
The task of reading comprehension (RC) provides a primary means to assess language models' natural language understanding (NLU) capabilities.
If the context aligns with the models' internal knowledge, it is hard to discern whether the models' answers stem from context comprehension or from internal information.
To address this issue, we suggest using RC on imaginary data based on fictitious facts and entities.
arXiv Detail & Related papers (2024-04-09T13:08:56Z)
- Tuning-Free Accountable Intervention for LLM Deployment -- A Metacognitive Approach [55.613461060997004]
Large Language Models (LLMs) have catalyzed transformative advances across a spectrum of natural language processing tasks.
We propose an innovative metacognitive approach, dubbed CLEAR, to equip LLMs with capabilities for self-aware error identification and correction.
arXiv Detail & Related papers (2024-03-08T19:18:53Z)
- UNDIAL: Self-Distillation with Adjusted Logits for Robust Unlearning in Large Language Models [12.45822383965784]
We introduce UnDIAL (Unlearning via Self-Distillation on Adjusted Logits), a novel and robust unlearning method.
Our approach leverages self-distillation to adjust logits and selectively reduce the influence of targeted tokens (a rough sketch of this adjusted-logit idea appears after this list).
arXiv Detail & Related papers (2024-02-15T16:21:14Z)
- Machine Unlearning in Large Language Models [8.14992136443131]
This paper introduces a novel machine unlearning framework for large language models.
Our objectives are to make LLMs not produce harmful, hallucinatory, or privacy-compromising responses.
Experimental results show that our approach effectively meets unlearning objectives without substantially compromising model performance.
arXiv Detail & Related papers (2024-02-03T05:14:56Z)
- Open Sesame! Universal Black Box Jailbreaking of Large Language Models [0.0]
Large language models (LLMs) are designed to provide helpful and safe responses.
LLMs often rely on alignment techniques to align with user intent and social guidelines.
We introduce a novel approach that employs a genetic algorithm (GA) to manipulate LLMs when model architecture and parameters are inaccessible.
arXiv Detail & Related papers (2023-09-04T08:54:20Z)
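Of the entries above, UNDIAL's "self-distillation on adjusted logits" is described concretely enough to illustrate. The snippet below is a rough PyTorch interpretation of that one-line summary only, not the authors' implementation; the function name, the penalty `gamma`, the `unlearn_token_mask`, and the temperature are assumptions introduced for illustration.

```python
import torch
import torch.nn.functional as F


def adjusted_logit_distillation_loss(student_logits: torch.Tensor,
                                     teacher_logits: torch.Tensor,
                                     unlearn_token_mask: torch.Tensor,
                                     gamma: float = 5.0,
                                     temperature: float = 1.0) -> torch.Tensor:
    """Distill the student toward a frozen copy of its own (teacher)
    distribution whose logits are pushed down by `gamma` wherever the
    vocabulary entry carries knowledge to be unlearned.

    Shapes: student_logits, teacher_logits -> [batch, seq, vocab];
            unlearn_token_mask -> [batch, seq, vocab] (bool).
    """
    # Demote the targeted tokens in the (detached) teacher distribution only.
    adjusted = teacher_logits.detach() - gamma * unlearn_token_mask.float()

    vocab = student_logits.size(-1)
    teacher_probs = F.softmax(adjusted / temperature, dim=-1).reshape(-1, vocab)
    student_logprobs = F.log_softmax(student_logits / temperature, dim=-1).reshape(-1, vocab)

    # KL(teacher || student), averaged per token position.
    return F.kl_div(student_logprobs, teacher_probs, reduction="batchmean")
```

In this reading, self-distillation means the teacher logits come from the model itself before the unlearning update, so only the demoted tokens change the training signal while the rest of the distribution anchors retained behavior.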