Feature-Selective Representation Misdirection for Machine Unlearning
- URL: http://arxiv.org/abs/2512.16297v1
- Date: Thu, 18 Dec 2025 08:31:50 GMT
- Title: Feature-Selective Representation Misdirection for Machine Unlearning
- Authors: Taozhao Chen, Linghan Huang, Kim-Kwang Raymond Choo, Huaming Chen
- Abstract summary: Machine unlearning can help ensure deployed models comply with evolving legal, safety, and governance requirements. Current unlearning techniques assume clean separation between forget and retain datasets. We propose Selective Representation Misdirection for Unlearning (SRMU), a principled activation-editing framework.
- Score: 34.167873590478074
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models (LLMs) are increasingly adopted in safety-critical and regulated sectors, the retention of sensitive or prohibited knowledge introduces escalating risks, ranging from privacy leakage to regulatory non-compliance to potential misuse. Recent studies suggest that machine unlearning can help ensure deployed models comply with evolving legal, safety, and governance requirements. However, current unlearning techniques assume a clean separation between forget and retain datasets, which is hard to achieve in operational settings characterized by highly entangled distributions. In such scenarios, perturbation-based methods often degrade general model utility or fail to ensure safety. To address this, we propose Selective Representation Misdirection for Unlearning (SRMU), a principled activation-editing framework that enforces feature-aware, directionally controlled perturbations. Unlike indiscriminate perturbation of model weights, SRMU employs a structured misdirection vector gated by an activation-importance map, allowing it to selectively suppress harmful representations while preserving utility on benign ones. Experiments are conducted on the widely used WMDP benchmark across low- and high-entanglement configurations. Empirical results reveal that SRMU delivers state-of-the-art unlearning performance with minimal utility loss, and remains effective under 20-30% overlap where existing baselines collapse. SRMU provides a robust foundation for safety-driven model governance, privacy compliance, and controlled knowledge removal in emerging LLM-based applications. We release the replication package at https://figshare.com/s/d5931192a8824de26aff.
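The abstract names SRMU's two ingredients, a structured misdirection vector and an activation-importance map, but gives no pseudocode. The sketch below shows one plausible way to wire a feature-selective, directionally controlled activation edit into a transformer layer as a PyTorch forward hook; the names `misdirection`, `importance`, and `alpha`, the additive edit form, and the layer choice are all illustrative assumptions, not the paper's actual method.

```python
import torch

def make_srmu_style_hook(misdirection: torch.Tensor,
                         importance: torch.Tensor,
                         alpha: float = 1.0):
    """Sketch of a feature-selective activation edit.

    misdirection: (d,) direction assumed to point away from the
                  forget-set representation subspace (precomputed).
    importance:   (d,) per-feature weights in [0, 1]; near 1 for
                  features tied to the forget set, near 0 for features
                  the retain set relies on.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Directionally controlled, feature-gated perturbation: only
        # high-importance coordinates of the hidden state are steered.
        edited = hidden + alpha * (importance * misdirection)
        if isinstance(output, tuple):
            return (edited,) + output[1:]
        return edited
    return hook

# Hypothetical usage on a HuggingFace-style decoder layer:
# layer = model.model.layers[7]
# handle = layer.register_forward_hook(make_srmu_style_hook(v, imp, alpha=4.0))
```

Setting `importance` to all-ones would recover an indiscriminate steering edit; the per-feature gating is what is meant to spare entangled retain-set features.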
Related papers
- U-CAN: Utility-Aware Contrastive Attenuation for Efficient Unlearning in Generative Recommendation [9.680511155102623]
We propose Utility-Aware Contrastive AttenuatioN (U-CAN), a precision unlearning framework that operates on low-rank adapters. U-CAN quantifies risk by contrasting activations, focusing on neurons with asymmetric responses that are highly sensitive to the forgetting set but suppressed on the retention set. Unlike binary pruning, which often fragments network structure, U-CAN applies adaptive soft attenuation with a differentiable decay function.
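As a rough illustration of the mechanism sketched above, contrastive risk scoring over forget/retain activations followed by a smooth, differentiable gain instead of binary pruning, here is a minimal sketch; the risk statistic, the sigmoid gate, and all names are assumptions rather than U-CAN's published formulation.

```python
import torch

def soft_attenuation(weight: torch.Tensor,
                     act_forget: torch.Tensor,
                     act_retain: torch.Tensor,
                     temperature: float = 1.0) -> torch.Tensor:
    """Attenuate neurons with asymmetric responses instead of pruning.

    weight:     (n_neurons, d_in) rows of a linear layer, one per neuron.
    act_forget: (n_forget_samples, n_neurons) activations on the forget set.
    act_retain: (n_retain_samples, n_neurons) activations on the retain set.
    """
    # Contrastive risk: high on the forget set, suppressed on the retain set.
    risk = act_forget.abs().mean(dim=0) - act_retain.abs().mean(dim=0)
    # Differentiable decay: gain falls smoothly from ~1 for strongly
    # retain-selective neurons toward ~0 for strongly forget-selective
    # ones, avoiding the structural fragmentation of hard pruning.
    gain = torch.sigmoid(-risk / temperature)
    return weight * gain.unsqueeze(-1)
```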
arXiv Detail & Related papers (2026-02-26T07:36:11Z)
- MeGU: Machine-Guided Unlearning with Target Feature Disentanglement [73.49657372882082]
We propose a novel framework that guides unlearning through concept-aware re-alignment. MeGU enables controlled and selective forgetting, effectively mitigating both under-unlearning and over-unlearning.
arXiv Detail & Related papers (2026-02-19T05:20:31Z)
- Safe Reinforcement Learning via Recovery-based Shielding with Gaussian Process Dynamics Models [57.006252510102506]
Reinforcement learning (RL) is a powerful framework for optimal decision-making and control but often lacks provable guarantees for safety-critical applications. We introduce a novel recovery-based shielding framework that enables safe RL with a provable safety lower bound for unknown and non-linear continuous dynamical systems.
arXiv Detail & Related papers (2026-02-12T22:03:35Z)
- Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models [62.16655896700062]
Activation steering is a technique to enhance the utility of Large Language Models (LLMs). We show that it unintentionally introduces critical and under-explored safety risks. Experiments reveal that these interventions act as a force multiplier, creating new vulnerabilities to jailbreaks and increasing attack success rates to over 80% on standard benchmarks.
arXiv Detail & Related papers (2026-02-03T12:32:35Z)
- Mitigating Safety Tax via Distribution-Grounded Refinement in Large Reasoning Models [63.368505631152594]
Safety alignment incurs a safety tax that perturbs a large reasoning model's (LRM) general reasoning ability. Existing datasets used for safety alignment of an LRM are usually constructed by distilling safety reasoning traces and answers from an external LRM or a human labeler. We propose a safety alignment dataset construction method, dubbed DGR, which transforms and refines an existing out-of-distribution safety reasoning dataset to align it with the target LLM's inner distribution.
arXiv Detail & Related papers (2026-02-02T14:18:48Z)
- SafeRedir: Prompt Embedding Redirection for Robust Unlearning in Image Generation Models [67.84174763413178]
We introduce SafeRedir, a lightweight inference-time framework for robust unlearning via prompt embedding redirection. We show that SafeRedir achieves effective unlearning, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks.
arXiv Detail & Related papers (2026-01-13T15:01:38Z)
- FROC: A Unified Framework with Risk-Optimized Control for Machine Unlearning in LLMs [28.687949604557986]
We propose FROC, a unified framework with Risk-Optimized Control for machine unlearning in large language models (LLMs). FROC is built around a conformal-style risk-control formulation that expresses a user-specified risk budget on unlearning behavior. Experiments across multiple LLM machine unlearning (MU) methods demonstrate that FROC produces stable, interpretable risk landscapes.
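The "conformal-style risk-control formulation" can be illustrated with a generic split-conformal calibration loop that escalates unlearning strength only while a held-out risk estimate stays within the user's budget; this is an illustrative sketch under assumed names (`risk_fn`, the strength grid, the (k+1)/(n+1) correction), not FROC's actual procedure.

```python
def calibrate_strength(strengths, risk_fn, calibration_set, budget: float):
    """Pick the strongest unlearning intervention whose risk bound
    stays under a user-specified budget.

    risk_fn(strength, example) -> 0/1 loss, e.g. 1 if the edited model
    fails a retain-set utility probe at that strength (illustrative).
    """
    n = len(calibration_set)
    chosen = None
    for s in sorted(strengths):  # weakest to strongest intervention
        losses = [risk_fn(s, x) for x in calibration_set]
        # Finite-sample upper bound on risk, split-conformal style.
        risk_bound = (sum(losses) + 1) / (n + 1)
        if risk_bound <= budget:
            chosen = s  # still within budget; keep escalating
        else:
            break  # risk is assumed monotone in strength
    return chosen
```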
arXiv Detail & Related papers (2025-12-15T13:53:12Z)
- EReLiFM: Evidential Reliability-Aware Residual Flow Meta-Learning for Open-Set Domain Generalization under Noisy Labels [85.78886153628663]
Open-Set Domain Generalization aims to enable deep learning models to recognize unseen categories in new domains. Label noise hinders open-set domain generalization by corrupting source-domain knowledge. We propose Evidential Reliability-Aware Residual Flow Meta-Learning (EReLiFM) to bridge domain gaps.
arXiv Detail & Related papers (2025-10-14T16:23:11Z)
- OFMU: Optimization-Driven Framework for Machine Unlearning [5.100622189286672]
Large language models increasingly require the ability to unlearn specific knowledge, such as user requests, copyrighted materials, or outdated information. We propose OFMU, a penalty-based bi-level optimization framework that explicitly prioritizes forgetting while preserving retention. We show that OFMU consistently outperforms existing unlearning methods in both efficacy and retained utility.
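For intuition about a "penalty-based bi-level" objective that prioritizes forgetting, the toy update below takes an inner gradient-ascent step on the forget loss and then an outer step on a penalized single-level surrogate; the step sizes, penalty weight, and objective shape are guesses for illustration, not OFMU's published algorithm.

```python
import torch

def penalized_bilevel_step(model, forget_batch, retain_batch,
                           loss_fn, opt,
                           penalty: float = 10.0, inner_lr: float = 1e-4):
    """One illustrative penalty-based bi-level unlearning update."""
    x_f, y_f = forget_batch
    x_r, y_r = retain_batch
    params = [p for p in model.parameters() if p.requires_grad]

    # Inner level (prioritized): one ascent step on the forget loss.
    f_loss = loss_fn(model(x_f), y_f)
    grads = torch.autograd.grad(f_loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.add_(inner_lr * g)  # ascent pushes the forget loss up

    # Outer level: penalized surrogate trading continued forgetting
    # against drift on the retain set.
    opt.zero_grad()
    surrogate = penalty * loss_fn(model(x_r), y_r) - loss_fn(model(x_f), y_f)
    surrogate.backward()
    opt.step()
```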
arXiv Detail & Related papers (2025-09-26T15:31:32Z)
- Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning [19.823784666021822]
ACTOR minimizes over-refusals by leveraging internal activation patterns from diverse queries. ACTOR precisely identifies and adjusts the activation components that trigger refusals, providing stronger control over the refusal mechanism.
arXiv Detail & Related papers (2025-07-06T05:47:04Z)
- Invariance Makes LLM Unlearning Resilient Even to Unanticipated Downstream Fine-Tuning [46.170140576473365]
Machine unlearning offers a promising solution to privacy and safety concerns in large language models (LLMs). We introduce invariance into unlearning for the first time, inspired by invariant risk minimization (IRM). We propose invariant LLM unlearning (ILU), a regularization-based framework that enhances robustness.
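For context on the IRM inspiration, the standard IRMv1 penalty is the squared gradient of each environment's risk with respect to a dummy classifier scale, which is small exactly when one classifier is simultaneously near-optimal across environments; the sketch below shows that classic penalty (Arjovsky et al.), not ILU's specific regularizer.

```python
import torch
import torch.nn.functional as F

def irmv1_penalty(logits: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Classic IRMv1 invariance penalty for one environment."""
    scale = torch.ones(1, requires_grad=True, device=logits.device)
    loss = F.cross_entropy(logits * scale, y)
    (grad,) = torch.autograd.grad(loss, [scale], create_graph=True)
    return grad.pow(2).sum()

# Invariance-regularized objective over environments (sketch):
# total = sum(F.cross_entropy(model(x_e), y_e)
#             + lam * irmv1_penalty(model(x_e), y_e)
#             for (x_e, y_e) in envs)
```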
arXiv Detail & Related papers (2025-06-02T05:38:43Z)
- Transferable Adversarial Attacks on SAM and Its Downstream Models [87.23908485521439]
This paper explores the feasibility of adversarial attacks on various downstream models fine-tuned from the Segment Anything Model (SAM). To enhance the effectiveness of adversarial attacks against models fine-tuned on unknown datasets, we propose a universal meta-initialization (UMI) algorithm.
arXiv Detail & Related papers (2024-10-26T15:04:04Z)
- Uncertainty-Aware Source-Free Adaptive Image Super-Resolution with Wavelet Augmentation Transformer [60.31021888394358]
Unsupervised Domain Adaptation (UDA) can effectively address domain gap issues in real-world image Super-Resolution (SR). We propose a SOurce-free Domain Adaptation framework for image SR (SODA-SR) to address this issue, i.e., adapting a source-trained model to a target domain with only unlabeled target data.
arXiv Detail & Related papers (2023-03-31T03:14:44Z)