From Logits to Latents: Contrastive Representation Shaping for LLM Unlearning
- URL: http://arxiv.org/abs/2601.22028v1
- Date: Thu, 29 Jan 2026 17:34:37 GMT
- Title: From Logits to Latents: Contrastive Representation Shaping for LLM Unlearning
- Authors: Haoran Tang, Rajiv Khanna
- Abstract summary: We introduce CLReg, a representation regularizer that identifies forget features while pushing them away from retain features. We provide the first theoretical insights relating representation shaping to entanglement reduction. CLReg decreases forget-retain representation entanglement, which facilitates mainstream unlearning methods without posing extra privacy risks.
- Score: 13.726373414710137
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most LLM unlearning methods aim to approximate retrain-from-scratch behavior with minimal distribution shift, often via alignment-style objectives defined in the prediction space. While effective at reducing generation of forgotten content, such approaches may act as suppression: forgotten concepts can persist in representations and remain entangled with retained knowledge. We introduce CLReg, a contrastive representation regularizer that identifies forget features and pushes them away from retain features, explicitly reducing forget-retain interference with minimal shifts to retain features. We provide the first theoretical insights relating representation shaping to entanglement reduction. Across unlearning benchmarks and LLMs of different sizes, CLReg decreases forget-retain representation entanglement, which facilitates mainstream unlearning methods without posing extra privacy risks and motivates future work that reshapes the representation space to remove forget concepts.
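To make the mechanism concrete, below is a minimal sketch of what a contrastive representation regularizer in this spirit could look like. The loss form (a log-sum-exp separation term between forget and retain features, plus a cosine anchor that limits drift on retain features) and all function and parameter names are illustrative assumptions, not the paper's CLReg implementation.

```python
# Illustrative sketch only: the loss form and all names below are assumptions,
# not the authors' CLReg implementation.
import torch
import torch.nn.functional as F

def contrastive_unlearning_reg(h_forget, h_retain, h_retain_ref, tau=0.1, lam=1.0):
    """Push forget representations away from retain representations while
    anchoring retain representations to their pre-unlearning values.

    h_forget:     (Nf, d) hidden states on forget-set inputs (current model)
    h_retain:     (Nr, d) hidden states on retain-set inputs (current model)
    h_retain_ref: (Nr, d) hidden states on retain-set inputs (frozen reference)
    """
    h_forget = F.normalize(h_forget, dim=-1)
    h_retain = F.normalize(h_retain, dim=-1)
    h_retain_ref = F.normalize(h_retain_ref, dim=-1)

    # Separation term: penalize cosine similarity between every forget
    # feature and its most similar retain features (soft-max via logsumexp).
    sim = h_forget @ h_retain.T / tau               # (Nf, Nr)
    separation = torch.logsumexp(sim, dim=-1).mean()

    # Anchor term: keep retain features close to the reference model's,
    # i.e. "minimal shifts on retain features".
    anchor = 1.0 - (h_retain * h_retain_ref).sum(-1).mean()

    return separation + lam * anchor
```

The separation term discourages forget features from lying near any retain feature (reducing forget-retain entanglement), while the anchor term bounds representation drift on the retain set; added to a mainstream unlearning objective, such a regularizer would act in latent space rather than on logits.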
Related papers
- MeGU: Machine-Guided Unlearning with Target Feature Disentanglement [73.49657372882082]
We propose a novel framework that guides unlearning through concept-aware re-alignment. MeGU enables controlled and selective forgetting, effectively mitigating both under-unlearning and over-unlearning.
arXiv Detail & Related papers (2026-02-19T05:20:31Z) - ReLAPSe: Reinforcement-Learning-trained Adversarial Prompt Search for Erased concepts in unlearned diffusion models [12.021923446217722]
Machine unlearning is a key defense mechanism for removing unauthorized concepts from text-to-image diffusion models. Existing adversarial approaches for exploiting this leakage are constrained by fundamental limitations. We introduce ReLAPSe, a policy-based adversarial framework that reformulates concept restoration as a reinforcement learning problem.
arXiv Detail & Related papers (2026-01-30T21:56:50Z) - CAPTAIN: Semantic Feature Injection for Memorization Mitigation in Text-to-Image Diffusion Models [60.610268549138375]
Diffusion models can unintentionally reproduce training examples, raising privacy and copyright concerns. We introduce CAPTAIN, a training-free framework that mitigates memorization by directly modifying latent features during denoising.
arXiv Detail & Related papers (2025-12-11T14:01:47Z) - Unconsciously Forget: Mitigating Memorization; Without Knowing What is being Memorized [41.5028352241977]
Memorizing training data can lead to legal challenges, including copyright infringement, violations of portrait rights, and trademark violations. Our work demonstrates that specific parts of the model are responsible for copyrighted content generation. By applying model pruning, we can effectively suppress the probability of generating copyrighted content without targeting specific concepts.
arXiv Detail & Related papers (2025-12-10T14:36:12Z) - Sparse Attention Post-Training for Mechanistic Interpretability [55.030850996535776]
We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 1B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to $\approx 0.3\%$ of its edges.
arXiv Detail & Related papers (2025-12-05T16:40:08Z) - Towards Benign Memory Forgetting for Selective Multimodal Large Language Model Unlearning [49.274436951541425]
Multimodal Large Language Models (MLLMs) achieve remarkable capabilities but can inadvertently memorize privacy-sensitive information. Existing unlearning methods fail to achieve benign forgetting because they often degrade the model's general image understanding performance. We propose the Sculpted Memory Forgetting Adapter (SMFA), which confines forgetting to targeted memory regions while preserving overall capabilities.
arXiv Detail & Related papers (2025-11-25T11:22:45Z) - LLM Unlearning on Noisy Forget Sets: A Study of Incomplete, Rewritten, and Watermarked Data [69.5099112089508]
Large language models (LLMs) exhibit remarkable generative capabilities but raise ethical and security concerns by memorizing sensitive data. This work presents the first study of unlearning under perturbed or low-fidelity forget data, referred to as noisy forget sets. We find that unlearning remains surprisingly robust to perturbations, provided that core semantic signals are preserved.
arXiv Detail & Related papers (2025-10-10T05:10:49Z) - Interpretable Few-Shot Image Classification via Prototypical Concept-Guided Mixture of LoRA Experts [79.18608192761512]
Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to make their visual recognition processes more interpretable. We propose a Few-Shot Prototypical Concept Classification framework that mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment. Our approach consistently outperforms existing SEMs by a notable margin, with 4.2%-8.7% relative gains in 5-way 5-shot classification.
arXiv Detail & Related papers (2025-06-05T06:39:43Z) - Chroma-VAE: Mitigating Shortcut Learning with Generative Classifiers [44.97660597940641]
We show that generative models alone are not sufficient to prevent shortcut learning.
In particular, we propose Chroma-VAE, a two-pronged approach where a VAE is initially trained to isolate the shortcut in a small latent subspace.
In addition to demonstrating the efficacy of Chroma-VAE on benchmark and real-world shortcut learning tasks, our work highlights the potential for manipulating the latent space of generative classifiers to isolate or interpret specific correlations.
arXiv Detail & Related papers (2022-11-28T11:27:50Z) - Remembering for the Right Reasons: Explanations Reduce Catastrophic Forgetting [100.75479161884935]
We propose a novel training paradigm called Remembering for the Right Reasons (RRR)
RRR stores visual model explanations for each example in the buffer and ensures the model has "the right reasons" for its predictions.
We demonstrate how RRR can be easily added to any memory or regularization-based approach and results in reduced forgetting.
arXiv Detail & Related papers (2020-10-04T10:05:27Z)