Does Multimodal Large Language Model Truly Unlearn? Stealthy MLLM Unlearning Attack
- URL: http://arxiv.org/abs/2506.17265v1
- Date: Tue, 10 Jun 2025 04:52:03 GMT
- Title: Does Multimodal Large Language Model Truly Unlearn? Stealthy MLLM Unlearning Attack
- Authors: Xianren Zhang, Hui Liu, Delvin Ce Zhang, Xianfeng Tang, Qi He, Dongwon Lee, Suhang Wang
- Abstract summary: Multimodal Large Language Models (MLLMs) trained on massive data may memorize sensitive personal information and photos, posing serious privacy risks. MLLM unlearning methods are proposed, which fine-tune MLLMs to ``forget'' sensitive information. We study a novel problem of LLM unlearning attack, which aims to recover the unlearned knowledge of an unlearned LLM.
- Score: 39.31635005360959
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) trained on massive data may memorize sensitive personal information and photos, posing serious privacy risks. To mitigate this, MLLM unlearning methods are proposed, which fine-tune MLLMs to ``forget'' sensitive information. However, it remains unclear whether the knowledge has been truly forgotten or merely hidden in the model. Therefore, we propose to study a novel problem of LLM unlearning attack, which aims to recover the unlearned knowledge of an unlearned LLM. To achieve this goal, we propose Stealthy Unlearning Attack (SUA), a framework that learns a universal noise pattern. When applied to input images, this noise can trigger the model to reveal unlearned content. While pixel-level perturbations may be visually subtle, they can be detected in the semantic embedding space, making such attacks vulnerable to potential defenses. To improve stealthiness, we introduce an embedding alignment loss that minimizes the difference between the perturbed and denoised image embeddings, ensuring the attack is semantically unnoticeable. Experimental results show that SUA can effectively recover unlearned information from MLLMs. Furthermore, the learned noise generalizes well: a single perturbation trained on a subset of samples can reveal forgotten content in unseen images. This indicates that knowledge reappearance is not an occasional failure, but a consistent behavior.
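To make the attack recipe concrete, the sketch below shows the overall shape of such an optimization: a single noise tensor shared across images is trained to encourage leakage of forgotten content, while an embedding alignment term keeps each perturbed image close, in embedding space, to its denoised counterpart. The callables `image_encoder`, `answer_nll`, and `denoise` are hypothetical placeholders for the victim MLLM's vision encoder, a loss that is low when the model emits the forgotten answer, and an off-the-shelf denoiser; the paper's exact losses, schedule, and constraints may differ.

```python
import torch

def learn_universal_noise(images, image_encoder, answer_nll, denoise,
                          steps=200, lr=1e-2, eps=8 / 255, lam=1.0):
    """Learn one perturbation shared across all images in the forget set."""
    delta = torch.zeros_like(images[0], requires_grad=True)  # universal noise
    opt = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        loss = torch.zeros(())
        for x in images:
            x_adv = (x + delta).clamp(0, 1)
            # (1) recovery term: low when the MLLM reveals the unlearned answer
            recover = answer_nll(x_adv)
            # (2) embedding alignment term: keep the perturbed embedding close
            #     to that of its denoised counterpart, for stealthiness
            align = torch.nn.functional.mse_loss(
                image_encoder(x_adv), image_encoder(denoise(x_adv)))
            loss = loss + recover + lam * align
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():            # keep the noise visually subtle
            delta.clamp_(-eps, eps)
    return delta.detach()
```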
Related papers
- Recalling The Forgotten Class Memberships: Unlearned Models Can Be Noisy Labelers to Leak Privacy [13.702759117522447]
Current research on Machine Unlearning (MU) attacks is limited and requires access to the original models containing the private data. We propose an innovative study on recalling the forgotten class memberships from unlearned models without requiring access to the original one. Our study and evaluation establish a benchmark for future research on MU vulnerabilities.
arXiv Detail & Related papers (2025-06-24T10:21:10Z)
- Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation [88.78166077081912]
We introduce a multimodal unlearning benchmark, UnLOK-VQA, and an attack-and-defense framework to evaluate methods for deleting specific multimodal knowledge from MLLMs. Our results show multimodal attacks outperform text- or image-only ones, and that the most effective defense removes answer information from internal model states.
arXiv Detail & Related papers (2025-05-01T01:54:00Z)
- Extracting Unlearned Information from LLMs with Activation Steering [46.16882599881247]
Unlearning has emerged as a solution to remove sensitive knowledge from models after training.
We propose activation steering as a method for exact information retrieval from unlearned models.
Our results demonstrate that exact information retrieval from unlearned models is possible, highlighting a severe vulnerability of current unlearning techniques.
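As an illustration of this line of attack, the sketch below adds a steering direction to a transformer layer's hidden states via a standard PyTorch forward hook; such a steering vector might be built, for example, from the difference of mean activations between prompts that do and do not elicit the target information. The hook-based layer patching and the `alpha` scale are assumptions of this sketch rather than the paper's exact procedure.

```python
import torch

def add_steering_hook(layer: torch.nn.Module, steering_vector: torch.Tensor,
                      alpha: float = 4.0):
    """Shift a layer's output along a steering direction during generation.

    Assumes the layer returns hidden states either directly or as the first
    element of a tuple (typical for decoder blocks); adapt for other models.
    """
    def hook(module, inputs, output):
        if isinstance(output, tuple):
            steered = output[0] + alpha * steering_vector.to(output[0].dtype)
            return (steered,) + output[1:]
        return output + alpha * steering_vector.to(output.dtype)

    return layer.register_forward_hook(hook)  # call .remove() to undo
```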
arXiv Detail & Related papers (2024-11-04T21:42:56Z)
- Catastrophic Failure of LLM Unlearning via Quantization [36.524827594501495]
We show that applying quantization to models that have undergone unlearning can restore the "forgotten" information. We find that for unlearning methods with utility constraints, the unlearned model retains an average of 21% of the intended forgotten knowledge in full precision.
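A minimal way to probe this effect, assuming a Hugging Face-style model and tokenizer pair, is to quantize the unlearned model and check whether forget-set answers that the full-precision model withholds reappear afterwards. Dynamic int8 quantization of linear layers is used here only for illustration; the paper studies other quantization schemes and stricter match metrics.

```python
import torch

def knowledge_reappearance_rate(model, tokenizer, forget_prompts, answers):
    """Fraction of forget-set prompts whose answer is emitted only after
    the unlearned model is quantized (dynamic int8 quantization of Linear
    layers with stock PyTorch)."""
    quantized = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8)

    def complete(m, prompt):
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        out = m.generate(ids, max_new_tokens=32, do_sample=False)
        return tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

    leaked = 0
    for prompt, answer in zip(forget_prompts, answers):
        full_precision = complete(model, prompt).lower()
        low_precision = complete(quantized, prompt).lower()
        # "reappearance": only the quantized model reveals the answer
        if answer.lower() in low_precision and answer.lower() not in full_precision:
            leaked += 1
    return leaked / max(len(forget_prompts), 1)
```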
arXiv Detail & Related papers (2024-10-21T19:28:37Z)
- A Closer Look at Machine Unlearning for Large Language Models [46.245404272612795]
Large language models (LLMs) may memorize sensitive or copyrighted content, raising privacy and legal concerns. We discuss several issues in machine unlearning for LLMs and provide our insights on possible approaches.
arXiv Detail & Related papers (2024-10-10T16:56:05Z)
- Multimodal Unlearnable Examples: Protecting Data against Multimodal Contrastive Learning [53.766434746801366]
Multimodal contrastive learning (MCL) has shown remarkable advances in zero-shot classification by learning from millions of image-caption pairs crawled from the Internet.
Hackers may exploit image-text data for model training without authorization, potentially including personal and privacy-sensitive information.
Recent works propose generating unlearnable examples by adding imperceptible perturbations to training images to build shortcuts for protection.
We propose Multi-step Error Minimization (MEM), a novel optimization process for generating multimodal unlearnable examples.
arXiv Detail & Related papers (2024-07-23T09:00:52Z)
- From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks [85.84979847888157]
Large Language Models (LLMs) are known to be vulnerable to jailbreak attacks. LLMs can implicitly unlearn harmful knowledge that was not explicitly introduced during the unlearning phase. We empirically validate this phenomenon, which enables unlearning-based methods to decrease the Attack Success Rate.
arXiv Detail & Related papers (2024-07-03T07:14:05Z)
- Unlearning or Obfuscating? Jogging the Memory of Unlearned LLMs via Benign Relearning [37.061187080745654]
We show that existing approaches for unlearning in LLMs are surprisingly susceptible to a simple set of benign relearning attacks. With access to only a small and potentially loosely related set of data, we find that we can ``jog'' the memory of unlearned models to reverse the effects of unlearning.
arXiv Detail & Related papers (2024-06-19T09:03:21Z)
- Offset Unlearning for Large Language Models [49.851093293780615]
delta-Unlearning is an offset unlearning framework for black-box LLMs. We show that delta-Unlearning can effectively unlearn target data while maintaining similar or even stronger performance on general out-of-forget-scope tasks.
arXiv Detail & Related papers (2024-04-17T03:39:51Z)
- UNDIAL: Self-Distillation with Adjusted Logits for Robust Unlearning in Large Language Models [12.45822383965784]
We introduce UnDIAL (Unlearning via Self-Distillation on Adjusted Logits), a novel and robust unlearning method.
Our approach leverages self-distillation to adjust logits and selectively reduce the influence of targeted tokens.
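A hedged sketch of that idea in PyTorch is shown below: the teacher's logits for the tokens to be forgotten are pushed down by a margin before the distillation target is formed, so the student is trained to de-emphasize exactly those tokens. The `gamma` margin and the KL formulation are assumptions of this sketch; UNDIAL's precise objective may differ.

```python
import torch
import torch.nn.functional as F

def adjusted_logit_distillation_loss(student_logits: torch.Tensor,
                                     teacher_logits: torch.Tensor,
                                     target_ids: torch.Tensor,
                                     gamma: float = 10.0) -> torch.Tensor:
    """Distill toward a teacher distribution whose logits for the tokens to
    forget (target_ids) are reduced by gamma.

    Shapes: logits are (batch, seq, vocab); target_ids is (batch, seq).
    """
    adjusted = teacher_logits.detach().clone()
    idx = target_ids.unsqueeze(-1)                             # (batch, seq, 1)
    adjusted.scatter_(-1, idx, adjusted.gather(-1, idx) - gamma)

    teacher_probs = F.softmax(adjusted, dim=-1).flatten(0, 1)  # (batch*seq, vocab)
    log_student = F.log_softmax(student_logits, dim=-1).flatten(0, 1)
    # KL(adjusted teacher || student), averaged over all token positions
    return F.kl_div(log_student, teacher_probs, reduction="batchmean")
```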
arXiv Detail & Related papers (2024-02-15T16:21:14Z)
- Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning [67.0609518552321]
We propose to conduct Machine Vision Therapy, which aims to rectify noisy predictions from vision models.
By fine-tuning with the denoised labels, model performance can be boosted in an unsupervised manner.
arXiv Detail & Related papers (2023-12-05T07:29:14Z)