Related papers: Visual-Guided Key-Token Regularization for Multimodal Large Language Model Unlearning

Visual-Guided Key-Token Regularization for Multimodal Large Language Model Unlearning

URL: http://arxiv.org/abs/2601.22020v1
Date: Thu, 29 Jan 2026 17:26:54 GMT
Title: Visual-Guided Key-Token Regularization for Multimodal Large Language Model Unlearning
Authors: Chengyi Cai, Zesheng Ye, Peike Li, Bo Han, Jianzhong Qi, Feng Liu,
Abstract summary: We propose Visual-Guided Key-Token Regularization (ViKeR)<n>We leverage irrelevant visual inputs to predict ideal post-unlearning token-level distributions.<n>Our method effectively performs unlearning while mitigating forgetting and maintaining response coherence.
Score: 39.211611292654176
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Unlearning in Multimodal Large Language Models (MLLMs) prevents the model from revealing private information when queried about target images. Existing MLLM unlearning methods largely adopt approaches developed for LLMs. They treat all answer tokens uniformly, disregarding their varying importance in the unlearning process. Moreover, these methods focus exclusively on the language modality, disregarding visual cues that indicate key tokens in answers. In this paper, after formulating the problem of unlearning in multimodal question answering for MLLMs, we propose Visual-Guided Key-Token Regularization (ViKeR). We leverage irrelevant visual inputs to predict ideal post-unlearning token-level distributions and use these distributions to regularize the unlearning process, thereby prioritizing key tokens. Further, we define key tokens in unlearning via information entropy and discuss ViKeR's effectiveness through token-level gradient reweighting, which amplifies updates on key tokens. Experiments on MLLMU and CLEAR benchmarks demonstrate that our method effectively performs unlearning while mitigating forgetting and maintaining response coherence.

Related papers

WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens [69.97021957331326]
We propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization.<n>We also introduce a VAE branch with linear projection to recover fine-grained image details.
arXiv Detail & Related papers (2025-12-02T09:02:20Z)
Cross-Modal Attention Guided Unlearning in Vision-Language Models [16.460281156521646]
Vision-Language Models (VLMs) have demonstrated immense capabilities in multi-modal understanding and inference tasks.<n>VLMs add a layer of complexity to this process, as the visual context in the query may also contain sensitive information in addition to the text.<n>We formulate Cross-Modal Attention Guided Unlearning (CAGUL), a lightweight and efficient VLM unlearning framework.
arXiv Detail & Related papers (2025-10-08T21:21:59Z)
Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings [66.04061083611863]
Excessive use of visual tokens in existing Multimoal Large Language Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively expensive computation.<n>We propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE)<n>DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer.
arXiv Detail & Related papers (2024-11-29T11:24:23Z)
FoPru: Focal Pruning for Efficient Large Vision-Language Models [11.36025001578531]
We propose Focal Pruning (FoPru), a training-free method that prunes visual tokens based on the attention-based token significance derived from the vision encoder. Our method can prune a large number of redundant tokens while maintaining high accuracy, leading to significant improvements in inference efficiency.
arXiv Detail & Related papers (2024-11-21T14:22:38Z)
Model Tells Itself Where to Attend: Faithfulness Meets Automatic Attention Steering [108.2131720470005]
Large language models (LLMs) have demonstrated remarkable performance across various real-world tasks. They often struggle to fully comprehend and effectively utilize their input contexts, resulting in responses that are unfaithful or hallucinated. We propose AutoPASTA, a method that automatically identifies key contextual information and explicitly highlights it by steering an LLM's attention scores.
arXiv Detail & Related papers (2024-09-16T23:52:41Z)
ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models [73.34709921061928]
We propose a training-free method to inject visual prompts into Multimodal Large Language Models (MLLMs)<n>We optimize a learnable latent variable based on an energy function, enhancing the strength of referring regions in the attention map.<n>Our method offers a promising direction for integrating referring abilities into MLLMs, and supports referring with box, mask, scribble and point.
arXiv Detail & Related papers (2024-07-31T11:40:29Z)
Soft Prompting for Unlearning in Large Language Models [11.504012974208466]
This work focuses on investigating machine unlearning for Large Language Models motivated by data protection regulations. We propose a framework textbfSoft textbfPrompting for textbfUntextbflearning (SPUL) We conduct a rigorous evaluation of the proposed method and our results indicate that SPUL can significantly improve the trade-off between utility and forgetting.
arXiv Detail & Related papers (2024-06-17T19:11:40Z)
Leveraging per Image-Token Consistency for Vision-Language Pre-training [52.825150269820696]
Cross-modal masked language modeling (CMLM) is insufficient for vision-language pre-training. We propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training) The proposed EPIC method is easily combined with pre-training methods.
arXiv Detail & Related papers (2022-11-20T12:10:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.