PEBench: A Fictitious Dataset to Benchmark Machine Unlearning for Multimodal Large Language Models
- URL: http://arxiv.org/abs/2503.12545v2
- Date: Tue, 22 Jul 2025 08:49:12 GMT
- Title: PEBench: A Fictitious Dataset to Benchmark Machine Unlearning for Multimodal Large Language Models
- Authors: Zhaopan Xu, Pengfei Zhou, Weidong Tang, Jiaxin Ai, Wangbo Zhao, Kai Wang, Xiaojiang Peng, Wenqi Shao, Hongxun Yao, Kaipeng Zhang
- Abstract summary: Multimodal large language models (MLLMs) have achieved remarkable success in vision-language tasks. Their reliance on vast, internet-sourced data raises significant privacy and security concerns. Machine unlearning (MU) has emerged as a critical technique to address these issues. PEBench is a novel benchmark designed to facilitate a thorough assessment of MU in MLLMs.
- Score: 27.338242898495448
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal large language models (MLLMs) have achieved remarkable success in vision-language tasks, but their reliance on vast, internet-sourced data raises significant privacy and security concerns. Machine unlearning (MU) has emerged as a critical technique to address these issues, enabling the selective removal of targeted information from pre-trained models without costly retraining. However, the evaluation of MU for MLLMs remains inadequate. Existing benchmarks often lack a comprehensive scope, focusing narrowly on entities while overlooking the unlearning of broader visual concepts and the inherent semantic coupling between them. To bridge this gap, we introduce PEBench, a novel benchmark designed to facilitate a thorough assessment of MU in MLLMs. PEBench features a fictitious dataset of personal entities and corresponding event scenes to evaluate unlearning across these distinct yet entangled concepts. We leverage this benchmark to evaluate five MU methods, revealing their unique strengths and weaknesses. Our findings show that unlearning one concept can unintentionally degrade performance on related concepts within the same image, a challenge we term cross-concept interference. Furthermore, we demonstrate the difficulty of unlearning person and event concepts simultaneously and propose an effective method to mitigate these conflicting objectives. The source code and benchmark are publicly available at https://pebench.github.io.
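The cross-concept interference described in the abstract can be pictured as a simple before/after comparison: after unlearning a target concept (e.g., a person), measure how much accuracy the model loses on a retained concept (e.g., the event) that appears in the same images. The sketch below is a minimal, hypothetical illustration of that measurement; the `Sample` fields, model interface, and metric names are assumptions for exposition, not the actual PEBench code or API.

```python
# Hypothetical sketch of a cross-concept interference check (not the official PEBench code).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    """One fictitious image probed for two entangled concepts (illustrative schema)."""
    image_id: str
    person_question: str   # question about the person concept targeted for unlearning
    person_answer: str
    event_question: str    # question about the co-occurring event concept to be retained
    event_answer: str


def accuracy(model: Callable[[str, str], str], samples: List[Sample], concept: str) -> float:
    """Fraction of questions about `concept` that the model answers correctly."""
    correct = 0
    for s in samples:
        if concept == "person":
            q, a = s.person_question, s.person_answer
        else:
            q, a = s.event_question, s.event_answer
        correct += int(model(s.image_id, q).strip().lower() == a.strip().lower())
    return correct / max(len(samples), 1)


def cross_concept_interference(model_before, model_after, forget_set: List[Sample]) -> dict:
    """Low forget accuracy indicates successful unlearning of the person concept;
    a large drop from retain_event_acc_before to retain_event_acc_after indicates
    cross-concept interference on the entangled event concept."""
    return {
        "forget_person_acc": accuracy(model_after, forget_set, "person"),
        "retain_event_acc_before": accuracy(model_before, forget_set, "event"),
        "retain_event_acc_after": accuracy(model_after, forget_set, "event"),
    }
```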
Related papers
- Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting [70.83781268763215]
Vision-language models (VLMs) have achieved impressive performance across diverse multimodal tasks by leveraging large-scale pre-training. VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. This survey aims to serve as a comprehensive and diagnostic reference for researchers developing lifelong vision-language systems.
arXiv Detail & Related papers (2025-08-06T09:03:10Z) - Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation [88.78166077081912]
We introduce a multimodal unlearning benchmark, UnLOK-VQA, and an attack-and-defense framework to evaluate methods for deleting specific multimodal knowledge from MLLMs. Our results show multimodal attacks outperform text- or image-only ones, and that the most effective defense removes answer information from internal model states.
arXiv Detail & Related papers (2025-05-01T01:54:00Z) - Survey of Adversarial Robustness in Multimodal Large Language Models [17.926240920647892]
Multimodal Large Language Models (MLLMs) have demonstrated exceptional performance in artificial intelligence.
Their deployment in real-world applications raises significant concerns about adversarial vulnerabilities.
This paper reviews the adversarial robustness of MLLMs, covering different modalities.
arXiv Detail & Related papers (2025-03-18T06:54:59Z) - Grounded Chain-of-Thought for Multimodal Large Language Models [66.04061083611863]
We propose a new learning task for multimodal large language models (MLLMs) called Grounded Chain-of-Thought (GCoT).
GCoT aims to help MLLMs recognize and ground the relevant visual cues step by step, thereby predicting the correct answer with grounding coordinates as the intuitive basis.
To facilitate this task, we also carefully design and construct a dataset called multimodal grounded chain-of-thought (MM-GCoT) consisting of 24,022 GCoT examples for 5,033 images.
arXiv Detail & Related papers (2025-03-17T04:07:47Z) - SAUCE: Selective Concept Unlearning in Vision-Language Models with Sparse Autoencoders [16.551943721248108]
We introduce SAUCE, a novel method for fine-grained and selective concept unlearning in vision-language models. It first trains SAEs to capture high-dimensional, semantically rich sparse features. It then identifies the features most relevant to the target concept for unlearning. During inference, it selectively modifies these features to suppress specific concepts while preserving unrelated information.
arXiv Detail & Related papers (2025-03-16T17:32:23Z) - VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity [34.29409506366145]
VERIFY is a benchmark designed to isolate and rigorously evaluate the visual reasoning capabilities of state-of-the-art MLLMs. Each problem is accompanied by a human-annotated reasoning path, making it the first to provide in-depth evaluation of model decision-making processes. We propose novel metrics that assess visual reasoning fidelity beyond mere accuracy, highlighting critical imbalances in current model reasoning patterns.
arXiv Detail & Related papers (2025-03-14T16:26:11Z) - EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents [63.43699771428243]
EmbodiedBench is an extensive benchmark designed to evaluate vision-driven embodied agents.
We evaluated 19 leading proprietary and open-source MLLMs within EmbodiedBench.
MLLMs excel at high-level tasks but struggle with low-level manipulation.
arXiv Detail & Related papers (2025-02-13T18:11:34Z) - Benchmarking Large and Small MLLMs [71.78055760441256]
Large multimodal language models (MLLMs) have achieved remarkable advancements in understanding and generating multimodal content. However, their deployment faces significant challenges, including slow inference, high computational cost, and impracticality for on-device applications. Small MLLMs, exemplified by the LLaVA-series models and Phi-3-Vision, offer promising alternatives with faster inference, reduced deployment costs, and the ability to handle domain-specific scenarios.
arXiv Detail & Related papers (2025-01-04T07:44:49Z) - COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training [49.2684130383925]
We propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training. It integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. It consistently outperforms previous strong baselines on various zero-shot downstream tasks.
arXiv Detail & Related papers (2024-12-02T18:56:06Z) - Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset [92.99416966226724]
We introduce Facial Identity Unlearning Benchmark (FIUBench), a novel VLM unlearning benchmark designed to robustly evaluate the effectiveness of unlearning algorithms. We apply a two-stage evaluation pipeline that is designed to precisely control the sources of information and their exposure levels. Through the evaluation of four baseline VLM unlearning algorithms within FIUBench, we find that all methods remain limited in their unlearning performance.
arXiv Detail & Related papers (2024-11-05T23:26:10Z) - Protecting Privacy in Multimodal Large Language Models with MLLMU-Bench [17.73279547506514]
We introduce Multimodal Large Language Model Unlearning Benchmark (MLLMU-Bench), a novel benchmark aimed at advancing the understanding of multimodal machine unlearning. MLLMU-Bench consists of 500 fictitious profiles and 153 profiles of public celebrities, each featuring over 14 customized question-answer pairs, evaluated from both multimodal (image+text) and unimodal (text) perspectives. Surprisingly, our experiments show that unimodal unlearning algorithms excel in generation and cloze tasks, while multimodal unlearning approaches perform better in classification tasks with multimodal inputs.
arXiv Detail & Related papers (2024-10-29T15:07:23Z) - CLEAR: Character Unlearning in Textual and Visual Modalities [7.618793381903125]
Multimodal unlearning (MMU) remains underexplored due to the lack of open benchmarks for evaluating cross-modal data removal. CLEAR contains 200 fictitious individuals and 3,700 images linked with corresponding question-answer pairs. We conduct a comprehensive analysis of 11 MU methods across four evaluation sets, demonstrating that jointly unlearning both modalities outperforms single-modality approaches.
arXiv Detail & Related papers (2024-10-23T17:30:50Z) - RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - A Closer Look at Machine Unlearning for Large Language Models [46.245404272612795]
Large language models (LLMs) may memorize sensitive or copyrighted content, raising privacy and legal concerns. We discuss several issues in machine unlearning for LLMs and provide our insights on possible approaches.
arXiv Detail & Related papers (2024-10-10T16:56:05Z) - MLLM-LLaVA-FL: Multimodal Large Language Model Assisted Federated Learning [25.45278447786954]
We introduce a novel federated learning framework named Multimodal Large Language Model Assisted Federated Learning (MLLM-LLaVA-FL).
Our framework is adept at harnessing the extensive, yet previously underexploited, open-source data accessible from websites and powerful server-side computational resources.
arXiv Detail & Related papers (2024-09-09T21:04:16Z) - A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks [74.52259252807191]
Multimodal Large Language Models (MLLMs) address the complexities of real-world applications far beyond the capabilities of single-modality systems.
This paper systematically reviews the applications of MLLMs in multimodal tasks such as natural language, vision, and audio.
arXiv Detail & Related papers (2024-08-02T15:14:53Z) - MU-Bench: A Multitask Multimodal Benchmark for Machine Unlearning [14.755831733659699]
We develop MU-Bench, the first comprehensive benchmark for Machine Unlearning (MU). MU-Bench unifies the sets of deleted samples and trained models, and provides broad coverage of tasks and data modalities. We analyze several under-investigated aspects of unlearning, including scalability, the impacts of parameter-efficient fine-tuning and curriculum learning, and susceptibility to dataset biases.
arXiv Detail & Related papers (2024-06-21T00:13:17Z) - Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models. It addresses two key challenges -- the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z) - MMRel: A Relation Understanding Benchmark in the MLLM Era [72.95901753186227]
Multi-Modal Relation Understanding (MMRel) is a benchmark that features large-scale, high-quality, and diverse data on inter-object relations.
MMRel is ideal for evaluating MLLMs on relation understanding, as well as for fine-tuning MLLMs to enhance relation comprehension capability.
arXiv Detail & Related papers (2024-06-13T13:51:59Z) - MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models [51.19622266249408]
MultiTrust is the first comprehensive and unified benchmark on the trustworthiness of MLLMs. Our benchmark employs a rigorous evaluation strategy that addresses both multimodal risks and cross-modal impacts. Extensive experiments with 21 modern MLLMs reveal some previously unexplored trustworthiness issues and risks.
arXiv Detail & Related papers (2024-06-11T08:38:13Z) - Single Image Unlearning: Efficient Machine Unlearning in Multimodal Large Language Models [16.886116549737956]
We propose an efficient method, Single Image Unlearning (SIU), to unlearn the visual recognition of a concept by fine-tuning on a single associated image for a few steps. Experimental results on MMUBench show that SIU completely surpasses the performance of existing methods.
arXiv Detail & Related papers (2024-05-21T06:27:12Z) - Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning [67.0609518552321]
We propose to conduct Machine Vision Therapy, which aims to rectify the noisy predictions from vision models.
By fine-tuning with the denoised labels, the performance of the learning model can be boosted in an unsupervised manner.
arXiv Detail & Related papers (2023-12-05T07:29:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.