Multimodal Unlearnable Examples: Protecting Data against Multimodal Contrastive Learning
- URL: http://arxiv.org/abs/2407.16307v2
- Date: Fri, 26 Jul 2024 08:39:19 GMT
- Title: Multimodal Unlearnable Examples: Protecting Data against Multimodal Contrastive Learning
- Authors: Xinwei Liu, Xiaojun Jia, Yuan Xun, Siyuan Liang, Xiaochun Cao
- Abstract summary: Multimodal contrastive learning (MCL) has shown remarkable advances in zero-shot classification by learning from millions of image-caption pairs crawled from the Internet.
Hackers may exploit image-text data for model training without authorization; such data can include personal and privacy-sensitive information.
Recent works propose generating unlearnable examples by adding imperceptible perturbations to training images to build shortcuts for protection.
We propose Multi-step Error Minimization (MEM), a novel optimization process for generating multimodal unlearnable examples.
- Score: 53.766434746801366
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal contrastive learning (MCL) has shown remarkable advances in zero-shot classification by learning from millions of image-caption pairs crawled from the Internet. However, this reliance poses privacy risks, as hackers may exploit image-text data for model training without authorization, potentially including personal and privacy-sensitive information. Recent works propose generating unlearnable examples by adding imperceptible perturbations to training images to build shortcuts for protection. However, they are designed for unimodal classification, and data protection remains largely unexplored in MCL. We first explore this context by evaluating the performance of existing methods on image-caption pairs and find that they do not generalize effectively to multimodal data: they have limited impact on building shortcuts due to the lack of labels and the dispersion of pairs in MCL. In this paper, we propose Multi-step Error Minimization (MEM), a novel optimization process for generating multimodal unlearnable examples. It extends the Error-Minimization (EM) framework to optimize both image noise and an additional text trigger, thereby enlarging the optimized space and effectively misleading the model into learning a shortcut between the noise features and the text trigger. Specifically, we adopt projected gradient descent to solve the noise-minimization problem and use HotFlip to approximate the gradient and replace words to find the optimal text trigger. Extensive experiments demonstrate the effectiveness of MEM, with post-protection retrieval results nearly half of random guessing, and its high transferability across different models. Our code is available at https://github.com/thinwayliu/Multimodal-Unlearnable-Examples
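The abstract describes MEM as alternating two updates: projected gradient descent (PGD) on the image noise and HotFlip-style token replacement for the text trigger, both driven by the same contrastive loss. Below is a minimal, hedged sketch of those two steps, not the authors' released implementation; it assumes a CLIP-style model whose image encoder, embedding-level text encoder, and token embedding matrix are passed in, and all function and argument names are illustrative.

```python
# Minimal sketch of the two alternating MEM-style updates described in the
# abstract (illustrative only; not the authors' code).
import torch
import torch.nn.functional as F

def info_nce_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of image-text pairs."""
    img_feats = F.normalize(img_feats, dim=-1)
    txt_feats = F.normalize(txt_feats, dim=-1)
    logits = img_feats @ txt_feats.t() / temperature
    targets = torch.arange(len(img_feats), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def pgd_noise_step(image_encoder, txt_feats, images, delta, eps, alpha):
    """One projected-gradient step that *minimizes* the contrastive loss,
    steering the perturbed images toward an easy-to-learn shortcut."""
    delta = delta.detach().requires_grad_(True)
    loss = info_nce_loss(image_encoder(images + delta), txt_feats)
    loss.backward()
    with torch.no_grad():
        delta = delta - alpha * delta.grad.sign()   # descend (error minimization)
        delta = delta.clamp(-eps, eps)              # L_inf projection
    return delta

def hotflip_trigger_step(text_embed_encoder, embedding_matrix,
                         token_embeds, trigger_positions, img_feats):
    """HotFlip-style update: use the gradient w.r.t. trigger-token embeddings
    to pick replacement tokens that further lower the contrastive loss."""
    token_embeds = token_embeds.detach().requires_grad_(True)
    loss = info_nce_loss(img_feats, text_embed_encoder(token_embeds))
    loss.backward()
    grad = token_embeds.grad[:, trigger_positions, :]            # (B, T, D)
    # First-order score of swapping each trigger slot to each vocabulary
    # token; higher score ~ larger loss decrease.
    scores = torch.einsum("btd,vd->btv", -grad, embedding_matrix)
    return scores.argmax(dim=-1)                                 # (B, T)
```

In a multi-step procedure like the one the abstract names, these two updates would be alternated for several rounds per batch so the image noise and the text trigger reinforce the same shortcut.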
Related papers
- MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling [64.09238330331195]
We propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework.
Unlike discretization-based methods, MMAR takes in continuous-valued image tokens to avoid information loss.
We show that MMAR achieves substantially better performance than other joint multi-modal models.
arXiv Detail & Related papers (2024-10-14T17:57:18Z) - SCA: Highly Efficient Semantic-Consistent Unrestricted Adversarial Attack [29.744970741737376]
We propose a novel framework called Semantic-Consistent Unrestricted Adversarial Attacks (SCA).
SCA employs an inversion method to extract edit-friendly noise maps and utilizes a Multimodal Large Language Model (MLLM) to provide semantic guidance.
Our framework enables the efficient generation of adversarial examples that exhibit minimal discernible semantic changes.
arXiv Detail & Related papers (2024-10-03T06:25:53Z) - Semantic Deep Hiding for Robust Unlearnable Examples [33.68037533119807]
Unlearnable examples are proposed to mislead deep learning models and prevent data from unauthorized exploitation.
We propose a Deep Hiding scheme that adaptively hides semantic images enriched with high-level features.
Our proposed method exhibits outstanding robustness for unlearnable examples, demonstrating its efficacy in preventing unauthorized data exploitation.
arXiv Detail & Related papers (2024-06-25T08:05:42Z) - Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning [67.0609518552321]
We propose to conduct Machine Vision Therapy which aims to rectify the noisy predictions from vision models.
By fine-tuning with the denoised labels, the model's performance can be boosted in an unsupervised manner.
arXiv Detail & Related papers (2023-12-05T07:29:14Z) - MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can mitigate the need of Vision Transformer networks for very large annotated datasets.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z) - CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning [63.72975421109622]
CleanCLIP is a finetuning framework that weakens the learned spurious associations introduced by backdoor attacks.
CleanCLIP maintains model performance on benign examples while erasing a range of backdoor attacks on multimodal contrastive learning.
arXiv Detail & Related papers (2023-03-06T17:48:32Z) - VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix [59.25846149124199]
This paper proposes a data augmentation method, namely cross-modal CutMix (CMC).
CMC transforms natural sentences from the textual view into a multi-modal view.
By attaching cross-modal noise on uni-modal data, it guides models to learn token-level interactions across modalities for better denoising.
arXiv Detail & Related papers (2022-06-17T17:56:47Z) - Mitigating Modality Collapse in Multimodal VAEs via Impartial Optimization [7.4262579052708535]
We argue that this effect is a consequence of conflicting gradients during multimodal VAE training.
We show how to detect the sub-graphs in the computational graphs where gradients conflict.
We empirically show that our framework significantly improves the reconstruction performance, conditional generation, and coherence of the latent space across modalities.
arXiv Detail & Related papers (2022-06-09T13:29:25Z)