Identity Inference from CLIP Models using Only Textual Data
- URL: http://arxiv.org/abs/2405.14517v1
- Date: Thu, 23 May 2024 12:54:25 GMT
- Title: Identity Inference from CLIP Models using Only Textual Data
- Authors: Songze Li, Ruoxi Cheng, Xiaojun Jia
- Abstract summary: Existing methods for identity inference in CLIP models require querying the model with full PII.
Traditional membership inference attacks (MIAs) train shadow models to mimic the behaviors of the target model.
We propose the textual unimodal detector (TUNI), a novel method for identity inference in CLIP models that queries the target model with only text data.
- Score: 12.497110441765274
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The widespread usage of large-scale multimodal models like CLIP has heightened concerns about the leakage of personally identifiable information (PII). Existing methods for identity inference in CLIP models, i.e., to detect the presence of a person's PII used for training a CLIP model, require querying the model with full PII, including textual descriptions of the person and corresponding images (e.g., the person's name and face photo). However, this may lead to a potential privacy breach of the image, as it may not have been seen by the target model yet. Additionally, traditional membership inference attacks (MIAs) train shadow models to mimic the behaviors of the target model, which incurs high computational costs, especially for large CLIP models. To address these challenges, we propose the textual unimodal detector (TUNI), a novel method for identity inference in CLIP models that 1) queries the target model with only text data; and 2) does not require training shadow models. First, we develop a feature extraction algorithm, guided by the CLIP model, to extract features from a text description. TUNI starts by randomly generating textual gibberish that was clearly not utilized for training, and leverages the resulting feature vectors to train a system of anomaly detectors. During inference, the feature vector of each test text is fed into the anomaly detectors to determine whether the person's PII is in the training set (abnormal) or not (normal). Moreover, TUNI can be further strengthened by integrating real images associated with the tested individuals, if available at the detector. Extensive experiments of TUNI across various CLIP model architectures and datasets demonstrate its superior performance over baselines, despite using only text data.
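A minimal sketch of this text-only pipeline is given below, assuming a public Hugging Face CLIP checkpoint as a stand-in for the audited target model; the gibberish generator and scikit-learn's IsolationForest used here are illustrative assumptions, not the paper's actual feature-extraction algorithm or system of anomaly detectors.

```python
# Sketch of a TUNI-style text-only identity-inference pipeline.
# Assumptions: openai/clip-vit-base-patch32 stands in for the target model;
# IsolationForest stands in for the paper's system of anomaly detectors.
import random
import string

import torch
from sklearn.ensemble import IsolationForest
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def random_gibberish(n_words: int = 5, word_len: int = 8) -> str:
    """A caption that was almost surely never seen during training."""
    return " ".join(
        "".join(random.choices(string.ascii_lowercase, k=word_len))
        for _ in range(n_words)
    )

@torch.no_grad()
def text_features(texts):
    """Query the target model with text only; return unit-norm features."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1).numpy()

# 1) Calibrate on guaranteed non-members: gibberish defines "normal".
gibberish = [random_gibberish() for _ in range(512)]
detector = IsolationForest(random_state=0).fit(text_features(gibberish))

# 2) Texts whose features look anomalous relative to gibberish are
#    flagged as likely training-set members (abnormal = member).
tests = ["a photo of Jane Doe", random_gibberish()]
flags = detector.predict(text_features(tests)) == -1  # -1 = outlier
print(dict(zip(tests, flags.tolist())))
```

Because gibberish captions are guaranteed non-members, the detector is calibrated entirely on known "normal" data and needs no shadow models; any test text whose features deviate from that reference distribution is flagged as a likely member.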
Related papers
- Gibberish is All You Need for Membership Inference Detection in Contrastive Language-Audio Pretraining [3.7144455366570055]
Existing MIAs need audio as input, risking voiceprint exposure and requiring costly shadow models.
We first propose PRMID, a membership inference detector based on the probability ranking given by CLAP, which does not require training shadow models.
We then propose USMID, a textual unimodal speaker-level membership inference detector, which queries the target model using only text data.
arXiv Detail & Related papers (2024-10-24T02:26:57Z)
- Reinforcing Pre-trained Models Using Counterfactual Images [54.26310919385808]
This paper proposes a novel framework to reinforce classification models using language-guided generated counterfactual images.
We identify model weaknesses by testing the model using the counterfactual image dataset.
We employ the counterfactual images as an augmented dataset to fine-tune and reinforce the classification model.
arXiv Detail & Related papers (2024-06-19T08:07:14Z)
- Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval [55.90407811819347]
We consider the task of paraphrased text-to-image retrieval where a model aims to return similar results given a pair of paraphrased queries.
We train a dual-encoder model starting from a language model pretrained on a large text corpus.
Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries.
arXiv Detail & Related papers (2024-05-06T06:30:17Z)
- A Sober Look at the Robustness of CLIPs to Spurious Features [45.87070442259975]
We create a new dataset named CounterAnimal to reveal the reliance of CLIP models on realistic spurious features.
Our evaluations show that the spurious features captured by CounterAnimal are generically learned by CLIP models with different backbones and pre-training data, yet have limited influence on ImageNet models.
arXiv Detail & Related papers (2024-03-18T06:04:02Z)
- Is my Data in your AI Model? Membership Inference Test with Application to Face Images [18.402616111394842]
This article introduces the Membership Inference Test (MINT), a novel approach that aims to empirically assess if given data was used during the training of AI/ML models.
We propose two MINT architectures designed to learn the distinct activation patterns that emerge when an Audited Model is exposed to data used during its training process.
Experiments are carried out using six publicly available databases, comprising over 22 million face images in total.
arXiv Detail & Related papers (2024-02-14T15:09:01Z)
- Random Word Data Augmentation with CLIP for Zero-Shot Anomaly Detection [3.75292409381511]
This paper presents a novel method that leverages a visual-language model, CLIP, as a data source for zero-shot anomaly detection.
Using the generated embeddings as training data, a feed-forward neural network learns to extract features of normal and anomalous samples from CLIP's embeddings.
Experimental results demonstrate that our method achieves state-of-the-art performance without laborious prompt ensembling in zero-shot setups.
arXiv Detail & Related papers (2023-08-22T01:55:03Z)
- Adapting Contrastive Language-Image Pretrained (CLIP) Models for Out-of-Distribution Detection [1.597617022056624]
We present a comprehensive experimental study on pretrained feature extractors for visual out-of-distribution (OOD) detection.
We propose a new, simple, and scalable method called pseudo-label probing (PLP) that adapts vision-language models for OOD detection.
arXiv Detail & Related papers (2023-03-10T10:02:18Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- A Unified Understanding of Deep NLP Models for Text Classification [88.35418976241057]
We have developed a visual analysis tool, DeepNLPVis, to enable a unified understanding of NLP models for text classification.
The key idea is a mutual information-based measure, which provides quantitative explanations on how each layer of a model maintains the information of input words in a sample.
A multi-level visualization, which consists of a corpus-level, a sample-level, and a word-level visualization, supports the analysis from the overall training set to individual samples.
arXiv Detail & Related papers (2022-06-19T08:55:07Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic and can be applied to arbitrary dense prediction systems and various pre-trained visual backbones; a minimal sketch of the pixel-text score-map idea is given after this list.
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
- Unsupervised Noisy Tracklet Person Re-identification [100.85530419892333]
We present a novel selective tracklet learning (STL) approach that can train discriminative person re-id models from unlabelled tracklet data.
This avoids the tedious and costly process of exhaustively labelling person image/tracklet true matching pairs across camera views.
Our method is notably more robust against arbitrary noise in raw tracklets and is therefore scalable to learning discriminative models from unconstrained tracking data.
arXiv Detail & Related papers (2021-01-16T07:31:00Z)
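As referenced in the DenseCLIP entry above, here is a minimal sketch of the pixel-text score-map idea, with random placeholder tensors and assumed shapes rather than DenseCLIP's actual implementation:

```python
# Pixel-text score maps: compare every dense visual feature with every
# class text embedding. Tensors and shapes are illustrative placeholders.
import torch
import torch.nn.functional as F

pixel_feats = torch.randn(2, 512, 32, 32)  # (B, C, H, W) dense image features
text_embeds = torch.randn(10, 512)         # (K, C) one embedding per class prompt

# Cosine similarity via L2 normalization followed by a dot product.
pixel_feats = F.normalize(pixel_feats, dim=1)
text_embeds = F.normalize(text_embeds, dim=-1)
score_maps = torch.einsum("bchw,kc->bkhw", pixel_feats, text_embeds)

print(score_maps.shape)  # torch.Size([2, 10, 32, 32]); can guide a dense-prediction head
```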