Identity Inference from CLIP Models using Only Textual Data
- URL: http://arxiv.org/abs/2405.14517v1
- Date: Thu, 23 May 2024 12:54:25 GMT
- Title: Identity Inference from CLIP Models using Only Textual Data
- Authors: Songze Li, Ruoxi Cheng, Xiaojun Jia
- Abstract summary: Existing methods for identity inference in CLIP models require querying the model with full PII.
Traditional membership inference attacks (MIAs) train shadow models to mimic the behaviors of the target model.
We propose the textual unimodal detector (TUNI), a novel method for identity inference in CLIP models that queries the target model with only text data.
- Score: 12.497110441765274
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The widespread usage of large-scale multimodal models like CLIP has heightened concerns about the leakage of personally identifiable information (PII). Existing methods for identity inference in CLIP models, i.e., to detect the presence of a person's PII used for training a CLIP model, require querying the model with full PII, including textual descriptions of the person and corresponding images (e.g., the person's name and face photo). However, this may lead to a potential privacy breach of the image, as it may not have been seen by the target model yet. Additionally, traditional membership inference attacks (MIAs) train shadow models to mimic the behaviors of the target model, which incurs high computational costs, especially for large CLIP models. To address these challenges, we propose the textual unimodal detector (TUNI), a novel method for identity inference in CLIP models that 1) queries the target model with only text data; and 2) does not require training shadow models. First, we develop a feature extraction algorithm, guided by the CLIP model, to extract features from a text description. TUNI starts by randomly generating textual gibberish that was clearly not utilized for training, and leverages the resulting feature vectors to train a system of anomaly detectors. During inference, the feature vector of each test text is fed into the anomaly detectors to determine whether the person's PII is in the training set (abnormal) or not (normal). Moreover, TUNI can be further strengthened by integrating real images associated with the tested individuals, if available at the detector. Extensive experiments of TUNI across various CLIP model architectures and datasets demonstrate its superior performance over baselines, despite using only text data.
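A minimal sketch of this text-only pipeline is given below, assuming a public Hugging Face CLIP checkpoint as a stand-in for the audited target model; the gibberish generator and scikit-learn's IsolationForest used here are illustrative assumptions, not the paper's actual feature-extraction algorithm or system of anomaly detectors.

```python
# Sketch of a TUNI-style text-only identity-inference pipeline.
# Assumptions: openai/clip-vit-base-patch32 stands in for the target model;
# IsolationForest stands in for the paper's system of anomaly detectors.
import random
import string

import torch
from sklearn.ensemble import IsolationForest
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def random_gibberish(n_words: int = 5, word_len: int = 8) -> str:
    """A caption that was almost surely never seen during training."""
    return " ".join(
        "".join(random.choices(string.ascii_lowercase, k=word_len))
        for _ in range(n_words)
    )

@torch.no_grad()
def text_features(texts):
    """Query the target model with text only; return unit-norm features."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1).numpy()

# 1) Calibrate on guaranteed non-members: gibberish defines "normal".
gibberish = [random_gibberish() for _ in range(512)]
detector = IsolationForest(random_state=0).fit(text_features(gibberish))

# 2) Texts whose features look anomalous relative to gibberish are
#    flagged as likely training-set members (abnormal = member).
tests = ["a photo of Jane Doe", random_gibberish()]
flags = detector.predict(text_features(tests)) == -1  # -1 = outlier
print(dict(zip(tests, flags.tolist())))
```

Because gibberish captions are guaranteed non-members, the detector is calibrated entirely on known "normal" data and needs no shadow models; any test text whose features deviate from that reference distribution is flagged as a likely member.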
Related papers
- Gibberish is All You Need for Membership Inference Detection in Contrastive Language-Audio Pretraining [3.7144455366570055]
Existing MIAs need audio as input, risking voiceprint exposure and requiring costly shadow models.
We first propose PRMID, a membership inference detector based on the probability ranking given by CLAP, which does not require training shadow models.
We then propose USMID, a textual unimodal speaker-level membership inference detector, which queries the target model using only text data.
arXiv Detail & Related papers (2024-10-24T02:26:57Z)
- Reinforcing Pre-trained Models Using Counterfactual Images [54.26310919385808]
This paper proposes a novel framework to reinforce classification models using language-guided generated counterfactual images.
We identify model weaknesses by testing the model using the counterfactual image dataset.
We employ the counterfactual images as an augmented dataset to fine-tune and reinforce the classification model.
arXiv Detail & Related papers (2024-06-19T08:07:14Z)
- Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval [55.90407811819347]
We consider the task of paraphrased text-to-image retrieval where a model aims to return similar results given a pair of paraphrased queries.
We train a dual-encoder model starting from a language model pretrained on a large text corpus.
Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries.
arXiv Detail & Related papers (2024-05-06T06:30:17Z)
- A Sober Look at the Robustness of CLIPs to Spurious Features [45.87070442259975]
We create a new dataset named CounterAnimal to reveal the reliance of CLIP models on realistic spurious features.
Our evaluations show that the spurious features captured by CounterAnimal are generically learned by CLIP models with different backbones and pre-training data, yet have limited influence on ImageNet models.
arXiv Detail & Related papers (2024-03-18T06:04:02Z)
- Is my Data in your AI Model? Membership Inference Test with Application to Face Images [18.402616111394842]
This article introduces the Membership Inference Test (MINT), a novel approach that aims to empirically assess if given data was used during the training of AI/ML models.
We propose two MINT architectures designed to learn the distinct activation patterns that emerge when an Audited Model is exposed to data used during its training process.
Experiments are carried out using six publicly available databases, comprising over 22 million face images in total.
arXiv Detail & Related papers (2024-02-14T15:09:01Z)
- Random Word Data Augmentation with CLIP for Zero-Shot Anomaly Detection [3.75292409381511]
This paper presents a novel method that leverages a visual-language model, CLIP, as a data source for zero-shot anomaly detection.
Using the generated embeddings as training data, a feed-forward neural network learns to extract features of normal and anomalous samples from CLIP's embeddings.
Experimental results demonstrate that our method achieves state-of-the-art performance without laborious prompt ensembling in zero-shot setups.
arXiv Detail & Related papers (2023-08-22T01:55:03Z)
- Adapting Contrastive Language-Image Pretrained (CLIP) Models for Out-of-Distribution Detection [1.597617022056624]
We present a comprehensive experimental study on pretrained feature extractors for visual out-of-distribution (OOD) detection.
We propose a new, simple, and scalable method called pseudo-label probing (PLP) that adapts vision-language models for OOD detection.
arXiv Detail & Related papers (2023-03-10T10:02:18Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- A Unified Understanding of Deep NLP Models for Text Classification [88.35418976241057]
We have developed a visual analysis tool, DeepNLPVis, to enable a unified understanding of NLP models for text classification.
The key idea is a mutual information-based measure, which provides quantitative explanations on how each layer of a model maintains the information of input words in a sample.
A multi-level visualization, which consists of a corpus-level, a sample-level, and a word-level visualization, supports the analysis from the overall training set to individual samples.
arXiv Detail & Related papers (2022-06-19T08:55:07Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic and can be applied to arbitrary dense prediction systems and various pre-trained visual backbones; a minimal sketch of the pixel-text score-map idea is given after this list.
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
- Unsupervised Noisy Tracklet Person Re-identification [100.85530419892333]
We present a novel selective tracklet learning (STL) approach that can train discriminative person re-id models from unlabelled tracklet data.
This avoids the tedious and costly process of exhaustively labelling person image/tracklet true matching pairs across camera views.
Our method is notably more robust against arbitrary noise in raw tracklets and is therefore scalable to learning discriminative models from unconstrained tracking data.
arXiv Detail & Related papers (2021-01-16T07:31:00Z)
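As referenced in the DenseCLIP entry above, here is a minimal sketch of the pixel-text score-map idea, with random placeholder tensors and assumed shapes rather than DenseCLIP's actual implementation:

```python
# Pixel-text score maps: compare every dense visual feature with every
# class text embedding. Tensors and shapes are illustrative placeholders.
import torch
import torch.nn.functional as F

pixel_feats = torch.randn(2, 512, 32, 32)  # (B, C, H, W) dense image features
text_embeds = torch.randn(10, 512)         # (K, C) one embedding per class prompt

# Cosine similarity via L2 normalization followed by a dot product.
pixel_feats = F.normalize(pixel_feats, dim=1)
text_embeds = F.normalize(text_embeds, dim=-1)
score_maps = torch.einsum("bchw,kc->bkhw", pixel_feats, text_embeds)

print(score_maps.shape)  # torch.Size([2, 10, 32, 32]); can guide a dense-prediction head
```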