Gibberish is All You Need for Membership Inference Detection in Contrastive Language-Audio Pretraining
- URL: http://arxiv.org/abs/2410.18371v2
- Date: Sat, 02 Nov 2024 10:00:52 GMT
- Title: Gibberish is All You Need for Membership Inference Detection in Contrastive Language-Audio Pretraining
- Authors: Ruoxi Cheng, Yizhong Ding, Shuirong Cao, Shitong Shao, Zhiqiang Wang,
- Abstract summary: Existing MIAs need audio as input, risking exposure of voiceprint and requiring costly shadow models.
We first propose PRMID, a membership inference detector based probability ranking given by CLAP, which does not require training shadow models.
We then propose USMID, a textual unimodal speaker-level membership inference detector, querying the target model using only text data.
- Score: 3.7144455366570055
- License:
- Abstract: Audio can disclose PII, particularly when combined with related text data. Therefore, it is essential to develop tools to detect privacy leakage in Contrastive Language-Audio Pretraining (CLAP). Existing MIAs need audio as input, risking exposure of voiceprints and requiring costly shadow models. We first propose PRMID, a membership inference detector based on the probability ranking given by CLAP, which does not require training shadow models but still requires both audio and text of the individual as input. To address these limitations, we then propose USMID, a textual unimodal speaker-level membership inference detector that queries the target model using only text data. We randomly generate textual gibberish that is clearly not in the training dataset. We then extract feature vectors from these texts using the CLAP model and train a set of anomaly detectors on them. During inference, the feature vector of each test text is fed into the anomaly detectors to determine whether the speaker is in the training set (anomalous) or not (normal). If real audio of the tested speaker is available, USMID can further enhance detection by integrating it. Extensive experiments on various CLAP model architectures and datasets demonstrate that USMID outperforms baseline methods while using only text data.
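To make the text-only pipeline above concrete, here is a minimal sketch of the gibberish-calibrated detection loop. It is an illustration rather than the authors' implementation: the CLAP checkpoint (laion/clap-htsat-unfused from Hugging Face transformers), the single IsolationForest anomaly detector, and the speaker query template are all assumptions.

```python
# Minimal sketch of a USMID-style, text-only membership check (illustrative, not the
# authors' implementation). Checkpoint, detector choice, and query text are assumptions.
import random
import string

import torch
from sklearn.ensemble import IsolationForest
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

def clap_text_features(texts):
    """Embed a list of strings with the CLAP text encoder."""
    inputs = processor(text=texts, return_tensors="pt", padding=True)
    with torch.no_grad():
        return model.get_text_features(**inputs).numpy()

def random_gibberish(n, length=12):
    """Random character strings that are almost surely absent from any training caption."""
    alphabet = string.ascii_lowercase + " "
    return ["".join(random.choices(alphabet, k=length)) for _ in range(n)]

# 1) Fit an anomaly detector on gibberish embeddings; they define the "non-member" region.
detector = IsolationForest(random_state=0).fit(clap_text_features(random_gibberish(200)))

# 2) Query with text naming the tested speaker (hypothetical template) and flag membership
#    when the embedding is anomalous relative to the gibberish distribution.
query = ["a voice recording of speaker Jane Doe"]
is_member = detector.predict(clap_text_features(query))[0] == -1  # -1 = anomaly
print("predicted training-set member:", is_member)
```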
Related papers
- Identity Inference from CLIP Models using Only Textual Data [12.497110441765274]
Existing methods for identity inference in CLIP models require querying the model with full PII.
Traditional membership inference attacks (MIAs) train shadow models to mimic the behaviors of the target model.
We propose TUNI, a textual unimodal detector for CLIP models, a novel method for ID inference that queries the target model with only text data.
arXiv Detail & Related papers (2024-05-23T12:54:25Z)
- Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models [52.04189118767758]
Generalization is a main issue for current audio deepfake detectors.
In this paper we study the potential of large-scale pre-trained models for audio deepfake detection.
arXiv Detail & Related papers (2024-05-03T15:27:11Z)
- Introducing Model Inversion Attacks on Automatic Speaker Recognition [0.9558392439655015]
Model inversion (MI) attacks allow an adversary to reconstruct average per-class representations of a machine learning (ML) model's training data.
We present an approach to (1) reconstruct audio samples from a trained ML model and (2) extract intermediate voice feature representations which provide valuable insights into the speakers' biometrics.
Our sliding MI extends standard MI by iteratively inverting overlapping chunks of the audio samples.
We show that one can use the inverted audio data to generate spoofed audio samples to impersonate a speaker, and execute voice-protected commands for highly secured systems.
arXiv Detail & Related papers (2023-01-09T08:51:15Z)
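As a rough illustration of the sliding inversion described in the entry above, the sketch below runs gradient-based inversion on overlapping waveform chunks and overlap-adds the results. The dummy classifier, chunk length, hop size, and optimization schedule are placeholders, not the paper's configuration.

```python
# Sketch of "sliding" model inversion: invert overlapping audio chunks by gradient
# ascent on the target speaker's score, then overlap-add the chunks. The classifier
# below is a stand-in; a real attack would target a trained speaker-recognition model.
import torch
import torch.nn as nn

classifier = nn.Sequential(nn.Linear(16000, 64), nn.ReLU(), nn.Linear(64, 10))  # dummy

def invert_chunk(target_class, chunk_len=16000, steps=100, lr=1e-2):
    """Optimize one waveform chunk to maximize the classifier's score for target_class."""
    x = torch.zeros(1, chunk_len, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -classifier(x)[0, target_class]
        loss.backward()
        opt.step()
    return x.detach()[0]

def sliding_inversion(target_class, total_len=48000, chunk_len=16000, hop=8000):
    """Iterate inversion over overlapping chunks and average them into one waveform."""
    audio = torch.zeros(total_len)
    weight = torch.zeros(total_len)
    for start in range(0, total_len - chunk_len + 1, hop):
        audio[start:start + chunk_len] += invert_chunk(target_class, chunk_len)
        weight[start:start + chunk_len] += 1.0
    return audio / weight.clamp(min=1.0)

reconstructed = sliding_inversion(target_class=3)
```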
- CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos [44.14061539284888]
We propose to approach text-queried universal sound separation by using only unlabeled data.
The proposed CLIPSep model first encodes the input query into a query vector using the contrastive language-image pretraining (CLIP) model.
While the model is trained on image-audio pairs extracted from unlabeled videos, at test time we can instead query the model with text inputs in a zero-shot setting.
arXiv Detail & Related papers (2022-12-14T07:21:45Z)
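The zero-shot trick in the CLIPSep entry above relies on CLIP's shared text-image embedding space: a separator conditioned on CLIP image embeddings during training can be queried with CLIP text embeddings at test time. The sketch below assumes the openai/clip-vit-base-patch32 checkpoint and a toy mask predictor; it does not reproduce the actual CLIPSep network.

```python
# Sketch of text-queried conditioning in the spirit of CLIPSep (not the original code).
# The query vector comes from CLIP; the "separator" here is a toy mask predictor.
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class ToySeparator(nn.Module):
    """Predicts a spectrogram mask from a mixture spectrogram and a CLIP query vector."""
    def __init__(self, n_freq=513, query_dim=512):
        super().__init__()
        self.film = nn.Linear(query_dim, n_freq)      # query -> per-frequency gains
        self.mask = nn.Conv1d(n_freq, n_freq, kernel_size=3, padding=1)

    def forward(self, mixture_spec, query_vec):
        # mixture_spec: (batch, n_freq, n_frames); query_vec: (batch, query_dim)
        conditioned = mixture_spec * self.film(query_vec).unsqueeze(-1)
        return torch.sigmoid(self.mask(conditioned))  # mask in [0, 1]

separator = ToySeparator()
# Zero-shot text query at test time: embed the text with CLIP's text tower.
inputs = clip_processor(text=["the sound of an acoustic guitar"], return_tensors="pt", padding=True)
query_vec = clip.get_text_features(**inputs)          # (1, 512)
mask = separator(torch.rand(1, 513, 100), query_vec)  # apply mask to the mixture STFT
```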
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
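A toy rendering of the SpeechUT wiring described above, in which a speech encoder and a text decoder are bridged by a shared unit encoder over discrete hidden units. Module sizes are arbitrary placeholders and the real pre-training objectives (speech-to-unit, unit-to-text, masked unit modeling) are only hinted at in the comments.

```python
# Toy sketch of the SpeechUT-style bridge: speech encoder -> shared unit encoder -> text decoder.
# Dimensions and modules are placeholders, not the actual architecture.
import torch
import torch.nn as nn

class SpeechUTSketch(nn.Module):
    def __init__(self, n_units=500, d=256, vocab=1000):
        super().__init__()
        self.speech_encoder = nn.GRU(input_size=80, hidden_size=d, batch_first=True)
        self.unit_embedding = nn.Embedding(n_units, d)
        self.unit_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)
        self.text_embedding = nn.Embedding(vocab, d)
        self.text_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d, nhead=4, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(d, vocab)

    def encode_speech(self, fbank):
        """Speech path: acoustic frames -> speech encoder -> shared unit encoder."""
        states, _ = self.speech_encoder(fbank)
        return self.unit_encoder(states)

    def encode_units(self, unit_ids):
        """Unit path: discrete hidden units -> the same shared unit encoder (the bridge)."""
        return self.unit_encoder(self.unit_embedding(unit_ids))

    def decode_text(self, memory, text_ids):
        """Text decoder attends to the bridged representation (ASR/ST-style decoding)."""
        return self.lm_head(self.text_decoder(self.text_embedding(text_ids), memory))

model = SpeechUTSketch()
memory = model.encode_speech(torch.rand(2, 120, 80))   # 2 utterances of 120 filterbank frames
logits = model.decode_text(memory, torch.randint(0, 1000, (2, 10)))
```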
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
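For the entry above, the core component is an autoregressive LSTM language model over discrete linguistic units such as phonemes or syllables. The minimal next-unit LM below is illustrative; the unit inventory, layer sizes, and the auxiliary multitask heads mentioned in the abstract are not the paper's exact setup.

```python
# Minimal autoregressive LSTM language model over discrete linguistic units
# (e.g. phoneme or syllable IDs). Sizes are placeholders, not the paper's configuration.
import torch
import torch.nn as nn

class UnitLM(nn.Module):
    def __init__(self, n_units=64, d_embed=128, d_hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_units, d_embed)
        self.lstm = nn.LSTM(d_embed, d_hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(d_hidden, n_units)  # predicts the next unit

    def forward(self, unit_ids):
        h, _ = self.lstm(self.embed(unit_ids))
        return self.head(h)

model = UnitLM()
units = torch.randint(0, 64, (8, 50))              # batch of phoneme-ID sequences
logits = model(units[:, :-1])                      # teacher forcing on all but the last unit
loss = nn.functional.cross_entropy(logits.reshape(-1, 64), units[:, 1:].reshape(-1))
loss.backward()                                    # one training step (optimizer omitted)
```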
- A study on the efficacy of model pre-training in developing neural text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
- Wake Word Detection with Alignment-Free Lattice-Free MMI [66.12175350462263]
Always-on spoken language interfaces, e.g. personal digital assistants, rely on a wake word to start processing spoken input.
We present novel methods to train a hybrid DNN/HMM wake word detection system from partially labeled training data.
We evaluate our methods on two real data sets, showing 50%--90% reduction in false rejection rates at pre-specified false alarm rates over the best previously published figures.
arXiv Detail & Related papers (2020-05-17T19:22:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.