FoolHD: Fooling speaker identification by Highly imperceptible
adversarial Disturbances
- URL: http://arxiv.org/abs/2011.08483v2
- Date: Sat, 20 Feb 2021 12:15:25 GMT
- Title: FoolHD: Fooling speaker identification by Highly imperceptible
adversarial Disturbances
- Authors: Ali Shahin Shamsabadi, Francisco Sepúlveda Teixeira, Alberto Abad,
Bhiksha Raj, Andrea Cavallaro, Isabel Trancoso
- Abstract summary: We propose a white-box steganography-inspired adversarial attack that generates imperceptible perturbations against a speaker identification model.
Our approach, FoolHD, uses a Gated Convolutional Autoencoder that operates in the DCT domain and is trained with a multi-objective loss function.
We validate FoolHD with a 250-speaker identification x-vector network, trained using VoxCeleb, in terms of accuracy, success rate, and imperceptibility.
- Score: 63.80959552818541
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speaker identification models are vulnerable to carefully designed
adversarial perturbations of their input signals that induce misclassification.
In this work, we propose a white-box steganography-inspired adversarial attack
that generates imperceptible adversarial perturbations against a speaker
identification model. Our approach, FoolHD, uses a Gated Convolutional
Autoencoder that operates in the DCT domain and is trained with a
multi-objective loss function, in order to generate and conceal the adversarial
perturbation within the original audio files. In addition to hindering speaker
identification performance, this multi-objective loss accounts for human
perception through a frame-wise cosine similarity between MFCC feature vectors
extracted from the original and adversarial audio files. We validate the
effectiveness of FoolHD with a 250-speaker identification x-vector network,
trained using VoxCeleb, in terms of accuracy, success rate, and
imperceptibility. Our results show that FoolHD generates highly imperceptible
adversarial audio files (average PESQ scores above 4.30), while achieving a
success rate of 99.6% and 99.2% in misleading the speaker identification model,
for untargeted and targeted settings, respectively.
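As an illustration of the perceptual term in this multi-objective loss, here is a minimal PyTorch sketch, assuming a generic MFCC configuration; the feature settings, function names, and weighting are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch (not the authors' code) of a frame-wise cosine-similarity
# loss between MFCC features of the original and adversarial audio.
import torch
import torch.nn.functional as F
import torchaudio

# Assumed MFCC configuration; the paper's exact settings may differ.
mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=24)

def perceptual_loss(original: torch.Tensor, adversarial: torch.Tensor) -> torch.Tensor:
    """1 - mean frame-wise cosine similarity; inputs are (batch, samples) waveforms."""
    feats_o = mfcc(original)      # (batch, n_mfcc, frames)
    feats_a = mfcc(adversarial)
    sim = F.cosine_similarity(feats_o, feats_a, dim=1)  # per-frame similarity
    return 1.0 - sim.mean()       # small when the adversarial audio sounds alike
```

In FoolHD this perceptual term is balanced against an adversarial classification term inside the multi-objective loss; that weighting is not reproduced here.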
Related papers
- What to Remember: Self-Adaptive Continual Learning for Audio Deepfake
Detection [53.063161380423715]
Existing detection models have shown remarkable success in discriminating known deepfake audio, but struggle when encountering new attack types.
We propose a continual learning approach called Radian Weight Modification (RWM) for audio deepfake detection.
arXiv Detail & Related papers (2023-12-15T09:52:17Z)
- Meta-Learning Framework for End-to-End Imposter Identification in Unseen Speaker Recognition [4.143603294943441]
We expose the generalization problem of using fixed thresholds (computed with the EER metric) for imposter identification in unseen speaker recognition.
We then introduce a robust speaker-specific thresholding technique for better performance.
We show the efficacy of the proposed techniques on VoxCeleb1, VCTK and the FFSVC 2022 datasets, beating the baselines by up to 10%.
arXiv Detail & Related papers (2023-06-01T17:49:58Z)
- Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
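A minimal sketch of what ID-level contrastive pretraining could look like, assuming an NT-Xent-style objective in which clips sharing a machine ID are positives; names and the temperature are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def machine_id_contrastive_loss(emb: torch.Tensor, ids: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """emb: (n, dim) clip embeddings; ids: (n,) machine-ID labels.
    Assumes every ID in the batch appears at least twice."""
    emb = F.normalize(emb, dim=1)
    sim = emb @ emb.t() / temperature                  # pairwise similarities
    self_mask = torch.eye(len(ids), dtype=torch.bool, device=emb.device)
    pos_mask = (ids.unsqueeze(0) == ids.unsqueeze(1)) & ~self_mask
    log_prob = sim.masked_fill(self_mask, float('-inf')).log_softmax(dim=1)
    return -log_prob[pos_mask].mean()                  # pull same-ID clips together
```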
arXiv Detail & Related papers (2023-04-07T11:08:31Z)
- Symmetric Saliency-based Adversarial Attack To Speaker Identification [17.087523686496958]
We propose a novel generation-network-based approach, called the symmetric saliency-based encoder-decoder (SSED).
First, it uses a novel saliency map decoder to learn the importance of speech samples to the decision of a targeted speaker identification system.
Second, it applies an angular loss function to push the speaker embedding far away from the source speaker.
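A minimal sketch of such an angular push-away term, assuming it reduces the cosine similarity between the adversarial utterance's embedding and the source speaker's embedding; the exact SSED formulation may differ.

```python
import torch
import torch.nn.functional as F

def angular_push_loss(adv_emb: torch.Tensor, src_emb: torch.Tensor) -> torch.Tensor:
    """Decreases as the angle between the adversarial and source-speaker
    embeddings widens; both inputs are (batch, dim)."""
    return F.cosine_similarity(adv_emb, src_emb, dim=-1).mean()
```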
arXiv Detail & Related papers (2022-10-30T08:54:02Z)
- Dictionary Attacks on Speaker Verification [15.00667613025837]
We introduce a generic formulation of the attack that can be used with various speech representations and threat models.
The attacker uses adversarial optimization to maximize raw similarity of speaker embeddings between a seed speech sample and a proxy population.
We show that, combined with multiple attempts, this attack raises even more serious concerns about the security of these systems.
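A minimal sketch of one step of that optimization, assuming a differentiable speaker-embedding model `embed` and a matrix `pop_embs` of proxy-population embeddings; both are hypothetical stand-ins, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def dictionary_attack_step(seed: torch.Tensor, delta: torch.Tensor,
                           pop_embs: torch.Tensor, embed, lr: float = 1e-3):
    """One gradient step pushing embed(seed + delta) toward the proxy population.
    seed: (1, samples) waveform; pop_embs: (n, dim) population embeddings."""
    delta = delta.detach().requires_grad_(True)
    adv_emb = embed(seed + delta)                               # (1, dim)
    sims = F.cosine_similarity(adv_emb.expand_as(pop_embs), pop_embs)
    loss = -sims.mean()                                         # maximize mean similarity
    loss.backward()
    return (delta - lr * delta.grad).detach()                   # updated perturbation
```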
arXiv Detail & Related papers (2022-04-24T15:31:41Z)
- Attack on practical speaker verification system using universal adversarial perturbations [20.38185341318529]
This work shows that, when our crafted adversarial perturbation is played as a separate source while the adversary is speaking, a practical speaker verification system will misjudge the adversary as a target speaker.
A two-step algorithm is proposed to optimize the universal adversarial perturbation so that it is text-independent and has little effect on recognition of the authentication text.
arXiv Detail & Related papers (2021-05-19T09:43:34Z)
- Combating Adversaries with Anti-Adversaries [118.70141983415445]
In particular, our layer generates an input perturbation in the opposite direction of the adversarial one.
We verify the effectiveness of our approach by combining our layer with both nominally and robustly trained models.
Our anti-adversary layer significantly enhances model robustness while coming at no cost on clean accuracy.
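A minimal sketch of the idea, assuming a few signed-gradient steps that reinforce the model's own initial prediction, i.e., the opposite direction from a gradient-ascent attack; step count and budget here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def anti_adversary_forward(model, x: torch.Tensor, steps: int = 2,
                           eps: float = 8 / 255) -> torch.Tensor:
    """Perturb x to *reinforce* the model's own prediction, then classify."""
    y_hat = model(x).argmax(dim=1)                      # initial prediction
    delta = torch.zeros_like(x)
    for _ in range(steps):
        delta = delta.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x + delta), y_hat)
        (grad,) = torch.autograd.grad(loss, delta)
        delta = delta - (eps / steps) * grad.sign()     # descend, not ascend
    return model(x + delta.detach())
```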
arXiv Detail & Related papers (2021-03-26T09:36:59Z)
- Multimodal Attention Fusion for Target Speaker Extraction [108.73502348754842]
We propose a novel attention mechanism for multi-modal fusion and its training methods.
Our proposals improve the signal-to-distortion ratio (SDR) by 1.0 dB over conventional fusion mechanisms on simulated data.
arXiv Detail & Related papers (2021-02-02T05:59:35Z)
- Adversarially Training for Audio Classifiers [9.868221447090853]
We show that the ResNet-56 model trained on the 2D representation of the discrete wavelet transform with the tonnetz chromagram outperforms other models in terms of recognition accuracy.
We run our experiments on two benchmark environmental sound datasets and show that, without any imposed limitations on the adversary's budget allocation, the fooling rate of the adversarially trained models can exceed 90%.
arXiv Detail & Related papers (2020-08-26T15:15:32Z)
- Temporal Sparse Adversarial Attack on Sequence-based Gait Recognition [56.844587127848854]
We demonstrate that the state-of-the-art gait recognition model is vulnerable to such attacks.
We employ a generative adversarial network based architecture to semantically generate adversarial high-quality gait silhouettes or video frames.
The experimental results show that if only one-fortieth of the frames are attacked, the accuracy of the target model drops dramatically.
arXiv Detail & Related papers (2020-02-22T10:08:42Z)