EmoAttack: Utilizing Emotional Voice Conversion for Speech Backdoor Attacks on Deep Speech Classification Models
- URL: http://arxiv.org/abs/2408.15508v2
- Date: Fri, 6 Sep 2024 07:46:30 GMT
- Title: EmoAttack: Utilizing Emotional Voice Conversion for Speech Backdoor Attacks on Deep Speech Classification Models
- Authors: Wenhan Yao, Zedong Xing, Xiarun Chen, Jia Liu, Yongqiang He, Weiping Wen
- Abstract summary: Speech backdoor attacks can strategically focus on emotion, a higher-level subjective perceptual attribute inherent in speech.
The EmoAttack method shows effective triggers, with a remarkable attack success rate and accuracy variance.
- Score: 4.164975438207411
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Deep speech classification tasks, mainly keyword spotting and speaker verification, play a crucial role in speech-based human-computer interaction. Recently, these technologies have been shown to be vulnerable to backdoor attacks. Specifically, existing triggers attack speech samples through noise disruption and component modification. We suggest that speech backdoor attacks can strategically focus on emotion, a higher-level subjective perceptual attribute inherent in speech. Furthermore, we propose that emotional voice conversion technology can serve as the speech backdoor attack trigger; the method is called EmoAttack. Based on this, we conducted attack experiments on two speech classification tasks, showing that EmoAttack triggers are effective, with a remarkable attack success rate and accuracy variance. Additionally, the ablation experiments found that speech with intense emotion is more suitable to be targeted for attacks.
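The abstract reports results in terms of attack success rate and accuracy variance without defining them. As a rough sketch only, the snippet below uses definitions that are common in the backdoor literature and are assumed here rather than taken from the paper: attack success rate is the fraction of triggered, non-target-class samples that the backdoored model assigns to the attacker's target label, and accuracy variance is the absolute change in clean-data accuracy between a benign model and the backdoored one. All function names are hypothetical.

```python
from typing import Sequence


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    if len(preds) != len(labels) or not labels:
        raise ValueError("preds and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def attack_success_rate(
    preds_on_triggered: Sequence[int],
    true_labels: Sequence[int],
    target_label: int,
) -> float:
    """Assumed definition: share of triggered samples predicted as the target.

    Samples whose true label already equals the target are skipped, since
    predicting the target for them does not indicate a successful attack.
    """
    pairs = [(p, y) for p, y in zip(preds_on_triggered, true_labels) if y != target_label]
    if not pairs:
        raise ValueError("no non-target samples to evaluate")
    return sum(p == target_label for p, _ in pairs) / len(pairs)


def accuracy_variance(benign_model_acc: float, backdoored_model_acc: float) -> float:
    """Assumed definition: absolute change in clean accuracy caused by poisoning."""
    return abs(benign_model_acc - backdoored_model_acc)
```

Under these assumed definitions, an effective and stealthy backdoor would show a high attack success rate together with a small accuracy variance.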
Related papers
- SPBA: Utilizing Speech Large Language Model for Backdoor Attacks on Speech Classification Models [4.67675814519416]
Speech-based human-computer interaction is vulnerable to backdoor attacks. In this paper, we propose that speech backdoor attacks can strategically focus on speech elements such as timbre and emotion. The proposed attack is called the Speech Prompt Backdoor Attack (SPBA).
arXiv Detail & Related papers (2025-06-10T02:01:00Z)
- Can DeepFake Speech be Reliably Detected? [17.10792531439146]
This work presents the first systematic study of active malicious attacks against state-of-the-art open-source speech detectors.
The results highlight the urgent need for more robust detection methods in the face of evolving adversarial threats.
arXiv Detail & Related papers (2024-10-09T06:13:48Z)
- STAA-Net: A Sparse and Transferable Adversarial Attack for Speech Emotion Recognition [36.73727306933382]
We propose a generator-based attack method to generate sparse and transferable adversarial examples to deceive SER models.
We evaluate our method on two widely used SER datasets, Database of Elicited Mood in Speech (DEMoS) and Interactive Emotional dyadic MOtion CAPture (IEMOCAP).
arXiv Detail & Related papers (2024-02-02T08:46:57Z)
- Revealing Emotional Clusters in Speaker Embeddings: A Contrastive Learning Strategy for Speech Emotion Recognition [27.098672790099304]
It has been assumed that emotion information is indirectly embedded within speaker embeddings, leading to their under-utilization.
Our study reveals a direct and useful link between emotion and state-of-the-art speaker embeddings in the form of intra-speaker clusters.
We introduce a novel contrastive pretraining approach applied to emotion-unlabeled data for speech emotion recognition.
arXiv Detail & Related papers (2024-01-19T20:31:53Z)
- Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
- Fake the Real: Backdoor Attack on Deep Speech Classification via Voice Conversion [14.264424889358208]
This work explores a backdoor attack that utilizes sample-specific triggers based on voice conversion.
Specifically, we adopt a pre-trained voice conversion model to generate the trigger, ensuring that the poisoned samples do not introduce any additional audible noise.
arXiv Detail & Related papers (2023-06-28T02:19:31Z)
- ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models [83.07390037152963]
ZET-Speech is a zero-shot adaptive emotion-controllable TTS model.
It allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label.
Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers.
arXiv Detail & Related papers (2023-05-23T08:52:00Z)
- Emotion Selectable End-to-End Text-based Speech Editing [63.346825713704625]
Emo-CampNet (emotion CampNet) is an emotion-selectable text-based speech editing model.
It can effectively control the emotion of the generated speech in the process of text-based speech editing.
It can also edit unseen speakers' speech.
arXiv Detail & Related papers (2022-12-20T12:02:40Z)
- Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning [70.30713251031052]
We propose a data-driven deep learning model, i.e. StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech.
Experiments show that the predicted emotion strength of the proposed StrengthNet is highly correlated with ground truth scores for both seen and unseen speech.
arXiv Detail & Related papers (2022-06-15T01:25:32Z)
- Emotion Intensity and its Control for Emotional Voice Conversion [77.05097999561298]
Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity.
In this paper, we aim to explicitly characterize and control the intensity of emotion.
We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding.
arXiv Detail & Related papers (2022-01-10T02:11:25Z)
- Speaker Attentive Speech Emotion Recognition [11.92436948211501]
The Speech Emotion Recognition (SER) task has seen significant improvements in recent years with the advent of Deep Neural Networks (DNNs).
We present novel work based on the idea of teaching the emotion recognition network about speaker identity.
arXiv Detail & Related papers (2021-04-15T07:59:37Z)
- Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training [91.95855310211176]
Emotional voice conversion aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity.
We propose a novel 2-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data.
The proposed framework can perform both spectrum and prosody conversion and achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluation.
arXiv Detail & Related papers (2021-03-31T04:56:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.