Transformer-Based Speech Synthesizer Attribution in an Open Set Scenario
- URL: http://arxiv.org/abs/2210.07546v1
- Date: Fri, 14 Oct 2022 05:55:21 GMT
- Title: Transformer-Based Speech Synthesizer Attribution in an Open Set Scenario
- Authors: Emily R. Bartusiak, Edward J. Delp
- Abstract summary: Speech synthesis methods can create realistic-sounding speech, which may be used for fraud, spoofing, and misinformation campaigns.
Forensic attribution methods identify the specific speech synthesis method used to create a speech signal.
We propose a speech attribution method that generalizes to new synthesizers not seen during training.
- Score: 16.93803259128475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech synthesis methods can create realistic-sounding speech, which may be
used for fraud, spoofing, and misinformation campaigns. Forensic methods that
detect synthesized speech are important for protection against such attacks.
Forensic attribution methods provide even more information about the nature of
synthesized speech signals because they identify the specific speech synthesis
method (i.e., speech synthesizer) used to create a speech signal. Due to the
increasing number of realistic-sounding speech synthesizers, we propose a
speech attribution method that generalizes to new synthesizers not seen during
training. To do so, we investigate speech synthesizer attribution in both a
closed set scenario and an open set scenario. In other words, we consider some
speech synthesizers to be "known" synthesizers (i.e., part of the closed set)
and others to be "unknown" synthesizers (i.e., part of the open set). We
represent speech signals as spectrograms and train our proposed method, known
as compact attribution transformer (CAT), on the closed set for multi-class
classification. Then, we extend our analysis to the open set to attribute
synthesized speech signals to both known and unknown synthesizers. We utilize a
t-distributed stochastic neighbor embedding (tSNE) on the latent space of the
trained CAT to differentiate between each unknown synthesizer. Additionally, we
explore poly-1 loss formulations to improve attribution results. Our proposed
approach successfully attributes synthesized speech signals to their respective
speech synthesizers in both closed and open set scenarios.
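The abstract mentions poly-1 loss formulations but does not state which variant or coefficient was used. As a rough illustration only, the sketch below assumes the common Poly-1 form, cross-entropy plus a term epsilon * (1 - p_t), where p_t is the predicted probability of the true class. The function names, the default epsilon, and the example logits are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def poly1_cross_entropy(logits, target, epsilon=1.0):
    """Poly-1 loss for one example (assumed form, not the authors' code).

    logits  : raw scores, one per known synthesizer class
    target  : index of the true synthesizer class
    epsilon : weight of the extra (1 - p_t) term; 0 recovers cross-entropy
    """
    p_t = softmax(logits)[target]
    cross_entropy = -math.log(p_t)
    return cross_entropy + epsilon * (1.0 - p_t)


# Hypothetical scores for a 4-class closed set, true class at index 2.
example_logits = [0.3, -1.2, 2.1, 0.0]
loss_ce = poly1_cross_entropy(example_logits, target=2, epsilon=0.0)
loss_poly1 = poly1_cross_entropy(example_logits, target=2, epsilon=1.0)
```

With a positive epsilon, the loss adds a penalty that shrinks as the model grows more confident in the true class, so `loss_poly1` is never smaller than `loss_ce` here.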
Related papers
- Robust AI-Synthesized Speech Detection Using Feature Decomposition Learning and Synthesizer Feature Augmentation [52.0893266767733]
We propose a robust deepfake speech detection method that employs feature decomposition to learn synthesizer-independent content features.
To enhance the model's robustness to different synthesizer characteristics, we propose a synthesizer feature augmentation strategy.
arXiv Detail & Related papers (2024-11-14T03:57:21Z)
- HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis [39.892633589217326]
Large language model (LLM)-based speech synthesis has been widely adopted in zero-shot speech synthesis.
This paper proposes HierSpeech++, a fast and strong zero-shot speech synthesizer for text-to-speech (TTS) and voice conversion (VC).
arXiv Detail & Related papers (2023-11-21T09:07:11Z)
- Syn-Att: Synthetic Speech Attribution via Semi-Supervised Unknown Multi-Class Ensemble of CNNs [1.262949092134022]
A novel strategy is proposed to attribute a synthetic speech track to the generator used to synthesize it.
The proposed detector transforms the audio into a log-mel spectrogram, extracts features using a CNN, and classifies it among five known algorithms and unknown algorithms.
The method outperforms other top teams in accuracy by 12-13% on Eval 2 and 1-2% on Eval 1, in the IEEE SP Cup challenge at ICASSP 2022.
arXiv Detail & Related papers (2023-09-15T04:26:39Z)
- Audio-visual video-to-speech synthesis with synthesized input audio [64.86087257004883]
We investigate the effect of using video and audio inputs for video-to-speech synthesis during both training and inference.
In particular, we use pre-trained video-to-speech models to synthesize the missing speech signals and then train an audio-visual-to-speech synthesis model, using both the silent video and the synthesized speech as inputs, to predict the final reconstructed speech.
arXiv Detail & Related papers (2023-07-31T11:39:05Z)
- Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z)
- DSVAE: Interpretable Disentangled Representation for Synthetic Speech Detection [25.451749986565375]
We propose the Disentangled Spectrogram Variational Auto Encoder (DSVAE) to generate interpretable representations of a speech signal for detecting synthetic speech.
Our experimental results show high accuracy (>98%) on detecting synthetic speech from 6 known and 10 out of 11 unknown speech synthesizers.
arXiv Detail & Related papers (2023-04-06T18:37:26Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
Recent Text-to-Speech architecture is designed for multiple code generation and monotonic alignment.
We show that the recent Text-to-Speech architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- Synthesized Speech Detection Using Convolutional Transformer-Based Spectrogram Analysis [16.93803259128475]
Synthesized speech can be used for nefarious purposes, including creating a purported speech signal and attributing it to someone who did not speak the content of the signal.
In this paper, we analyze speech signals in the form of spectrograms with a Compact Convolutional Transformer for synthesized speech detection.
arXiv Detail & Related papers (2022-05-03T22:05:35Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
- Speech Resynthesis from Discrete Disentangled Self-Supervised Representations [49.48053138928408]
We propose using self-supervised discrete representations for the task of speech resynthesis.
We extract low-bitrate representations for speech content, prosodic information, and speaker identity.
Using the obtained representations, we can get to a rate of 365 bits per second while providing better speech quality than the baseline methods.
arXiv Detail & Related papers (2021-04-01T09:20:33Z)