Synthesized Speech Detection Using Convolutional Transformer-Based
Spectrogram Analysis
- URL: http://arxiv.org/abs/2205.01800v1
- Date: Tue, 3 May 2022 22:05:35 GMT
- Title: Synthesized Speech Detection Using Convolutional Transformer-Based
Spectrogram Analysis
- Authors: Emily R. Bartusiak, Edward J. Delp
- Abstract summary: Synthesized speech can be used for nefarious purposes, including creating a purported speech signal and attributing it to someone who did not speak the content of the signal.
In this paper, we analyze speech signals in the form of spectrograms with a Compact Convolutional Transformer for synthesized speech detection.
- Score: 16.93803259128475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synthesized speech is common today due to the prevalence of virtual
assistants, easy-to-use tools for generating and modifying speech signals, and
remote work practices. Synthesized speech can also be used for nefarious
purposes, including creating a purported speech signal and attributing it to
someone who did not speak the content of the signal. We need methods to detect
if a speech signal is synthesized. In this paper, we analyze speech signals in
the form of spectrograms with a Compact Convolutional Transformer (CCT) for
synthesized speech detection. A CCT utilizes a convolutional layer that
introduces inductive biases and shared weights into a network, allowing a
transformer architecture to perform well with fewer data samples used for
training. The CCT uses an attention mechanism to incorporate information from
all parts of a signal under analysis. Trained on both genuine human voice
signals and synthesized human voice signals, we demonstrate that our CCT
approach successfully differentiates between genuine and synthesized speech
signals.
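As a rough illustration of the pipeline described in the abstract, the sketch below builds a CCT-style classifier in PyTorch: a convolutional tokenizer supplies the inductive biases and shared weights, a transformer encoder attends over all spectrogram tokens, and an attention-based pooling feeds a two-way (genuine vs. synthesized) head. The layer sizes, the torchaudio mel-spectrogram front end, and the class labels are illustrative assumptions, not the authors' exact configuration.

import torch
import torch.nn as nn
import torchaudio


class CompactConvTransformer(nn.Module):
    """CCT-style classifier: convolutional tokenizer + transformer encoder."""

    def __init__(self, embed_dim=256, depth=4, heads=4, num_classes=2):
        super().__init__()
        # Convolutional tokenizer: introduces inductive biases and shared
        # weights, replacing a ViT-style patch embedding.
        self.tokenizer = nn.Sequential(
            nn.Conv2d(1, embed_dim, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        # Sequence pooling: attention-weighted average over all tokens, so
        # information from every part of the spectrogram reaches the head.
        self.attn_pool = nn.Linear(embed_dim, 1)
        self.head = nn.Linear(embed_dim, num_classes)  # genuine vs. synthesized

    def forward(self, spec):                         # spec: (B, 1, n_mels, time)
        tokens = self.tokenizer(spec)                # (B, C, H', W')
        tokens = tokens.flatten(2).transpose(1, 2)   # (B, H'*W', C)
        tokens = self.encoder(tokens)
        weights = torch.softmax(self.attn_pool(tokens), dim=1)
        pooled = (weights * tokens).sum(dim=1)
        return self.head(pooled)                     # class logits


# Usage: turn a 16 kHz waveform into a log-mel spectrogram and classify it.
waveform = torch.randn(1, 16000)                     # placeholder 1-second clip
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=128)
spec = torch.log1p(mel(waveform)).unsqueeze(1)       # (1, 1, 128, time)
logits = CompactConvTransformer()(spec)              # (1, 2)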
Related papers
- Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z)
- DSVAE: Interpretable Disentangled Representation for Synthetic Speech Detection [25.451749986565375]
We propose the Disentangled Spectrogram Variational Autoencoder (DSVAE) to generate interpretable representations of a speech signal for detecting synthetic speech.
Our experimental results show high accuracy (>98%) on detecting synthetic speech from 6 known and 10 out of 11 unknown speech synthesizers.
arXiv Detail & Related papers (2023-04-06T18:37:26Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The recent Text-to-Speech architecture is designed for multiple code generation and monotonic alignment.
We show that this architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- SPEAKER VGG CCT: Cross-corpus Speech Emotion Recognition with Speaker Embedding and Vision Transformers [0.0]
This paper develops a new learning solution for Speech Emotion Recognition.
It is based on Compact Convolutional Transformers (CCTs) combined with a speaker embedding.
Experiments have been performed on several benchmarks in a cross-corpus setting.
arXiv Detail & Related papers (2022-11-04T10:49:44Z)
- Transformer-Based Speech Synthesizer Attribution in an Open Set Scenario [16.93803259128475]
Speech synthesis methods can create realistic-sounding speech, which may be used for fraud, spoofing, and misinformation campaigns.
Forensic attribution methods identify the specific speech synthesis method used to create a speech signal.
We propose a speech attribution method that generalizes to new synthesizers not seen during training.
arXiv Detail & Related papers (2022-10-14T05:55:21Z)
- Inner speech recognition through electroencephalographic signals [2.578242050187029]
This work focuses on inner speech recognition starting from EEG signals.
Decoding the EEG into text should be understood as the classification of a limited number of words (commands).
Speech-related BCIs provide effective vocal communication strategies for controlling devices through speech commands interpreted from brain signals.
arXiv Detail & Related papers (2022-10-11T08:29:12Z)
- Acoustic To Articulatory Speech Inversion Using Multi-Resolution Spectro-Temporal Representations Of Speech Signals [5.743287315640403]
We train a feed-forward deep neural network to estimate articulatory trajectories of six tract variables.
Experiments achieved a correlation of 0.675 with ground-truth tract variables.
arXiv Detail & Related papers (2022-03-11T07:27:42Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence is predicted, each target speech signal can be re-synthesized by feeding the symbols to the synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Speech Resynthesis from Discrete Disentangled Self-Supervised Representations [49.48053138928408]
We propose using self-supervised discrete representations for the task of speech resynthesis.
We extract low-bitrate representations for speech content, prosodic information, and speaker identity.
Using the obtained representations, we achieve a rate of 365 bits per second while providing better speech quality than the baseline methods.
arXiv Detail & Related papers (2021-04-01T09:20:33Z)
- Discriminative Singular Spectrum Classifier with Applications on Bioacoustic Signal Recognition [67.4171845020675]
We present a bioacoustic signal classifier equipped with a discriminative mechanism to extract useful features for analysis and classification efficiently.
Unlike current bioacoustic recognition methods, which are task-oriented, the proposed model relies on transforming the input signals into vector subspaces.
The validity of the proposed method is verified using three challenging bioacoustic datasets containing anuran, bee, and mosquito species.
arXiv Detail & Related papers (2021-03-18T11:01:21Z)
- Vector-Quantized Timbre Representation [53.828476137089325]
This paper targets a more flexible synthesis of an individual timbre by learning an approximate decomposition of its spectral properties with a set of generative features.
We introduce an auto-encoder with a discrete latent space that is disentangled from loudness in order to learn a quantized representation of a given timbre distribution.
We detail results for translating audio between orchestral instruments and singing voice, as well as transfers from vocal imitations to instruments.
arXiv Detail & Related papers (2020-07-13T12:35:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.