Compression Robust Synthetic Speech Detection Using Patched Spectrogram Transformer
- URL: http://arxiv.org/abs/2402.14205v1
- Date: Thu, 22 Feb 2024 01:18:55 GMT
- Title: Compression Robust Synthetic Speech Detection Using Patched Spectrogram Transformer
- Authors: Amit Kumar Singh Yadav, Ziyue Xiang, Kratika Bhagtani, Paolo Bestagini, Stefano Tubaro, Edward J. Delp
- Abstract summary: We propose Patched Spectrogram Synthetic Speech Detection Transformer (PS3DT)
PS3DT is a synthetic speech detector that converts a time domain speech signal to a mel-spectrogram and processes it in patches using a transformer neural network.
We evaluate the detection performance of PS3DT on the ASVspoof2019 dataset.
- Score: 22.538895728224386
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many deep learning synthetic speech generation tools are readily
available. The use of synthetic speech has caused financial fraud,
impersonation of people, and the spread of misinformation. For this reason,
forensic methods that can detect synthetic speech have been proposed. Existing
methods often overfit on one dataset and their performance degrades
substantially in practical scenarios such as detecting synthetic speech shared
on social platforms. In this paper we propose Patched Spectrogram Synthetic
Speech Detection Transformer (PS3DT), a synthetic speech detector that
converts a time domain speech signal to a mel-spectrogram and processes it in
patches using a transformer neural network. We evaluate the detection
performance of PS3DT on the ASVspoof2019 dataset. Our experiments show that
PS3DT performs well on the ASVspoof2019 dataset compared to other approaches
that use spectrograms for synthetic speech detection. We also investigate the
generalization performance of PS3DT on the In-the-Wild dataset. PS3DT
generalizes better than several existing methods at detecting synthetic speech
from an out-of-distribution dataset. We also evaluate the robustness of PS3DT
in detecting telephone quality synthetic speech and synthetic speech shared on
social platforms (compressed speech). PS3DT is robust to compression and can
detect telephone quality synthetic speech better than several existing
methods.
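For intuition, below is a minimal sketch of the patch-based pipeline the abstract describes: compute a mel-spectrogram, embed non-overlapping patches, and classify with a transformer encoder. All hyperparameters (sample rate, FFT size, patch size, embedding width, depth) are illustrative assumptions rather than PS3DT's published configuration, and positional embeddings are omitted for brevity.

```python
# Sketch of a patched-spectrogram transformer detector (illustrative only).
import torch
import torch.nn as nn
import torchaudio

class PatchedSpectrogramDetector(nn.Module):
    def __init__(self, n_mels=80, patch=(16, 16), dim=256, depth=4, heads=4):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_fft=512, hop_length=160, n_mels=n_mels)
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        # Non-overlapping patches via a strided convolution, ViT-style.
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, 2)  # bona fide vs. synthetic logits

    def forward(self, waveform):  # waveform: (batch, samples)
        spec = self.to_db(self.melspec(waveform)).unsqueeze(1)      # (B,1,mel,T)
        tokens = self.patch_embed(spec).flatten(2).transpose(1, 2)  # (B,N,dim)
        tokens = torch.cat([self.cls.expand(len(tokens), -1, -1), tokens], 1)
        return self.head(self.encoder(tokens)[:, 0])  # classify via CLS token

logits = PatchedSpectrogramDetector()(torch.randn(2, 64000))  # 4 s at 16 kHz
```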
Related papers
- Robust AI-Synthesized Speech Detection Using Feature Decomposition Learning and Synthesizer Feature Augmentation [52.0893266767733]
We propose a robust deepfake speech detection method that employs feature decomposition to learn synthesizer-independent content features.
To enhance the model's robustness to different synthesizer characteristics, we propose a synthesizer feature augmentation strategy.
arXiv Detail & Related papers (2024-11-14T03:57:21Z)
- SONAR: A Synthetic AI-Audio Detection Framework and Benchmark [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark.
It aims to provide a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content.
It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based deepfake detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
- DiffSSD: A Diffusion-Based Dataset For Speech Forensics [15.919164272315227]
Diffusion-based speech generators are ubiquitous. These methods can generate very high quality synthetic speech.
To counter such misuse, synthetic speech detectors have been developed.
Many of these detectors are trained on datasets which do not include diffusion-based synthesizers.
arXiv Detail & Related papers (2024-09-19T18:55:13Z)
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- DSVAE: Interpretable Disentangled Representation for Synthetic Speech Detection [25.451749986565375]
We propose Disentangled Spectrogram Variational Auto-Encoder (DSVAE) to generate interpretable representations of a speech signal for detecting synthetic speech.
Our experimental results show high accuracy (>98%) on detecting synthetic speech from 6 known and 10 out of 11 unknown speech synthesizers.
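As a rough illustration of learning a spectrogram representation with a variational autoencoder: the skeleton below is generic, not DSVAE's actual disentanglement design, and all layer sizes are assumptions for illustration.

```python
# Generic spectrogram VAE skeleton (illustrative; not the DSVAE architecture).
import torch
import torch.nn as nn

class SpectrogramVAE(nn.Module):
    def __init__(self, n_mels=80, frames=100, latent=32):
        super().__init__()
        flat = n_mels * frames
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(flat, 512), nn.ReLU())
        self.to_mu = nn.Linear(512, latent)
        self.to_logvar = nn.Linear(512, latent)
        self.decoder = nn.Sequential(
            nn.Linear(latent, 512), nn.ReLU(),
            nn.Linear(512, flat), nn.Unflatten(1, (n_mels, frames)))

    def forward(self, spec):  # spec: (B, n_mels, frames)
        h = self.encoder(spec)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        return self.decoder(z), mu, logvar

recon, mu, logvar = SpectrogramVAE()(torch.randn(4, 80, 100))
# The latent mean mu can serve as a compact feature for a downstream
# synthetic-speech classifier.
```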
arXiv Detail & Related papers (2023-04-06T18:37:26Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed Text-to-Speech architecture is designed for multiple code generation and monotonic alignment.
We show that it outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- Transformer-Based Speech Synthesizer Attribution in an Open Set Scenario [16.93803259128475]
Speech synthesis methods can create realistic-sounding speech, which may be used for fraud, spoofing, and misinformation campaigns.
Forensic attribution methods identify the specific speech synthesis method used to create a speech signal.
We propose a speech attribution method that generalizes to new synthesizers not seen during training.
arXiv Detail & Related papers (2022-10-14T05:55:21Z)
- Deepfake audio detection by speaker verification [79.99653758293277]
We propose a new detection approach that leverages only the biometric characteristics of the speaker, with no reference to specific manipulations.
The proposed approach can be implemented based on off-the-shelf speaker verification tools.
We test several such solutions on three popular test sets, obtaining good performance, high generalization ability, and high robustness to audio impairment.
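A hedged sketch of how such an approach might be assembled from off-the-shelf speaker verification tools, here using SpeechBrain's pretrained ECAPA-TDNN verifier; the file names and decision logic are placeholders, not the paper's pipeline.

```python
# Deepfake detection via off-the-shelf speaker verification (illustrative):
# compare a test utterance against known-genuine reference audio of the
# claimed speaker and threshold the similarity score.
from speechbrain.pretrained import SpeakerRecognition

verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb")

score, prediction = verifier.verify_files(
    "reference_genuine.wav",  # known-authentic speech of the claimed speaker
    "suspect.wav")            # utterance under investigation

# A low similarity score suggests the suspect audio does not carry the
# claimed speaker's voice biometrics, i.e., it may be synthesized.
print(f"similarity={score.item():.3f}, same-speaker={bool(prediction)}")
```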
arXiv Detail & Related papers (2022-09-28T13:46:29Z)
- Synthesized Speech Detection Using Convolutional Transformer-Based Spectrogram Analysis [16.93803259128475]
Synthesized speech can be used for nefarious purposes, including creating a purported speech signal and attributing it to someone who did not speak the content of the signal.
In this paper, we analyze speech signals in the form of spectrograms with a Compact Convolutional Transformer for synthesized speech detection.
arXiv Detail & Related papers (2022-05-03T22:05:35Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Detection of AI-Synthesized Speech Using Cepstral & Bispectral Statistics [0.0]
We propose an approach to distinguish human speech from AI synthesized speech.
Higher-order statistics show less correlation for human speech than for synthesized speech.
Also, cepstral analysis reveals a durable power component in human speech that is missing in synthesized speech; a minimal cepstrum computation is sketched after this list.
arXiv Detail & Related papers (2020-09-03T21:29:41Z)
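As referenced above, here is a minimal sketch of the real cepstrum computation that cepstral analysis builds on; the paper's bispectral statistics and its specific power-component measurement are not reproduced here.

```python
# Real cepstrum: inverse Fourier transform of the log magnitude spectrum.
import numpy as np

def real_cepstrum(signal: np.ndarray) -> np.ndarray:
    spectrum = np.fft.fft(signal)
    log_magnitude = np.log(np.abs(spectrum) + 1e-12)  # epsilon avoids log(0)
    return np.fft.ifft(log_magnitude).real

# Example on a synthetic test tone; real usage would pass speech frames.
t = np.arange(16000) / 16000.0
frame = np.sin(2 * np.pi * 220 * t)
print(real_cepstrum(frame)[:8])
```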