DiffSSD: A Diffusion-Based Dataset For Speech Forensics
- URL: http://arxiv.org/abs/2409.13049v2
- Date: Wed, 2 Oct 2024 13:04:02 GMT
- Title: DiffSSD: A Diffusion-Based Dataset For Speech Forensics
- Authors: Kratika Bhagtani, Amit Kumar Singh Yadav, Paolo Bestagini, Edward J. Delp
- Abstract summary: Diffusion-based speech generators are ubiquitous. These methods can generate very high-quality synthetic speech.
To counter such misuse, synthetic speech detectors have been developed.
Many of these detectors are trained on datasets which do not include diffusion-based synthesizers.
- Score: 15.919164272315227
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion-based speech generators are ubiquitous. These methods can generate very high-quality synthetic speech, and several recent incidents report their malicious use. To counter such misuse, synthetic speech detectors have been developed. Many of these detectors are trained on datasets that do not include diffusion-based synthesizers. In this paper, we demonstrate that existing detectors trained on one such dataset, ASVspoof2019, do not perform well in detecting synthetic speech from recent diffusion-based synthesizers. We propose the Diffusion-Based Synthetic Speech Dataset (DiffSSD), a dataset consisting of about 200 hours of labeled speech, including synthetic speech generated by 8 diffusion-based open-source and 2 commercial generators. We also examine the performance of existing synthetic speech detectors on DiffSSD in both closed-set and open-set scenarios. The results highlight the importance of this dataset in detecting synthetic speech generated by recent open-source and commercial speech generators.
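As a rough illustration of the closed-set versus open-set evaluation described in the abstract, the sketch below computes a detector's equal error rate (EER), the metric commonly used for synthetic speech detection. The detector scores are random placeholders standing in for a detector scoring speech from seen (closed-set) and unseen (open-set) generators; nothing here comes from DiffSSD itself.

```python
# Minimal sketch of closed-set vs. open-set detector evaluation.
# Scores and labels are synthetic placeholders; only the EER computation is standard.
import numpy as np

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: operating point where false-accept rate == false-reject rate.

    scores: higher means 'more likely synthetic'; labels: 1 = synthetic, 0 = real.
    """
    order = np.argsort(scores)[::-1]       # sweep thresholds from high to low
    labels = labels[order]
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    tp = np.cumsum(labels)                 # synthetic clips accepted so far
    fp = np.cumsum(1 - labels)             # real clips wrongly flagged so far
    frr = 1.0 - tp / n_pos                 # false-reject rate (missed synthetic)
    far = fp / n_neg                       # false-accept rate (flagged real)
    idx = np.argmin(np.abs(far - frr))     # point where the two rates cross
    return float((far[idx] + frr[idx]) / 2.0)

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
closed_set_scores = labels + 0.5 * rng.normal(size=1000)  # seen generators: cleaner scores
open_set_scores = labels + 1.5 * rng.normal(size=1000)    # unseen generators: noisier scores
print("closed-set EER:", equal_error_rate(closed_set_scores, labels))
print("open-set EER:  ", equal_error_rate(open_set_scores, labels))
```

In the open-set case the noisier scores yield a much higher EER, which is the kind of degradation the paper measures for detectors facing unseen diffusion-based generators.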
Related papers
- FairSSD: Understanding Bias in Synthetic Speech Detectors [15.548402598331275]
We examine bias in existing synthetic speech detectors to determine whether they unfairly target particular gender, age, and accent groups.
Experiments on 6 existing synthetic speech detectors using more than 0.9 million speech signals demonstrate that most detectors are biased with respect to gender, age, and accent.
arXiv Detail & Related papers (2024-04-17T01:53:03Z)
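A minimal sketch of the grouped error analysis that such a bias study performs: compute the detector's error rate separately per demographic group and compare. The group labels, predictions, and values below are illustrative assumptions, not FairSSD's actual data or protocol.

```python
# Hedged sketch of a grouped-fairness check: break a detector's error rate
# down by group label and look for large gaps between groups.
from collections import defaultdict

def per_group_error(preds, labels, groups):
    """Error rate of detector predictions, broken down by group label."""
    errors, totals = defaultdict(int), defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        totals[g] += 1
        errors[g] += int(p != y)
    return {g: errors[g] / totals[g] for g in totals}

preds  = [1, 0, 1, 1, 0, 0, 1, 0]     # detector decisions (1 = synthetic)
labels = [1, 0, 0, 1, 0, 1, 1, 0]     # ground truth
groups = ["female", "female", "male", "male",
          "female", "male", "female", "male"]
print(per_group_error(preds, labels, groups))
# A large gap between groups suggests a biased detector.
```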
- Compression Robust Synthetic Speech Detection Using Patched Spectrogram Transformer [22.538895728224386]
We propose the Patched Spectrogram Synthetic Speech Detection Transformer (PS3DT).
PS3DT is a synthetic speech detector that converts a time-domain speech signal to a mel-spectrogram and processes it in patches using a transformer neural network.
We evaluate the detection performance of PS3DT on the ASVspoof2019 dataset.
arXiv Detail & Related papers (2024-02-22T01:18:55Z)
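A loose PyTorch sketch of the patched-spectrogram idea summarized above: mel-spectrogram, non-overlapping patches, transformer encoder, binary real/synthetic head. The patch width, model size, and mean pooling are assumptions for illustration, not PS3DT's actual configuration.

```python
# Sketch only: mel-spectrogram -> patches -> transformer -> real/synthetic logits.
import torch
import torch.nn as nn
import torchaudio

class PatchedSpectrogramDetector(nn.Module):
    def __init__(self, n_mels=80, patch_width=10, d_model=256):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=n_mels)
        self.patch_width = patch_width
        self.embed = nn.Linear(n_mels * patch_width, d_model)   # flatten each patch into a token
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, 2)                       # real vs. synthetic

    def forward(self, waveform):                                # (batch, samples)
        spec = self.melspec(waveform)                           # (batch, n_mels, frames)
        frames = spec.shape[-1] - spec.shape[-1] % self.patch_width
        patches = spec[..., :frames].unfold(2, self.patch_width, self.patch_width)
        # (batch, n_mels, n_patches, patch_width) -> (batch, n_patches, n_mels*patch_width)
        patches = patches.permute(0, 2, 1, 3).flatten(2)
        tokens = self.encoder(self.embed(patches))
        return self.head(tokens.mean(dim=1))                    # pool over patches

logits = PatchedSpectrogramDetector()(torch.randn(2, 16000))    # two 1-second clips
print(logits.shape)                                             # torch.Size([2, 2])
```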
- Syn-Att: Synthetic Speech Attribution via Semi-Supervised Unknown Multi-Class Ensemble of CNNs [1.262949092134022]
A novel strategy is proposed to attribute a synthetic speech track to the generator used to synthesize it.
The proposed detector transforms the audio into a log-mel spectrogram, extracts features using a CNN, and classifies it among five known algorithms and an unknown class.
In the IEEE SP Cup challenge at ICASSP 2022, the method outperformed the other top teams in accuracy by 12-13% on Eval 2 and 1-2% on Eval 1.
arXiv Detail & Related papers (2023-09-15T04:26:39Z)
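A hedged sketch of this attribution recipe: CNN features over a log-mel spectrogram, a five-way classifier over known generators, and a max-softmax confidence threshold that routes low-confidence inputs to an "unknown" class. The architecture and threshold are placeholders, not the paper's.

```python
# Sketch only: log-mel spectrogram -> CNN -> known-generator logits,
# with confidence-based rejection to an 'unknown' class.
import torch
import torch.nn as nn

class AttributionCNN(nn.Module):
    def __init__(self, n_known=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_known)

    def forward(self, log_mel):                    # (batch, 1, n_mels, frames)
        return self.classifier(self.features(log_mel).flatten(1))

def attribute(model, log_mel, threshold=0.6):
    """Return per-clip generator index, or -1 ('unknown') when max softmax
    confidence falls below the (assumed) rejection threshold."""
    probs = torch.softmax(model(log_mel), dim=-1)
    conf, pred = probs.max(dim=-1)
    return torch.where(conf >= threshold, pred, torch.full_like(pred, -1))

model = AttributionCNN()
print(attribute(model, torch.randn(4, 1, 80, 100)))  # e.g. tensor([-1, 2, -1, 0])
```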
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- DSVAE: Interpretable Disentangled Representation for Synthetic Speech Detection [25.451749986565375]
We propose the Disentangled Spectrogram Variational Autoencoder (DSVAE) to generate interpretable representations of a speech signal for detecting synthetic speech.
Our experimental results show high accuracy (>98%) in detecting synthetic speech from 6 known and 10 out of 11 unknown speech synthesizers.
arXiv Detail & Related papers (2023-04-06T18:37:26Z)
- Transformer-Based Speech Synthesizer Attribution in an Open Set Scenario [16.93803259128475]
Speech synthesis methods can create realistic-sounding speech, which may be used for fraud, spoofing, and misinformation campaigns.
Forensic attribution methods identify the specific speech synthesis method used to create a speech signal.
We propose a speech attribution method that generalizes to new synthesizers not seen during training.
arXiv Detail & Related papers (2022-10-14T05:55:21Z)
- Building African Voices [125.92214914982753]
This paper focuses on speech synthesis for low-resourced African languages.
We create a set of general-purpose instructions on building speech synthesis systems with minimum technological resources.
We release the speech data, code, and trained voices for 12 African languages to support researchers and developers.
arXiv Detail & Related papers (2022-07-01T23:28:16Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
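A minimal sketch of the mask-and-predict decoding loop described above: start from fully masked target units, then repeatedly re-mask the least confident predictions and predict them again. The unit predictor here is a random-logit stand-in, and the linear re-masking schedule is an assumption.

```python
# Sketch of non-autoregressive mask-predict decoding over discrete units.
import torch

def mask_predict(model, length, iterations=4, mask_id=0):
    units = torch.full((length,), mask_id)              # start fully masked
    for step in range(iterations):
        logits = model(units)                           # (length, vocab_size)
        probs = torch.softmax(logits, dim=-1)
        confidence, units = probs.max(dim=-1)           # predict every position
        # re-mask a shrinking fraction of the least confident positions
        n_mask = int(length * (1 - (step + 1) / iterations))
        if n_mask == 0:
            break
        worst = confidence.topk(n_mask, largest=False).indices
        units[worst] = mask_id
    return units

dummy_model = lambda units: torch.randn(units.shape[0], 100)  # placeholder predictor
print(mask_predict(dummy_model, length=12))
```

Because every position is predicted in parallel rather than token by token, decoding takes a fixed, small number of passes, which is where the latency gain comes from.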
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
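A conceptual sketch of the target switch, assuming placeholder encoder outputs: each branch's contrastive target becomes the other branch's quantized representation, which pushes the network toward noise-invariant codes. The simple in-batch InfoNCE below stands in for wav2vec 2.0's actual negative sampling.

```python
# Sketch of wav2vec-Switch-style target swapping between an original/noisy pair.
import torch
import torch.nn.functional as F

def contrastive_loss(context, targets, temperature=0.1):
    """InfoNCE over time steps: each context vector should match its own
    quantized target, with the other time steps acting as negatives."""
    logits = context @ targets.T / temperature          # (T, T) similarity matrix
    return F.cross_entropy(logits, torch.arange(context.shape[0]))

T, D = 50, 64
context_orig, quant_orig = torch.randn(T, D), torch.randn(T, D)    # original branch
context_noisy, quant_noisy = torch.randn(T, D), torch.randn(T, D)  # noisy branch

# standard wav2vec 2.0-style losses, one per branch
loss_same = contrastive_loss(context_orig, quant_orig) + \
            contrastive_loss(context_noisy, quant_noisy)
# switched targets: original context predicts noisy codes and vice versa
loss_switched = contrastive_loss(context_orig, quant_noisy) + \
                contrastive_loss(context_noisy, quant_orig)
print((loss_same + loss_switched).item())
```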
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
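A rough sketch of such an encoder-decoder generator paired with a discriminator; all shapes and layers are illustrative assumptions rather than the paper's architecture.

```python
# Sketch only: video frames -> visual encoder -> recurrent decoder -> waveform,
# with a discriminator that would judge realism during adversarial training.
import torch
import torch.nn as nn

class VideoToSpeechGenerator(nn.Module):
    def __init__(self, samples_per_frame=640):          # e.g. 16 kHz audio at 25 fps
        super().__init__()
        self.encoder = nn.Sequential(                   # per-frame visual features
            nn.Conv3d(3, 16, kernel_size=(1, 5, 5), stride=(1, 2, 2)), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),
        )
        self.decoder = nn.GRU(16, 128, batch_first=True)
        self.to_audio = nn.Linear(128, samples_per_frame)

    def forward(self, video):                           # (batch, 3, frames, H, W)
        feats = self.encoder(video).squeeze(-1).squeeze(-1)   # (batch, 16, frames)
        out, _ = self.decoder(feats.transpose(1, 2))          # (batch, frames, 128)
        return self.to_audio(out).flatten(1)                  # (batch, frames * 640)

discriminator = nn.Sequential(nn.Linear(640 * 10, 256), nn.ReLU(), nn.Linear(256, 1))
waveform = VideoToSpeechGenerator()(torch.randn(2, 3, 10, 48, 48))  # 10 video frames
print(waveform.shape, discriminator(waveform).shape)
# Adversarial training would pit the generator against this discriminator.
```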