Detection of AI-Synthesized Speech Using Cepstral & Bispectral
Statistics
- URL: http://arxiv.org/abs/2009.01934v2
- Date: Sun, 11 Apr 2021 11:41:43 GMT
- Title: Detection of AI-Synthesized Speech Using Cepstral & Bispectral
Statistics
- Authors: Arun Kumar Singh (1), Priyanka Singh (2) ((1) Indian Institute of
Technology Jammu, (2) Dhirubhai Ambani Institute of Information and
Communication Technology)
- Abstract summary: We propose an approach to distinguish human speech from AI-synthesized speech.
Higher-order statistics show less correlation for human speech than for synthesized speech.
Cepstral analysis also reveals a durable power component in human speech that is missing from synthesized speech.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Digital technology has made once-unimaginable applications possible. While
readily available tools for easy editing and manipulation are exciting, they raise
alarming concerns: manipulated audio can propagate as speech clones, duplicates, or
deepfakes. Validating the authenticity of speech is one of the primary problems of
digital audio forensics. We propose an approach to distinguish human speech from
AI-synthesized speech exploiting bispectral and cepstral analysis. Higher-order
statistics show less correlation for human speech than for synthesized speech.
Cepstral analysis also reveals a durable power component in human speech that is
missing from synthesized speech. We integrate both analyses and propose a machine
learning model to detect AI-synthesized speech.
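A minimal sketch of the two feature families the abstract names is given below, assuming NumPy and scikit-learn; the frame length, the bicoherence summary statistics, and the SVM classifier are illustrative assumptions, not the authors' released implementation.

    import numpy as np
    from sklearn.svm import SVC

    def bicoherence_stats(x, frame_len=256, hop=128):
        # Bicoherence: the bispectrum E[X(f1) X(f2) X*(f1+f2)] normalized by its
        # magnitude bound, averaged over frames. Assumes x spans several frames.
        frames = [x[i:i + frame_len] for i in range(0, len(x) - frame_len, hop)]
        X = np.fft.fft(np.array(frames) * np.hanning(frame_len), axis=1)
        n = frame_len // 4                       # low-frequency region of the (f1, f2) plane
        f1 = np.arange(n)[:, None]
        f2 = np.arange(n)[None, :]
        trip = X[:, f1] * X[:, f2] * np.conj(X[:, f1 + f2])
        den = np.sqrt((np.abs(X[:, f1] * X[:, f2]) ** 2).mean(axis=0)
                      * (np.abs(X[:, f1 + f2]) ** 2).mean(axis=0)) + 1e-12
        bic = trip.mean(axis=0) / den
        # Summarize magnitude and phase of the bicoherence plane as four statistics.
        return [np.abs(bic).mean(), np.abs(bic).var(),
                np.angle(bic).mean(), np.angle(bic).var()]

    def cepstral_stats(x, frame_len=256, hop=128):
        # Real cepstrum per frame (inverse FFT of the log magnitude spectrum);
        # track the strongest non-zero-quefrency component as the "power" feature.
        peaks = []
        for i in range(0, len(x) - frame_len, hop):
            spec = np.abs(np.fft.fft(x[i:i + frame_len] * np.hanning(frame_len))) + 1e-12
            ceps = np.real(np.fft.ifft(np.log(spec)))
            peaks.append(ceps[1:frame_len // 2].max())
        peaks = np.array(peaks)
        return [peaks.mean(), peaks.var()]

    def features(x):
        return np.array(bicoherence_stats(x) + cepstral_stats(x))

    # Hypothetical usage: waves is a list of 1-D waveforms, labels is 0 (human) / 1 (synthesized).
    # clf = SVC(kernel="rbf").fit(np.stack([features(w) for w in waves]), labels)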
Related papers
- Syn-Att: Synthetic Speech Attribution via Semi-Supervised Unknown
Multi-Class Ensemble of CNNs [1.262949092134022]
A novel strategy is proposed to attribute a synthetic speech track to the generator used to synthesize it.
The proposed detector transforms the audio into a log-mel spectrogram, extracts features with a CNN, and classifies the track among five known and unknown algorithms.
In the IEEE SP Cup challenge at ICASSP 2022, the method outperforms other top teams in accuracy by 12-13% on Eval 2 and 1-2% on Eval 1.
arXiv Detail & Related papers (2023-09-15T04:26:39Z) - Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion [4.251500966181852]
This study uses real human speech from eight well-known figures and the same speech converted between speakers using Retrieval-based Voice Conversion.
It is found that the Extreme Gradient Boosting model achieves an average classification accuracy of 99.3% and can classify speech in real time, in around 0.004 milliseconds given one second of speech.
arXiv Detail & Related papers (2023-08-24T12:26:15Z) - EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech
Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z) - DSVAE: Interpretable Disentangled Representation for Synthetic Speech
Detection [25.451749986565375]
We propose the Disentangled Spectrogram Variational Autoencoder (DSVAE) to generate interpretable representations of a speech signal for detecting synthetic speech.
Our experimental results show high accuracy (>98%) on detecting synthetic speech from 6 known and 10 out of 11 unknown speech synthesizers.
arXiv Detail & Related papers (2023-04-06T18:37:26Z) - A Vector Quantized Approach for Text to Speech Synthesis on Real-World
Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed text-to-speech architecture is designed for multiple code generation and monotonic alignment.
We show that this architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z) - Simple and Effective Unsupervised Speech Synthesis [97.56065543192699]
We introduce the first unsupervised speech synthesis system based on a simple, yet effective recipe.
Using only unlabeled speech audio and unlabeled text as well as a lexicon, our method enables speech synthesis without the need for a human-labeled corpus.
arXiv Detail & Related papers (2022-04-06T00:19:13Z) - Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement
by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z) - Detection of AI Synthesized Hindi Speech [0.0]
We propose an approach to discriminate AI-synthesized Hindi speech from actual human speech.
We exploit the Bicoherence Phase, Bicoherence Magnitude, Mel Frequency Cepstral Coefficients (MFCC), Delta Cepstral, and Delta Square Cepstral coefficients as discriminating features for machine learning models; a brief feature-extraction sketch appears after this list.
We obtained an accuracy of 99.83% with VGG16 and 99.99% with custom CNN models.
arXiv Detail & Related papers (2022-03-07T21:13:54Z) - Textless Speech Emotion Conversion using Decomposed and Discrete
Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
arXiv Detail & Related papers (2021-11-14T18:16:42Z) - EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional
Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset including 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that require additional reference audio as input, our model can predict emotion labels from the input text alone and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset with an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z) - Speech Synthesis as Augmentation for Low-Resource ASR [7.2244067948447075]
Speech synthesis might hold the key to low-resource speech recognition.
Data augmentation techniques have become an essential part of modern speech recognition training.
Speech synthesis techniques have been rapidly getting closer to the goal of achieving human-like speech.
arXiv Detail & Related papers (2020-12-23T22:19:42Z)
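The sketch below illustrates the cepstral feature set named in the "Detection of AI Synthesized Hindi Speech" entry above (MFCC plus delta and delta-square cepstral coefficients), assuming librosa; the file path, sampling rate, and mean-pooling step are illustrative assumptions, and the bicoherence terms and CNN/VGG16 classifiers from that paper are not reproduced here.

    import numpy as np
    import librosa

    def mfcc_delta_features(path, sr=16000, n_mfcc=13):
        # Stack MFCC, delta-cepstral, and delta-square-cepstral coefficients for one clip.
        y, sr = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # (n_mfcc, frames)
        d1 = librosa.feature.delta(mfcc)                          # first-order differences
        d2 = librosa.feature.delta(mfcc, order=2)                 # second-order differences
        feats = np.vstack([mfcc, d1, d2])                         # (3 * n_mfcc, frames)
        # Mean-pool over time so a fixed-length vector can feed a conventional classifier.
        return feats.mean(axis=1)

Such fixed-length vectors can then be passed to any conventional classifier, in the same way as the bicoherence and cepstral statistics sketched after the abstract.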