Detection of AI Synthesized Hindi Speech
- URL: http://arxiv.org/abs/2203.03706v1
- Date: Mon, 7 Mar 2022 21:13:54 GMT
- Title: Detection of AI Synthesized Hindi Speech
- Authors: Karan Bhatia (1), Ansh Agrawal (1), Priyanka Singh (1) and Arun Kumar
Singh (2) ((1) Dhirubhai Ambani Institute of Information and Communication
Technology, (2) Indian Institute of Technology Jammu)
- Abstract summary: We propose an approach for discriminating AI-synthesized Hindi speech from actual human speech.
We exploit the Bicoherence Phase, Bicoherence Magnitude, Mel Frequency Cepstral Coefficients (MFCC), Delta Cepstral, and Delta Square Cepstral as discriminating features for machine learning models.
We obtained an accuracy of 99.83% with VGG16 and 99.99% with a homemade CNN model.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in generative artificial speech models have made it
possible to produce highly realistic speech signals. At first it seems exciting to
obtain such artificially synthesized signals, for example speech clones or deepfakes,
but if left unchecked they may lead us toward a digital dystopia. One of the primary
concerns in audio forensics is validating the authenticity of speech. Although some
solutions have been proposed for English speech, the detection of synthetic Hindi
speech has not received much attention. Here, we propose an approach for
discriminating AI-synthesized Hindi speech from actual human speech. We exploit the
Bicoherence Phase, Bicoherence Magnitude, Mel Frequency Cepstral Coefficients (MFCC),
Delta Cepstral, and Delta Square Cepstral as discriminating features for machine
learning models. We also extend the study to deep neural networks for extensive
experiments, specifically VGG16 and a homemade CNN as the architecture models. We
obtained an accuracy of 99.83% with VGG16 and 99.99% with the homemade CNN.
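The abstract lists the hand-crafted features but not how they are computed. Below is a minimal sketch, in Python, of one plausible way to extract the MFCC, Delta, Delta-Square, and bicoherence magnitude/phase features; librosa and SciPy are assumed, and all helper names and parameter choices (sampling rate, number of coefficients, FFT segment length) are illustrative assumptions rather than the authors' settings.

```python
# Hedged sketch: cepstral and bicoherence features for synthetic-speech detection.
# Helper names and all parameters below are illustrative assumptions.
import numpy as np
import librosa
from scipy.signal import stft

def cepstral_features(path, sr=16000, n_mfcc=13):
    """MFCC plus first-order (Delta) and second-order (Delta-Square) cepstral features."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)             # Delta Cepstral
    delta2 = librosa.feature.delta(mfcc, order=2)   # Delta Square Cepstral
    return np.vstack([mfcc, delta, delta2])         # shape: (3 * n_mfcc, n_frames)

def bicoherence(y, sr=16000, nperseg=256):
    """Averaged bicoherence B(f1, f2); its magnitude and phase serve as features."""
    _, _, Z = stft(y, fs=sr, nperseg=nperseg)       # Z: (freq_bins, n_frames)
    n = Z.shape[0] // 2
    idx = np.arange(n)[:, None] + np.arange(n)[None, :]   # indices of f1 + f2
    num = np.zeros((n, n), dtype=complex)
    den1 = np.zeros((n, n))
    den2 = np.zeros((n, n))
    for frame in Z.T:
        x1 = frame[:n, None]                        # X(f1)
        x2 = frame[None, :n]                        # X(f2)
        x12 = np.conj(frame[idx])                   # X*(f1 + f2)
        num += x1 * x2 * x12
        den1 += np.abs(x1 * x2) ** 2
        den2 += np.abs(x12) ** 2
    b = num / np.sqrt(den1 * den2 + 1e-12)
    return np.abs(b), np.angle(b)                   # Bicoherence Magnitude, Phase

# Illustrative usage: build a feature vector for a classical ML classifier.
# y, _ = librosa.load("clip.wav", sr=16000)
# mag, phase = bicoherence(y)
# feats = np.concatenate([cepstral_features("clip.wav").mean(axis=1),
#                         mag.ravel(), phase.ravel()])
```

How these features are aggregated before classification, and what inputs the VGG16 and homemade CNN models receive, is not specified in the abstract, so the pooling above is only a placeholder.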
Related papers
- Moshi: a speech-text foundation model for real-time dialogue [78.88479749811376]
Current systems for spoken dialogue rely on pipelines of independent components, such as voice activity detection and text-to-speech.
We show how Moshi can provide streaming speech recognition and text-to-speech.
Our resulting model is the first real-time full-duplex spoken large language model.
arXiv Detail & Related papers (2024-09-17T17:55:39Z)
- Every Breath You Don't Take: Deepfake Speech Detection Using Breath [6.858439600092057]
Deepfake speech represents a real and growing threat to systems and society.
Many detectors have been created to aid in defense against speech deepfakes.
We hypothesize that breath, a higher-level part of speech, is a key component of natural speech, and that its improper generation in deepfake speech therefore serves as a performant discriminator.
arXiv Detail & Related papers (2024-04-23T15:48:51Z)
- SpeechAlign: Aligning Speech Generation to Human Preferences [51.684183257809075]
We introduce SpeechAlign, an iterative self-improvement strategy that aligns speech language models to human preferences.
We show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech language model.
arXiv Detail & Related papers (2024-04-08T15:21:17Z)
- Syn-Att: Synthetic Speech Attribution via Semi-Supervised Unknown Multi-Class Ensemble of CNNs [1.262949092134022]
A novel strategy is proposed to attribute a synthetic speech track to the generator used to synthesize it.
The proposed detector transforms the audio into a log-mel spectrogram, extracts features with a CNN, and classifies the track among five known algorithms or as unknown; a minimal sketch of this pipeline is given after the list below.
The method outperforms other top teams in accuracy by 12-13% on Eval 2 and 1-2% on Eval 1 in the IEEE SP Cup challenge at ICASSP 2022.
arXiv Detail & Related papers (2023-09-15T04:26:39Z)
- Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers [92.55131711064935]
We introduce a language modeling approach for text-to-speech synthesis (TTS).
Specifically, we train a neural language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model.
VALL-E exhibits in-context learning capabilities and can be used to synthesize high-quality personalized speech.
arXiv Detail & Related papers (2023-01-05T15:37:15Z)
- Deep Speech Based End-to-End Automated Speech Recognition (ASR) for Indian-English Accents [0.0]
We used a transfer learning approach to develop an end-to-end speech recognition system for Indian-English accents.
Indic TTS data of Indian-English accents is used for transfer learning and fine-tuning the pre-trained Deep Speech model.
arXiv Detail & Related papers (2022-04-03T03:11:21Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- Using Deep Learning Techniques and Inferential Speech Statistics for AI Synthesised Speech Recognition [0.0]
We propose a model that can help discriminate synthesized speech from actual human speech and also identify the source of such a synthesis.
The model outperforms state-of-the-art approaches, distinguishing AI-synthesized audio from real human speech with an error rate of 1.9% and detecting the underlying architecture with an accuracy of 97%.
arXiv Detail & Related papers (2021-07-23T18:43:10Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
- Detection of AI-Synthesized Speech Using Cepstral & Bispectral Statistics [0.0]
We propose an approach to distinguish human speech from AI synthesized speech.
Higher-order statistics show less correlation for human speech than for synthesized speech.
Cepstral analysis also revealed a durable power component in human speech that is missing from synthesized speech.
arXiv Detail & Related papers (2020-09-03T21:29:41Z)
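As referenced in the Syn-Att entry above, here is a minimal sketch of a log-mel-spectrogram CNN attributor in Python. PyTorch and librosa are assumed; the network depth, the six-way output (five known generators plus an "unknown" class), and all hyperparameters are illustrative assumptions rather than the authors' architecture.

```python
# Hedged sketch of a Syn-Att-style pipeline: log-mel spectrogram -> small CNN
# -> softmax over 5 known generation algorithms plus an "unknown" class.
# All layer sizes and parameters are illustrative assumptions.
import librosa
import torch
import torch.nn as nn

def log_mel(path, sr=16000, n_mels=64):
    """Convert an audio file into a log-mel spectrogram."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)                 # shape: (n_mels, n_frames)

class Attributor(nn.Module):
    def __init__(self, n_classes=6):                # 5 known algorithms + unknown
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),                # global pooling -> fixed-size vector
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                           # x: (batch, 1, n_mels, n_frames)
        h = self.features(x).flatten(1)
        return self.classifier(h)                   # logits over candidate generators

# Illustrative usage:
# spec = torch.from_numpy(log_mel("track.wav")).float()[None, None]  # add batch/channel dims
# predicted_generator = Attributor()(spec).argmax(dim=1)
```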