Adaptive re-calibration of channel-wise features for Adversarial Audio
Classification
- URL: http://arxiv.org/abs/2210.11722v1
- Date: Fri, 21 Oct 2022 04:21:56 GMT
- Title: Adaptive re-calibration of channel-wise features for Adversarial Audio
Classification
- Authors: Vardhan Dongre, Abhinav Thimma Reddy, Nikhitha Reddeddy
- Abstract summary: We propose a recalibration of features using attention feature fusion for synthetic speech detection.
We compare its performance against different detection methods including End2End models and Resnet-based models.
We also demonstrate that the combination of Linear frequency cepstral coefficients (LFCC) and Mel frequency cepstral coefficients (MFCC) using the attentional feature fusion technique creates better input feature representations.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: DeepFake audio, unlike DeepFake images and videos, has been relatively less
explored from a detection perspective, and the existing solutions for synthetic
speech classification either use complex networks or don't generalize to the
different varieties of synthetic speech obtained using different generative
and optimization-based methods. Through this work, we propose a channel-wise
recalibration of features using attention feature fusion for synthetic speech
detection and compare its performance against different detection methods,
including End2End models and ResNet-based models, on synthetic speech generated
using text-to-speech and vocoder systems like WaveNet, WaveRNN, Tacotron, and
WaveGlow. We also experiment with Squeeze-and-Excitation (SE) blocks in our
ResNet models and find that the combination achieves better performance. In
addition to this analysis, we demonstrate that combining Linear frequency
cepstral coefficients (LFCC) and Mel frequency cepstral coefficients (MFCC)
using the attentional feature fusion technique creates better input feature
representations, which help even simpler models generalize well on synthetic
speech classification tasks. Our models (ResNet-based, using feature fusion)
were trained on the Fake or Real (FoR) dataset and achieved 95% test accuracy
on the FoR data, and an average of 90% accuracy on samples we generated using
different generative models after adapting this framework.
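The core idea above, squeeze-and-excitation-style channel recalibration used to fuse LFCC and MFCC feature maps, can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's exact fusion module: the bottleneck MLP (`w1`, `b1`, `w2`, `b2`) and the soft per-channel blend are illustrative assumptions.

```python
import numpy as np

def attentional_feature_fusion(lfcc, mfcc, w1, b1, w2, b2):
    """Fuse two cepstral feature maps of shape (channels, frames) with
    SE-style channel attention: squeeze (global average pooling over frames),
    excite (two-layer MLP with sigmoid), then a soft per-channel blend."""
    summed = lfcc + mfcc                                  # initial integration
    squeeze = summed.mean(axis=1)                         # (channels,) descriptor
    hidden = np.maximum(0.0, w1 @ squeeze + b1)           # ReLU bottleneck
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))   # sigmoid gates in (0, 1)
    weights = weights[:, None]                            # broadcast over frames
    return weights * lfcc + (1.0 - weights) * mfcc

# Toy example: 20 cepstral coefficients over 50 frames, bottleneck of 4.
rng = np.random.default_rng(0)
lfcc = rng.standard_normal((20, 50))
mfcc = rng.standard_normal((20, 50))
w1, b1 = rng.standard_normal((4, 20)) * 0.1, np.zeros(4)
w2, b2 = rng.standard_normal((20, 4)) * 0.1, np.zeros(20)
fused = attentional_feature_fusion(lfcc, mfcc, w1, b1, w2, b2)
print(fused.shape)  # (20, 50)
```

Because the gates lie in (0, 1), each channel of the fused map is a convex combination of the corresponding LFCC and MFCC channels, which is what lets the network weight one representation over the other per channel.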
Related papers
- SiFiSinger: A High-Fidelity End-to-End Singing Voice Synthesizer based on Source-filter Model [31.280358048556444]
This paper presents an advanced end-to-end singing voice synthesis (SVS) system based on the source-filter mechanism.
The proposed system also incorporates elements like the fundamental pitch (F0) predictor and waveform generation decoder.
Experiments on the Opencpop dataset demonstrate efficacy of the proposed model in intonation quality and accuracy.
arXiv Detail & Related papers (2024-10-16T13:18:45Z)
- SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis [0.0]
We introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN.
We show the merits of our proposed model for speech and music synthesis on several datasets.
arXiv Detail & Related papers (2024-01-30T09:17:57Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generate audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
arXiv Detail & Related papers (2023-06-09T07:02:43Z)
- Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use a pre-trained wav2vec model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
arXiv Detail & Related papers (2022-08-20T06:46:55Z)
- Decision Forest Based EMG Signal Classification with Low Volume Dataset Augmented with Random Variance Gaussian Noise [51.76329821186873]
We produce a model that classifies six different hand gestures from a limited number of samples and generalizes well to a wider audience.
We rely on a set of more elementary methods, such as random bounds on a signal, and aim to show the power these methods can carry in an online setting.
arXiv Detail & Related papers (2022-06-29T23:22:18Z)
- SoundCLR: Contrastive Learning of Representations For Improved Environmental Sound Classification [0.6767885381740952]
SoundCLR is a supervised contrastive learning method for effective environment sound classification with state-of-the-art performance.
Due to the comparatively small sizes of the available environmental sound datasets, we propose and exploit a transfer learning and strong data augmentation pipeline.
Our experiments show that our masking based augmentation technique on the log-mel spectrograms can significantly improve the recognition performance.
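The masking-based augmentation on log-mel spectrograms mentioned for SoundCLR is in the spirit of SpecAugment: zeroing out random frequency bands and time spans. A minimal numpy sketch, where the mask counts and maximum widths are illustrative defaults rather than the paper's settings:

```python
import numpy as np

def mask_spectrogram(spec, num_freq_masks=1, num_time_masks=1,
                     max_freq_width=8, max_time_width=16, rng=None):
    """Return a copy of a log-mel spectrogram (mel_bins, frames) with
    random frequency bands and time spans set to zero."""
    rng = rng or np.random.default_rng()
    out = spec.copy()
    bins, frames = out.shape
    for _ in range(num_freq_masks):
        width = int(rng.integers(1, max_freq_width + 1))
        start = int(rng.integers(0, max(1, bins - width)))
        out[start:start + width, :] = 0.0          # mask a frequency band
    for _ in range(num_time_masks):
        width = int(rng.integers(1, max_time_width + 1))
        start = int(rng.integers(0, max(1, frames - width)))
        out[:, start:start + width] = 0.0          # mask a time span
    return out

spec = np.random.default_rng(1).standard_normal((64, 200))
augmented = mask_spectrogram(spec, rng=np.random.default_rng(2))
print(augmented.shape)  # (64, 200)
```

Masking operates on a copy, so the original spectrogram can be reused to produce many distinct augmented views of the same clip.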
arXiv Detail & Related papers (2021-03-02T18:42:45Z)
- End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented transformer (Conformer).
In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
arXiv Detail & Related papers (2021-02-12T18:00:08Z)
- Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis [25.234945748885348]
We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs.
The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop.
Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system.
arXiv Detail & Related papers (2020-11-06T19:30:07Z)
- Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z)
- Audio-Visual Decision Fusion for WFST-based and seq2seq Models [3.2771898634434997]
Under noisy conditions, speech recognition systems suffer from high Word Error Rates (WER).
We propose novel methods to fuse information from audio and visual modalities at inference time.
We show that our methods give significant improvements over acoustic-only WER.
arXiv Detail & Related papers (2020-01-29T13:45:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.