RW-Resnet: A Novel Speech Anti-Spoofing Model Using Raw Waveform
- URL: http://arxiv.org/abs/2108.05684v2
- Date: Fri, 13 Aug 2021 01:56:02 GMT
- Title: RW-Resnet: A Novel Speech Anti-Spoofing Model Using Raw Waveform
- Authors: Youxuan Ma, Zongze Ren, Shugong Xu
- Abstract summary: We propose a new speech anti-spoofing model named ResWavegram-Resnet (RW-Resnet).
The RW-Resnet achieves better performance than other state-of-the-art anti-spoofing models.
- Score: 12.75508520935682
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, synthetic speech generated by advanced text-to-speech (TTS)
and voice conversion (VC) systems has caused great harm to automatic speaker
verification (ASV) systems, urging us to design a synthetic speech detection
system to protect ASV systems. In this paper, we propose a new speech
anti-spoofing model named ResWavegram-Resnet (RW-Resnet). The model contains
two parts, Conv1D Resblocks and backbone Resnet34. The Conv1D Resblock is based
on the Conv1D block with a residual connection. For the first part, we use the
raw waveform as input and feed it to the stacked Conv1D Resblocks to get the
ResWavegram. Compared with traditional methods, ResWavegram keeps all the
information from the audio signal and has a stronger ability to extract
features. For the second part, the extracted features are fed to the backbone
Resnet34 for the spoofed or bonafide decision. The ASVspoof2019 logical access
(LA) corpus is used to evaluate our proposed RW-Resnet. Experimental results
show that the RW-Resnet achieves better performance than other state-of-the-art
anti-spoofing models, which illustrates its effectiveness in detecting
synthetic speech attacks.
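For illustration, the two-part structure can be sketched in PyTorch as below. This is a minimal sketch under our own assumptions: the kernel sizes, strides, channel counts, and number of stacked Conv1D Resblocks are illustrative, torchvision's resnet34 stands in for the backbone, and only the overall raw-waveform -> ResWavegram -> ResNet34 -> spoofed/bonafide flow is taken from the abstract.

```python
# Minimal sketch of the two-part structure described in the abstract.
# Kernel sizes, strides, channel counts, and the number of stacked Conv1D
# Resblocks are illustrative assumptions, not the authors' settings.
import torch
import torch.nn as nn
from torchvision.models import resnet34  # stand-in for the backbone


class Conv1DResblock(nn.Module):
    """A Conv1D block with a residual (skip) connection."""

    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(in_channels, out_channels, kernel_size=3,
                      stride=stride, padding=1),
            nn.BatchNorm1d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv1d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(out_channels),
        )
        # 1x1 convolution so the residual path matches the main path's shape.
        self.shortcut = (
            nn.Identity()
            if in_channels == out_channels and stride == 1
            else nn.Conv1d(in_channels, out_channels, kernel_size=1, stride=stride)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.body(x) + self.shortcut(x))


class RWResnetSketch(nn.Module):
    """Raw waveform -> stacked Conv1D Resblocks (ResWavegram) -> ResNet34 -> 2 classes."""

    def __init__(self):
        super().__init__()
        # Part 1: stacked Conv1D Resblocks turn the one-channel waveform into a
        # 2-D time-frequency-like feature map (the "ResWavegram").
        self.reswavegram = nn.Sequential(
            Conv1DResblock(1, 64, stride=4),
            Conv1DResblock(64, 128, stride=4),
            Conv1DResblock(128, 128, stride=4),
        )
        # Part 2: a ResNet34 backbone adapted to single-channel input and a
        # two-class (spoofed vs. bonafide) output.
        backbone = resnet34(weights=None)
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        backbone.fc = nn.Linear(backbone.fc.in_features, 2)
        self.backbone = backbone

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        feats = self.reswavegram(waveform)   # (batch, channels, frames)
        feats = feats.unsqueeze(1)           # treat the map as a 1-channel "image"
        return self.backbone(feats)          # (batch, 2) logits


if __name__ == "__main__":
    model = RWResnetSketch()
    logits = model(torch.randn(2, 1, 64600))  # two waveforms of ~4 s at 16 kHz
    print(logits.shape)                       # torch.Size([2, 2])
```

Treating the stacked Conv1D output as a single-channel 2-D map is what allows a standard image-style ResNet34 to consume the extracted features for the spoofed-or-bonafide decision.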
Related papers
- Comparative Analysis of the wav2vec 2.0 Feature Extractor [42.18541127866435]
We study the capability of the wav2vec 2.0 feature extractor (FE) to replace the standard feature extraction methods in a connectionist temporal classification (CTC) ASR model.
We show that both are competitive with traditional FEs on the LibriSpeech benchmark and analyze the effect of the individual components.
arXiv Detail & Related papers (2023-08-08T14:29:35Z)
- Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations [51.89856133895233]
Speech restoration (SR) is a task of converting degraded speech signals into high-quality ones.
In this study, we propose a robust SR model called Miipher, and apply Miipher to a new SR application.
To make our SR model robust against various forms of degradation, we use (i) a speech representation extracted from w2v-BERT as the input feature, and (ii) a text representation extracted from transcripts via PnG-BERT as a linguistic conditioning feature.
arXiv Detail & Related papers (2023-03-03T01:57:16Z)
- Synthetic Voice Detection and Audio Splicing Detection using SE-Res2Net-Conformer Architecture [2.9805017559176883]
This paper extends the existing Res2Net by incorporating the recent Conformer block to further exploit local patterns in acoustic features.
Experimental results on the ASVspoof 2019 database show that the proposed SE-Res2Net-Conformer architecture is able to improve spoofing countermeasure performance.
This paper also proposes to re-formulate the existing audio splicing detection problem.
arXiv Detail & Related papers (2022-10-07T14:30:13Z)
- ConvNext Based Neural Network for Anti-Spoofing [6.047242590232868]
Automatic speaker verification (ASV) has been widely used in real life for identity authentication.
With the rapid development of speech conversion and speech synthesis algorithms and the improving quality of recording devices, ASV systems are vulnerable to spoofing attacks.
arXiv Detail & Related papers (2022-09-14T05:53:37Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- Anti-Spoofing Using Transfer Learning with Variational Information Bottleneck [6.918364447822298]
We propose a transfer learning scheme based on the wav2vec 2.0 pretrained model with a variational information bottleneck for the speech anti-spoofing task.
Our method improves the performance of distinguishing unseen spoofed and genuine speech, outperforming current state-of-the-art anti-spoofing systems.
arXiv Detail & Related papers (2022-04-04T11:08:21Z)
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS which retains high speech quality while achieving high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
It is also 28% faster than WaveGAN in synthesis on a single core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z)
- Spotting adversarial samples for speaker verification by neural vocoders [102.1486475058963]
We adopt neural vocoders to spot adversarial samples for automatic speaker verification (ASV).
We find that the difference between the ASV scores for the original and re-synthesized audio is a good indicator for discriminating between genuine and adversarial samples.
Our code will be made open-source for future work to compare against.
arXiv Detail & Related papers (2021-07-01T08:58:16Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
- Replay and Synthetic Speech Detection with Res2net Architecture [85.20912636149552]
Existing approaches for replay and synthetic speech detection still lack generalizability to unseen spoofing attacks.
This work proposes to leverage a novel model structure, Res2Net, to improve the generalizability of anti-spoofing countermeasures.
arXiv Detail & Related papers (2020-10-28T14:33:42Z)