Synthetic Voice Detection and Audio Splicing Detection using
SE-Res2Net-Conformer Architecture
- URL: http://arxiv.org/abs/2210.03581v1
- Date: Fri, 7 Oct 2022 14:30:13 GMT
- Title: Synthetic Voice Detection and Audio Splicing Detection using
SE-Res2Net-Conformer Architecture
- Authors: Lei Wang, Benedict Yeoh, Jun Wah Ng
- Abstract summary: This paper extends the existing Res2Net by incorporating the recent Conformer block to further exploit local patterns in acoustic features.
Experimental results on the ASVspoof 2019 database show that the proposed SE-Res2Net-Conformer architecture improves spoofing countermeasure performance.
This paper also proposes to re-formulate the existing audio splicing detection problem.
- Score: 2.9805017559176883
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synthetic voices and spliced audio clips are generated to spoof
Internet users and artificial intelligence (AI) technologies such as voice
authentication. Existing research treats spoofing countermeasures as a
binary classification problem: bonafide vs. spoof. This paper extends the
existing Res2Net by incorporating the recent Conformer block to further exploit
local patterns in acoustic features. Experimental results on the ASVspoof 2019
database show that the proposed SE-Res2Net-Conformer architecture improves
spoofing countermeasure performance in the logical access scenario.
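The abstract does not spell out the architecture; the following is a minimal PyTorch sketch of the kind of stack it names, assuming a squeeze-excitation Res2Net block over spectrogram features feeding a Conformer encoder (here torchaudio's implementation). All channel counts, depths, and the exact wiring are illustrative guesses, not the authors' configuration.

```python
# Minimal sketch (not the authors' code): an SE-Res2Net block over a
# spectrogram, pooled into a frame sequence for a Conformer encoder.
import torch
import torch.nn as nn
import torchaudio

class SEBlock(nn.Module):
    """Squeeze-and-excitation: reweight channels by global context."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                        # x: (B, C, F, T)
        w = self.fc(x.mean(dim=(2, 3)))          # squeeze to (B, C)
        return x * w[:, :, None, None]           # excite

class SERes2NetBlock(nn.Module):
    """Res2Net-style block: split channels into `scale` groups whose
    3x3 convs are chained for a multi-scale receptive field, then SE
    reweighting and a residual connection."""
    def __init__(self, channels=64, scale=4):
        super().__init__()
        self.scale, w = scale, channels // scale
        self.convs = nn.ModuleList(
            [nn.Conv2d(w, w, 3, padding=1) for _ in range(scale - 1)])
        self.se = SEBlock(channels)

    def forward(self, x):
        xs = torch.chunk(x, self.scale, dim=1)
        ys = [xs[0]]                             # first group passes through
        for i, conv in enumerate(self.convs):
            inp = xs[i + 1] if i == 0 else xs[i + 1] + ys[-1]
            ys.append(torch.relu(conv(inp)))
        return torch.relu(self.se(torch.cat(ys, dim=1)) + x)

class CountermeasureNet(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.stem = nn.Conv2d(1, channels, 3, padding=1)
        self.res2net = SERes2NetBlock(channels)
        self.pool = nn.AdaptiveAvgPool2d((1, None))   # collapse freq axis
        self.conformer = torchaudio.models.Conformer(
            input_dim=channels, num_heads=4, ffn_dim=128,
            num_layers=2, depthwise_conv_kernel_size=31)
        self.head = nn.Linear(channels, 2)            # bonafide vs. spoof

    def forward(self, spec):                     # spec: (B, 1, F, T)
        h = self.res2net(torch.relu(self.stem(spec)))
        h = self.pool(h).squeeze(2).transpose(1, 2)   # (B, T, C)
        lengths = torch.full((h.size(0),), h.size(1))
        h, _ = self.conformer(h, lengths)
        return self.head(h.mean(dim=1))          # utterance-level logits

logits = CountermeasureNet()(torch.randn(2, 1, 64, 200))
print(logits.shape)                              # torch.Size([2, 2])
```

Here a (batch, 1, n_mels, frames) spectrogram yields utterance-level bonafide/spoof logits; the published model presumably stacks several such blocks with task-specific pooling.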
In addition, this paper re-formulates the existing audio splicing detection
problem: instead of identifying the complete spliced segments, it is more
useful to detect the boundaries of the spliced segments. Moreover, the problem
can then be solved with a deep learning approach, in contrast to the previous
signal-processing techniques.
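To make the re-formulation concrete: a boundary-detection target can be built from segment annotations in a few lines. The helper below is a hypothetical illustration only; the `tolerance` window and the frame-level annotation format are assumptions, not the paper's protocol.

```python
import numpy as np

def boundary_targets(num_frames, segments, tolerance=2):
    """Per-frame boundary labels for splicing detection.

    `segments` lists (start, end) frame indices of spliced regions.
    Frames within `tolerance` frames of a splice boundary get label 1;
    everything else is 0, so the model learns to flag splice points
    rather than the whole inserted region.
    """
    y = np.zeros(num_frames, dtype=np.int64)
    for start, end in segments:
        for b in (start, end):
            lo, hi = max(0, b - tolerance), min(num_frames, b + tolerance + 1)
            y[lo:hi] = 1
    return y

# A 100-frame clip with one segment spliced in at frames 40-60:
# only frames near 40 and near 60 are positive.
print(boundary_targets(100, [(40, 60)]).nonzero()[0])
```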
Related papers
- Robust Wake-Up Word Detection by Two-stage Multi-resolution Ensembles [48.208214762257136]
The approach employs two models: a lightweight on-device model for real-time processing of the audio stream and a server-side verification model.
To protect privacy, audio features rather than raw audio are sent to the cloud.
arXiv Detail & Related papers (2023-10-17T16:22:18Z)
- TranssionADD: A multi-frame reinforcement based sequence tagging model for audio deepfake detection [11.27584658526063]
The second Audio Deepfake Detection Challenge (ADD 2023) aims to detect and analyze deepfake speech utterances.
We propose our novel TranssionADD system as a solution to the challenging problem of model robustness and audio segment outliers.
Our best submission achieved 2nd place in Track 2, demonstrating the effectiveness and robustness of our proposed system.
arXiv Detail & Related papers (2023-06-27T05:18:25Z)
- NPVForensics: Jointing Non-critical Phonemes and Visemes for Deepfake Detection [50.33525966541906]
Existing multimodal detection methods capture audio-visual inconsistencies to expose Deepfake videos.
We propose a novel Deepfake detection method to mine the correlation between Non-critical Phonemes and Visemes, termed NPVForensics.
Our model can be easily adapted to the downstream Deepfake datasets with fine-tuning.
arXiv Detail & Related papers (2023-06-12T06:06:05Z)
- ConvNext Based Neural Network for Anti-Spoofing [6.047242590232868]
Automatic speaker verification (ASV) is widely used in real life for identity authentication.
With the rapid development of speech conversion and speech synthesis algorithms, and with improvements in recording-device quality, ASV systems are vulnerable to spoofing attacks.
arXiv Detail & Related papers (2022-09-14T05:53:37Z)
- Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use a pre-trained wav2vec model to obtain a high-level representation of the speech (a feature-extraction sketch follows this list).
For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
arXiv Detail & Related papers (2022-08-20T06:46:55Z)
- Towards Unconstrained Audio Splicing Detection and Localization with Neural Networks [6.570712059945705]
Convincing forgeries can be created by combining various speech samples from the same person.
Most existing detection algorithms for audio splicing use handcrafted features and make specific assumptions.
We propose a Transformer sequence-to-sequence (seq2seq) network for splicing detection and localization.
arXiv Detail & Related papers (2022-07-29T13:57:16Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For Disordered Speech Recognition [57.15942628305797]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems for normal speech.
This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training.
The model is then cross-domain adapted to the 102.7-hour UASpeech corpus to produce articulatory features.
arXiv Detail & Related papers (2022-03-19T08:47:18Z)
- Attention Driven Fusion for Multi-Modal Emotion Recognition [39.295892047505816]
We present a deep learning-based approach to exploit and fuse text and acoustic data for emotion classification.
We use a SincNet layer, based on parameterized sinc functions with band-pass filters, to extract acoustic features from raw audio, followed by a DCNN (a sinc-filter sketch follows this list).
For text processing, we use two branches in parallel (a DCNN, and a bidirectional RNN followed by a DCNN), where cross attention is introduced to infer N-gram-level correlations.
arXiv Detail & Related papers (2020-09-23T08:07:58Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved strong performance under controlled conditions.
Speaker verification on short utterances in uncontrolled, noisy environments is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
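For the "Fully Automated End-to-End Fake Audio Detection" entry above, the wav2vec feature-extraction step can be sketched with torchaudio's pretrained bundles. This is a generic sketch of that step only (the light-DARTS search is not shown), and the file path is illustrative.

```python
import torch
import torchaudio

# Load a pretrained wav2vec 2.0 model and extract frame-level features.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("utterance.wav")   # path is illustrative
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    features, _ = model.extract_features(waveform)
# `features` is a list of per-layer tensors of shape (1, frames, 768);
# a downstream classifier (e.g. the searched light-DARTS cell) would
# consume one of these high-level representations.
```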
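For the SincNet layer in the "Attention Driven Fusion" entry, the core idea is a band-pass FIR filter built as the difference of two windowed sinc low-pass kernels, with only the two cutoff frequencies learned. A minimal sketch with arbitrary kernel length, window, and cutoffs (in the real layer, f1 and f2 would be nn.Parameters):

```python
import torch

def sinc_bandpass(f1, f2, kernel_size=251, sample_rate=16000):
    """Band-pass FIR kernel: difference of two windowed sinc low-passes.
    f1, f2: low/high cutoffs in Hz (learnable in an actual SincNet layer)."""
    n = torch.arange(kernel_size) - (kernel_size - 1) / 2
    t = n / sample_rate
    def lowpass(fc):
        h = 2 * fc * torch.sinc(2 * fc * t)  # torch.sinc = sin(pi x)/(pi x)
        return h / sample_rate               # unity DC gain
    window = torch.hamming_window(kernel_size, periodic=False)
    return (lowpass(torch.tensor(f2)) - lowpass(torch.tensor(f1))) * window

# Convolve raw audio with a small bank of such filters.
bank = torch.stack([sinc_bandpass(f, f + 400.0) for f in (100.0, 500.0, 900.0)])
audio = torch.randn(1, 1, 16000)             # 1 s of dummy raw audio
feats = torch.nn.functional.conv1d(audio, bank.unsqueeze(1), padding=125)
print(feats.shape)                           # (1, 3, 16000)
```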
This list is automatically generated from the titles and abstracts of the papers on this site.