TranssionADD: A multi-frame reinforcement based sequence tagging model
for audio deepfake detection
- URL: http://arxiv.org/abs/2306.15212v1
- Date: Tue, 27 Jun 2023 05:18:25 GMT
- Title: TranssionADD: A multi-frame reinforcement based sequence tagging model
for audio deepfake detection
- Authors: Jie Liu and Zhiba Su and Hui Huang and Caiyan Wan and Quanxiu Wang and
Jiangli Hong and Benlai Tang and Fengjie Zhu
- Abstract summary: The second Audio Deepfake Detection Challenge (ADD 2023) aims to detect and analyze deepfake speech utterances.
We propose our novel TranssionADD system as a solution to the challenging problem of model robustness and audio segment outliers.
Our best submission achieved 2nd place in Track 2, demonstrating the effectiveness and robustness of our proposed system.
- Score: 11.27584658526063
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Thanks to recent advancements in end-to-end speech modeling technology, it
has become increasingly feasible to imitate and clone a user's voice. This
leads to a significant challenge in differentiating between authentic and
fabricated audio segments. To address the issue of user voice abuse and misuse,
the second Audio Deepfake Detection Challenge (ADD 2023) aims to detect and
analyze deepfake speech utterances. Specifically, Track 2, named the
Manipulation Region Location (RL), aims to pinpoint the location of manipulated
regions in audio, which can be present in both real and generated audio
segments. We propose our novel TranssionADD system as a solution to the
challenging problems of model robustness and audio segment outliers in this track of the
competition. Our system provides three unique contributions: 1) we adapt the
sequence tagging task to audio deepfake detection; 2) we improve model
generalization with various data augmentation techniques; 3) we incorporate a
multi-frame detection (MFD) module to overcome the limited representation provided
by a single frame and use an isolated-frame penalty (IFP) loss to handle outliers
in segments. Our best submission achieved 2nd place in Track 2, demonstrating
the effectiveness and robustness of our proposed system.
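The abstract frames manipulation region location as per-frame sequence tagging, strengthened by a multi-frame detection (MFD) module and an isolated-frame penalty (IFP) loss, but it does not spell out the architecture or the loss formula. The PyTorch sketch below is only an illustration under assumed choices: a BiLSTM frame encoder, a 1-D convolution as a stand-in for multi-frame context, and a neighbour-smoothness term as a stand-in for the isolated-frame penalty. None of the layer sizes, module names, or loss weights come from the paper.

```python
# Hypothetical sketch of frame-level sequence tagging for manipulation
# region location. Layer sizes, names, and loss weights are illustrative
# assumptions, not values taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrameTagger(nn.Module):
    """Tags each acoustic frame as real (0) or fake (1)."""

    def __init__(self, feat_dim=80, hidden=128, context=5):
        super().__init__()
        # Frame encoder: a small BiLSTM stands in for whatever backbone
        # the paper actually uses.
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        # Multi-frame context: a 1-D convolution so that each per-frame
        # decision sees several neighbouring frames, not just one.
        self.mfd = nn.Conv1d(2 * hidden, 2 * hidden,
                             kernel_size=context, padding=context // 2)
        self.classifier = nn.Linear(2 * hidden, 2)

    def forward(self, feats):          # feats: (batch, frames, feat_dim)
        h, _ = self.encoder(feats)     # (batch, frames, 2*hidden)
        h = self.mfd(h.transpose(1, 2)).transpose(1, 2)
        return self.classifier(h)      # per-frame logits (batch, frames, 2)


def isolated_frame_penalty(logits):
    """Assumed form of an isolated-frame penalty: discourage single-frame
    predictions that disagree with both neighbours, since manipulated
    regions are expected to span contiguous frames."""
    probs = logits.softmax(dim=-1)[..., 1]          # P(fake) per frame
    left, mid, right = probs[:, :-2], probs[:, 1:-1], probs[:, 2:]
    neighbour_mean = 0.5 * (left + right)
    return ((mid - neighbour_mean) ** 2).mean()


# Minimal usage example with random features and labels.
model = FrameTagger()
feats = torch.randn(4, 200, 80)                     # 4 utterances, 200 frames
labels = torch.randint(0, 2, (4, 200))              # per-frame real/fake tags
logits = model(feats)
loss = F.cross_entropy(logits.reshape(-1, 2), labels.reshape(-1)) \
       + 0.1 * isolated_frame_penalty(logits)       # 0.1 is an arbitrary weight
loss.backward()
```

The common thread with the abstract's motivation is that no decision should rest on a single frame: the convolution widens the context each frame sees, and the penalty suppresses isolated predictions that contradict both neighbours.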
Related papers
- Statistics-aware Audio-visual Deepfake Detector [11.671275975119089]
Methods in audio-visual deepfake detection mostly assess the synchronization between audio and visual features.
We propose a statistical feature loss to enhance the discrimination capability of the model.
Experiments on the DFDC and FakeAVCeleb datasets demonstrate the relevance of the proposed method.
arXiv Detail & Related papers (2024-07-16T12:15:41Z) - Robust Wake-Up Word Detection by Two-stage Multi-resolution Ensembles [48.208214762257136]
It employs two models: a lightweight on-device model for real-time processing of the audio stream and a verification model on the server-side.
To protect privacy, audio features are sent to the cloud instead of raw audio.
arXiv Detail & Related papers (2023-10-17T16:22:18Z) - Leveraging Foundation models for Unsupervised Audio-Visual Segmentation [49.94366155560371]
Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in a visual scene at the pixel level.
Existing AVS methods require fine-grained annotations of audio-mask pairs in a supervised learning fashion.
We introduce unsupervised audio-visual segmentation with no need for task-specific data annotations and model training.
arXiv Detail & Related papers (2023-09-13T05:05:47Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - Deep Spectro-temporal Artifacts for Detecting Synthesized Speech [57.42110898920759]
This paper provides an overall assessment of track 1 (Low-quality Fake Audio Detection) and track 2 (Partially Fake Audio Detection).
In this paper, spectro-temporal artifacts were detected using raw temporal signals, spectral features, as well as deep embedding features.
We ranked 4th and 5th in track 1 and track 2, respectively.
arXiv Detail & Related papers (2022-10-11T08:31:30Z) - Synthetic Voice Detection and Audio Splicing Detection using
SE-Res2Net-Conformer Architecture [2.9805017559176883]
This paper extends the existing Res2Net by involving the recent Conformer block to further exploit the local patterns on acoustic features.
Experimental results on ASVspoof 2019 database show that the proposed SE-Res2Net-Conformer architecture is able to improve the spoofing countermeasures performance.
This paper also proposes to re-formulate the existing audio splicing detection problem.
arXiv Detail & Related papers (2022-10-07T14:30:13Z) - Partially Fake Audio Detection by Self-attention-based Fake Span
Discovery [89.21979663248007]
We propose a novel framework by introducing the question-answering (fake span discovery) strategy with the self-attention mechanism to detect partially fake audios.
Our submission ranked second in the partially fake audio detection track of ADD 2022.
arXiv Detail & Related papers (2022-02-14T13:20:55Z) - Audio-visual Speech Separation with Adversarially Disentangled Visual
Representation [23.38624506211003]
Speech separation aims to separate individual voices from an audio mixture of multiple simultaneous talkers.
In our model, we use the face detector to detect the number of speakers in the scene and use visual information to avoid the permutation problem.
Our proposed model is shown to outperform the state-of-the-art audio-only model and three audio-visual models.
arXiv Detail & Related papers (2020-11-29T10:48:42Z) - FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and
Fusing Fine-Grained Voice Fragments With Attention [66.77490220410249]
We propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0.
FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance.
This approach is trained with reconstruction loss only without any disentanglement considerations between content and speaker information.
arXiv Detail & Related papers (2020-10-27T09:21:03Z) - DeepMSRF: A novel Deep Multimodal Speaker Recognition framework with
Feature selection [2.495606047371841]
We propose DeepMSRF, Deep Multimodal Speaker Recognition with Feature selection.
We execute DeepMSRF by feeding it features of the two modalities, namely speakers' audio and face images.
The goal of DeepMSRF is to identify the gender of the speaker first, and further to recognize his or her name for any given video stream.
arXiv Detail & Related papers (2020-07-14T04:28:12Z)