What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection
- URL: http://arxiv.org/abs/2312.09651v1
- Date: Fri, 15 Dec 2023 09:52:17 GMT
- Title: What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection
- Authors: Xiaohui Zhang, Jiangyan Yi, Chenglong Wang, Chuyuan Zhang, Siding Zeng, Jianhua Tao
- Abstract summary: Existing detection models have shown remarkable success in discriminating known deepfake audio, but struggle when encountering new attack types.
We propose a continual learning approach called Radian Weight Modification (RWM) for audio deepfake detection.
- Score: 53.063161380423715
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid evolution of speech synthesis and voice conversion has raised
substantial concerns due to the potential misuse of such technology, prompting
a pressing need for effective audio deepfake detection mechanisms. Existing
detection models have shown remarkable success in discriminating known deepfake
audio, but struggle when encountering new attack types. To address this
challenge, continual learning has emerged as one effective approach. In
this paper, we propose a continual learning approach called Radian Weight
Modification (RWM) for audio deepfake detection. The fundamental concept
underlying RWM involves categorizing all classes into two groups: those with
compact feature distributions across tasks, such as genuine audio, and those
with more spread-out distributions, like various types of fake audio. These
distinctions are quantified by means of the in-class cosine distance, which
subsequently serves as the basis for RWM to introduce a trainable gradient
modification direction for distinct data types. Experimental evaluations
against mainstream continual learning methods show that RWM outperforms them in
acquiring new knowledge and mitigating forgetting in audio deepfake
detection. Furthermore, RWM's applicability extends beyond audio deepfake
detection, demonstrating its potential significance in diverse machine learning
domains such as image recognition.
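The abstract does not give a formula for the in-class cosine distance. The Python sketch below assumes it is the mean cosine distance between each sample's feature vector and its class centroid, and uses that value to split classes into "compact" and "dispersed" groups with a hypothetical threshold. The names `in_class_cosine_distance`, `split_compact_and_dispersed`, and `threshold` are illustrative, not taken from the paper, and the sketch covers only this grouping step, not how RWM turns the groups into a trainable gradient modification direction.

```python
import math
from collections import defaultdict


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def in_class_cosine_distance(features, labels):
    """Mean cosine distance of each class's features to that class's centroid.

    ASSUMPTION: "in-class cosine distance" is read here as the average
    distance to the class mean; the paper may define it differently.
    """
    groups = defaultdict(list)
    for x, y in zip(features, labels):
        groups[y].append(x)
    result = {}
    for label, xs in groups.items():
        dim = len(xs[0])
        centroid = [sum(x[d] for x in xs) / len(xs) for d in range(dim)]
        result[label] = sum(cosine_distance(x, centroid) for x in xs) / len(xs)
    return result


def split_compact_and_dispersed(distances, threshold):
    """Partition classes using a hypothetical distance threshold."""
    compact = sorted(c for c, d in distances.items() if d <= threshold)
    dispersed = sorted(c for c, d in distances.items() if d > threshold)
    return compact, dispersed


# Toy 2-D "features": genuine samples point in nearly the same direction,
# fake samples (standing in for several attack types) point in many directions.
features = [
    [1.0, 0.1], [0.9, 0.0], [1.0, -0.1],
    [1.0, 1.0], [-1.0, 1.0], [0.0, -1.0],
]
labels = ["genuine"] * 3 + ["fake"] * 3
dists = in_class_cosine_distance(features, labels)
compact, dispersed = split_compact_and_dispersed(dists, threshold=0.1)
print(compact, dispersed)  # prints: ['genuine'] ['fake']
```

A centroid-based reading costs one pass per class; a mean pairwise distance would be another plausible interpretation and would rank these toy classes the same way.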
Related papers
- Contextual Cross-Modal Attention for Audio-Visual Deepfake Detection and Localization [3.9440964696313485]
In the digital age, the emergence of deepfakes and synthetic media presents a significant threat to societal and political integrity.
Deepfakes based on multi-modal manipulation, such as audio-visual, are more realistic and pose a greater threat.
We propose a novel multi-modal attention framework based on recurrent neural networks (RNNs) that leverages contextual information for audio-visual deepfake detection.
(2024-08-02T18:45:01Z)
- Statistics-aware Audio-visual Deepfake Detector [11.671275975119089]
Methods in audio-visual deepfake detection mostly assess the synchronization between audio and visual features.
We propose a statistical feature loss to enhance the discrimination capability of the model.
Experiments on the DFDC and FakeAVCeleb datasets demonstrate the relevance of the proposed method.
(2024-07-16T12:15:41Z)
- Generalized Source Tracing: Detecting Novel Audio Deepfake Algorithm with Real Emphasis and Fake Dispersion Strategy [39.93628750014384]
We propose Real Emphasis and Fake Dispersion (REFD) strategy for audio deepfake algorithm recognition.
REFD achieves 86.83% F1-score as a single system in Audio Deepfake Detection Challenge 2023 Track3, showcasing its state-of-the-art performance.
(2024-06-05T13:16:55Z)
- Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models [52.04189118767758]
Generalization is a major issue for current audio deepfake detectors.
In this paper we study the potential of large-scale pre-trained models for audio deepfake detection.
(2024-05-03T15:27:11Z)
- MIS-AVoiDD: Modality Invariant and Specific Representation for Audio-Visual Deepfake Detection [4.659427498118277]
A new kind of deepfake has emerged in which either the audio or the visual modality is manipulated.
Existing multimodal deepfake detectors are often based on the fusion of the audio and visual streams from the video.
In this paper, we tackle the problem at the representation level to aid the fusion of audio and visual streams for multimodal deepfake detection.
(2023-10-03T17:43:24Z)
- Do You Remember? Overcoming Catastrophic Forgetting for Fake Audio Detection [54.20974251478516]
We propose a continual learning algorithm for fake audio detection to overcome catastrophic forgetting.
When fine-tuning a detection network, our approach adaptively computes the direction of weight modification according to the ratio of genuine utterances to fake utterances.
Our method can easily be generalized to related fields, like speech emotion recognition.
(2023-08-07T05:05:49Z)
- Self-supervised Transformer for Deepfake Detection [112.81127845409002]
Real-world deepfake scenarios require face forgery detectors with stronger generalization abilities.
Inspired by transfer learning, neural networks pre-trained on other large-scale face-related tasks may provide useful features for deepfake detection.
In this paper, we propose a self-supervised transformer based audio-visual contrastive learning method.
(2022-03-02T17:44:40Z)
- Deep Learning for Hate Speech Detection: A Comparative Study [54.42226495344908]
We present here a large-scale empirical comparison of deep and shallow hate-speech detection methods.
Our goal is to illuminate progress in the area, and identify strengths and weaknesses in the current state-of-the-art.
In doing so we aim to provide guidance as to the use of hate-speech detection in practice, quantify the state-of-the-art, and identify future research directions.
(2022-02-19T03:48:20Z)
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
(2021-08-05T10:39:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.