The Codecfake Dataset and Countermeasures for the Universal Detection of Deepfake Audio
- URL: http://arxiv.org/abs/2405.04880v2
- Date: Wed, 15 May 2024 12:24:52 GMT
- Title: The Codecfake Dataset and Countermeasures for the Universal Detection of Deepfake Audio
- Authors: Yuankun Xie, Yi Lu, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Jianhua Tao, Xin Qi, Xiaopeng Wang, Yukun Liu, Haonan Cheng, Long Ye, Yi Sun
- Abstract summary: ALM-based deepfake audio exhibits wide proliferation, high deceptiveness, and type versatility.
To detect ALM-based deepfake audio effectively, we focus on the mechanism of the ALM-based audio generation method.
We propose the CSAM strategy to learn a domain-balanced and generalized minimum.
- Score: 42.84634652376024
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the proliferation of Audio Language Model (ALM) based deepfake audio, there is an urgent need for generalized detection methods. ALM-based deepfake audio currently exhibits wide proliferation, high deceptiveness, and type versatility, posing a significant challenge to current audio deepfake detection (ADD) models trained solely on vocoded data. To detect ALM-based deepfake audio effectively, we focus on the mechanism of the ALM-based audio generation method: the conversion from a neural codec to a waveform. We first construct the Codecfake dataset, an open-source large-scale dataset covering 2 languages, over 1M audio samples, and various test conditions, focused on ALM-based audio detection. As a countermeasure, to achieve universal detection of deepfake audio and to tackle the domain ascent bias issue of the original Sharpness-Aware Minimization (SAM), we propose the CSAM strategy to learn a domain-balanced and generalized minimum. In our experiments, we first demonstrate that ADD models trained with the Codecfake dataset can effectively detect ALM-based audio. Furthermore, our proposed generalization countermeasure yields the lowest average Equal Error Rate (EER), 0.616%, across all test conditions compared to baseline models. The dataset and associated code are available online.
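The headline result above is an Equal Error Rate (EER): the operating point at which the false-acceptance rate equals the false-rejection rate. As a quick reference, here is a minimal NumPy sketch of how EER is typically computed from detector scores; this is an illustration of the standard metric, not code from the paper, and the function and variable names are my own.

```python
import numpy as np

def compute_eer(scores, labels):
    """Equal Error Rate: the point where the false-acceptance rate (FAR,
    fakes accepted as genuine) equals the false-rejection rate (FRR,
    genuine utterances rejected).

    scores: higher means "more likely genuine"; labels: 1 = genuine, 0 = fake.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    fars, frrs = [], []
    # Sweep a decision threshold over every distinct score value.
    for t in np.unique(scores):
        accept = scores >= t
        fars.append(np.mean(accept[labels == 0]))    # fakes accepted
        frrs.append(np.mean(~accept[labels == 1]))   # genuine rejected
    fars, frrs = np.array(fars), np.array(frrs)
    # EER is where the two error curves cross; take the closest threshold.
    idx = np.argmin(np.abs(fars - frrs))
    return (fars[idx] + frrs[idx]) / 2.0
```

With perfectly separable scores this returns 0.0, matching the "0% EER under most ALM test conditions" phrasing used in the related papers below; a result such as 0.616% means the two error rates cross at roughly 0.00616.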
Related papers
- SONAR: A Synthetic AI-Audio Detection Framework and Benchmark [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark.
It aims to provide a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content.
It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based deepfake detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z) - Does Current Deepfake Audio Detection Model Effectively Detect ALM-based Deepfake Audio? [40.38305757279412]
Audio Language Models (ALMs) are rapidly advancing due to the developments in large language models and audio neural codecs.
This paper investigates the effectiveness of current countermeasures (CMs) against ALM-based audio.
Our findings reveal that the latest-trained CM can effectively detect ALM-based audio, achieving 0% equal error rate under most ALM test conditions.
arXiv Detail & Related papers (2024-08-20T13:45:34Z) - Codecfake: An Initial Dataset for Detecting LLM-based Deepfake Audio [40.21394391724075]
The proliferation of Large Language Model (LLM) based deepfake audio creates an urgent need for effective detection methods.
We propose Codecfake, which is generated by seven representative neural codec methods.
Experiment results show that codec-trained ADD models exhibit a 41.406% reduction in average equal error rate compared to vocoder-trained ADD models.
arXiv Detail & Related papers (2024-06-12T11:47:23Z) - Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models [52.04189118767758]
Generalization is a major issue for current audio deepfake detectors.
In this paper we study the potential of large-scale pre-trained models for audio deepfake detection.
arXiv Detail & Related papers (2024-05-03T15:27:11Z) - Cross-Domain Audio Deepfake Detection: Dataset and Analysis [11.985093463886056]
Audio deepfake detection (ADD) is essential for preventing the misuse of synthetic voices that may infringe on personal rights and privacy.
Recent zero-shot text-to-speech (TTS) models pose higher risks as they can clone voices with a single utterance.
We construct a new cross-domain ADD dataset comprising over 300 hours of speech data that is generated by five advanced zero-shot TTS models.
arXiv Detail & Related papers (2024-04-07T10:10:15Z) - Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark [65.79402756995084]
Real Acoustic Fields (RAF) is a new dataset that captures real acoustic room data from multiple modalities.
RAF is the first dataset to provide densely captured room acoustic data.
arXiv Detail & Related papers (2024-03-27T17:59:56Z) - AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset [21.90332221144928]
We propose the AV-Deepfake1M dataset for the detection and localization of deepfake audio-visual content.
The dataset contains content-driven (i) video manipulations, (ii) audio manipulations, and (iii) audio-visual manipulations for more than 2K subjects resulting in a total of more than 1M videos.
arXiv Detail & Related papers (2023-11-26T14:17:51Z) - AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection [53.448283629898214]
The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries.
Most previous work on detecting AI-generated fake videos utilizes only the visual modality or the audio modality.
We propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic manipulation and visual manipulation.
arXiv Detail & Related papers (2023-10-19T19:01:26Z) - Do You Remember? Overcoming Catastrophic Forgetting for Fake Audio Detection [54.20974251478516]
We propose a continual learning algorithm for fake audio detection to overcome catastrophic forgetting.
When fine-tuning a detection network, our approach adaptively computes the direction of weight modification according to the ratio of genuine utterances and fake utterances.
Our method can easily be generalized to related fields, like speech emotion recognition.
arXiv Detail & Related papers (2023-08-07T05:05:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.