I Can Hear You: Selective Robust Training for Deepfake Audio Detection
- URL: http://arxiv.org/abs/2411.00121v1
- Date: Thu, 31 Oct 2024 18:21:36 GMT
- Title: I Can Hear You: Selective Robust Training for Deepfake Audio Detection
- Authors: Zirui Zhang, Wei Hao, Aroon Sankoh, William Lin, Emanuel Mendiola-Ortiz, Junfeng Yang, Chengzhi Mao
- Abstract summary: We establish the largest public voice dataset to date, named DeepFakeVox-HQ, comprising 1.3 million samples.
Despite previously reported high accuracy, existing deepfake voice detectors struggle with our diversely collected dataset.
We propose F-SAT, a Frequency-Selective Adversarial Training method that focuses on high-frequency components.
- Score: 16.52185019459127
- License:
- Abstract: Recent advances in AI-generated voices have intensified the challenge of detecting deepfake audio, posing risks for scams and the spread of disinformation. To tackle this issue, we establish the largest public voice dataset to date, named DeepFakeVox-HQ, comprising 1.3 million samples, including 270,000 high-quality deepfake samples from 14 diverse sources. Despite previously reported high accuracy, existing deepfake voice detectors struggle with our diversely collected dataset, and their detection success rates drop even further under realistic corruptions and adversarial attacks. We conduct a holistic investigation into factors that enhance model robustness and show that incorporating a diversified set of voice augmentations is beneficial. Moreover, we find that the best detection models often rely on high-frequency features, which are imperceptible to humans and can be easily manipulated by an attacker. To address this, we propose F-SAT, a Frequency-Selective Adversarial Training method that focuses on high-frequency components. Empirical results demonstrate that using our training dataset boosts baseline model performance (without robust training) by 33%, and our robust training further improves accuracy by 7.7% on clean samples and by 29.3% on corrupted and attacked samples, over the state-of-the-art RawNet3 model.
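The abstract does not spell out how F-SAT constrains its perturbations; the sketch below illustrates the frequency-selective idea it describes, restricting an FGSM-style adversarial perturbation to high-frequency STFT bins during training. The detector interface, cutoff, and step size are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of frequency-selective adversarial training, assuming a
# PyTorch detector that classifies raw 16 kHz waveforms. The perturbation is
# projected onto high-frequency STFT bins so the attack targets the
# imperceptible cues the paper says detectors over-rely on.
# cutoff_hz, epsilon, and n_fft are illustrative choices.
import torch
import torch.nn.functional as F

def highpass_project(perturbation, sample_rate=16000, cutoff_hz=4000, n_fft=512):
    """Keep only the high-frequency content of a perturbation via STFT masking."""
    window = torch.hann_window(n_fft, device=perturbation.device)
    spec = torch.stft(perturbation, n_fft=n_fft, window=window, return_complex=True)
    cutoff_bin = int(cutoff_hz * n_fft / sample_rate)  # bins below this are removed
    spec[..., :cutoff_bin, :] = 0
    return torch.istft(spec, n_fft=n_fft, window=window, length=perturbation.shape[-1])

def fsat_step(detector, waveforms, labels, epsilon=1e-3):
    """One adversarial training step whose perturbation lives only in high frequencies."""
    waveforms = waveforms.clone().requires_grad_(True)
    loss = F.cross_entropy(detector(waveforms), labels)
    grad, = torch.autograd.grad(loss, waveforms)
    delta = highpass_project(epsilon * grad.sign())  # FGSM step, high frequencies only
    return F.cross_entropy(detector(waveforms.detach() + delta), labels)
```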
Related papers
- Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models [52.04189118767758]
Generalization is a central issue for current audio deepfake detectors.
In this paper we study the potential of large-scale pre-trained models for audio deepfake detection.
arXiv Detail & Related papers (2024-05-03T15:27:11Z)
- CLAD: Robust Audio Deepfake Detection Against Manipulation Attacks with Contrastive Learning [20.625160354407974]
We study the susceptibility of the most widely adopted audio deepfake detectors to manipulation attacks.
Even manipulations like volume control can significantly bypass detection without affecting human perception.
We propose CLAD (Contrastive Learning-based Audio deepfake Detector) to enhance the robustness against manipulation attacks.
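The summary does not give CLAD's actual objective; as a rough sketch of the contrastive idea, an encoder can be trained so that a clip and a manipulated view of it (here, a simple volume change, one of the attacks highlighted above) map to matching embeddings, e.g., with an NT-Xent-style loss. The encoder and temperature below are assumptions.

```python
# Sketch of contrastive training for manipulation robustness: each clip's
# manipulated view is its positive pair, so cheap edits such as volume
# control cannot move the embedding enough to flip a detector built on it.
import torch
import torch.nn.functional as F

def manipulate(waves):
    """Stand-in manipulation attack: random per-clip volume scaling."""
    gain = torch.empty(waves.shape[0], 1, device=waves.device).uniform_(0.5, 2.0)
    return waves * gain

def contrastive_loss(encoder, waves, temperature=0.1):
    """NT-Xent-style loss with positives on the diagonal of the similarity matrix."""
    z1 = F.normalize(encoder(waves), dim=-1)              # (batch, dim)
    z2 = F.normalize(encoder(manipulate(waves)), dim=-1)  # manipulated views
    logits = z1 @ z2.T / temperature
    targets = torch.arange(waves.shape[0], device=waves.device)
    return F.cross_entropy(logits, targets)
```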
arXiv Detail & Related papers (2024-04-24T13:10:35Z)
- ROPO: Robust Preference Optimization for Large Language Models [59.10763211091664]
We propose ROPO, an iterative alignment approach that integrates noise tolerance and filtering of noisy samples without the aid of external models.
Experiments on three widely-used datasets with Mistral-7B and Llama-2-7B demonstrate that ROPO significantly outperforms existing preference alignment methods.
arXiv Detail & Related papers (2024-04-05T13:58:51Z)
- PITCH: AI-assisted Tagging of Deepfake Audio Calls using Challenge-Response [14.604998731837595]
PITCH is a robust challenge-response method to detect and tag interactive deepfake audio calls.
We developed a comprehensive taxonomy of audio challenges based on the human auditory system, linguistics, and environmental factors.
Our solution gave users maximum control and boosted detection accuracy to 84.5%.
arXiv Detail & Related papers (2024-02-28T06:17:55Z)
- Improved DeepFake Detection Using Whisper Features [2.846767128062884]
We investigate the influence of the Whisper automatic speech recognition model as a deepfake (DF) detection front-end.
We show that Whisper-based features improve detection for each model and outperform recent results on the In-The-Wild dataset, reducing the Equal Error Rate by 21%.
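The summary does not say how the Whisper features are obtained; one plausible minimal recipe with the openai-whisper package is to run the pretrained encoder and pool its frame embeddings for a small classification head. The pooling and the linear head are illustrative assumptions, not necessarily the paper's setup.

```python
# Possible Whisper front-end: encode audio with the pretrained model, mean-pool
# the encoder frames, and train only a small real-vs-fake head on top.
import torch
import whisper  # pip install openai-whisper

model = whisper.load_model("base")
head = torch.nn.Linear(model.dims.n_audio_state, 2).to(model.device)  # real vs. fake

def whisper_embedding(path):
    audio = whisper.pad_or_trim(whisper.load_audio(path))   # fixed 30 s window
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    with torch.no_grad():
        frames = model.embed_audio(mel.unsqueeze(0))        # (1, frames, dim)
    return frames.mean(dim=1)                               # pooled clip embedding

logits = head(whisper_embedding("clip.wav"))
```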
arXiv Detail & Related papers (2023-06-02T10:34:05Z)
- Scenario Aware Speech Recognition: Advancements for Apollo Fearless Steps & CHiME-4 Corpora [70.46867541361982]
We consider TRILL, a general non-semantic speech representation trained with a self-supervised triplet-loss criterion.
We observe +5.42% and +3.18% relative WER improvement for the development and evaluation sets of Fearless Steps.
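For readers unfamiliar with the mechanism named here: a TRILL-style triplet criterion treats temporally nearby segments of a clip as positives and segments from other clips as negatives. A minimal sketch with PyTorch's built-in loss follows, with the encoder and segment sampling left as assumptions.

```python
# Triplet criterion behind TRILL-style self-supervision: pull an anchor toward
# a temporally nearby segment of the same clip, push it from another clip.
import torch

triplet = torch.nn.TripletMarginLoss(margin=1.0)

def trill_style_loss(encoder, anchor_seg, nearby_seg, other_clip_seg):
    return triplet(encoder(anchor_seg),       # anchor
                   encoder(nearby_seg),       # positive: same clip, close in time
                   encoder(other_clip_seg))   # negative: different clip
```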
arXiv Detail & Related papers (2021-09-23T00:43:32Z)
- Uncertainty-Aware COVID-19 Detection from Imbalanced Sound Data [15.833328435820622]
We propose an ensemble framework in which multiple deep learning models are developed for sound-based COVID-19 detection.
It is shown that false predictions often yield higher uncertainty.
This study paves the way for a more robust sound-based COVID-19 automated screening system.
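The observation that false predictions carry higher uncertainty suggests using ensemble disagreement as the screening signal; a hedged sketch of that pattern follows, with the member models and threshold as illustrative assumptions.

```python
# Deep-ensemble uncertainty: average the members' softmax outputs and use
# their disagreement on the positive class to flag predictions for review.
import torch

def ensemble_predict(models, x, uncertainty_threshold=0.15):
    probs = torch.stack([m(x).softmax(dim=-1) for m in models])  # (M, batch, classes)
    mean_prob = probs.mean(dim=0)              # ensemble prediction
    uncertainty = probs[..., 1].std(dim=0)     # disagreement on "COVID-positive"
    needs_review = uncertainty > uncertainty_threshold
    return mean_prob.argmax(dim=-1), uncertainty, needs_review
```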
arXiv Detail & Related papers (2021-04-05T16:54:03Z)
- Adversarially robust deepfake media detection using fused convolutional neural network predictions [79.00202519223662]
Current deepfake detection systems struggle against unseen data.
We employ three different deep Convolutional Neural Network (CNN) models to classify fake and real images extracted from videos.
The proposed technique outperforms state-of-the-art models with 96.5% accuracy.
arXiv Detail & Related papers (2021-02-11T11:28:00Z)
- Detecting COVID-19 from Breathing and Coughing Sounds using Deep Neural Networks [68.8204255655161]
We adapt an ensemble of Convolutional Neural Networks to classify whether a speaker is infected with COVID-19.
Ultimately, it achieves an Unweighted Average Recall (UAR) of 74.9%, or an Area Under the ROC Curve (AUC) of 80.7%, by ensembling neural networks.
arXiv Detail & Related papers (2020-12-29T01:14:17Z)
- From Sound Representation to Model Robustness [82.21746840893658]
We investigate the impact of different standard environmental sound representations (spectrograms) on the recognition performance and adversarial attack robustness of a victim residual convolutional neural network.
Averaged over various experiments on three environmental sound datasets, we found the ResNet-18 model outperforms other deep learning architectures.
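A minimal version of the victim pipeline such a study evaluates, a mel-spectrogram front-end feeding ResNet-18, might look as follows; the exact representations and hyperparameters in the paper may differ.

```python
# Spectrogram-to-ResNet victim pipeline: render a mel spectrogram as an image
# and classify it with ResNet-18. Sample rate, n_mels, and class count are
# illustrative, not the paper's settings.
import torch
import torchaudio
import torchvision

melspec = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=128)
to_db = torchaudio.transforms.AmplitudeToDB()
resnet = torchvision.models.resnet18(num_classes=10)

def classify(waveform):                           # waveform: (batch, samples)
    spec = to_db(melspec(waveform))               # (batch, n_mels, frames)
    image = spec.unsqueeze(1).repeat(1, 3, 1, 1)  # repeat to 3 "RGB" channels
    return resnet(image)
```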
arXiv Detail & Related papers (2020-07-27T17:30:49Z)