XMAD-Bench: Cross-Domain Multilingual Audio Deepfake Benchmark
- URL: http://arxiv.org/abs/2506.00462v1
- Date: Sat, 31 May 2025 08:28:36 GMT
- Title: XMAD-Bench: Cross-Domain Multilingual Audio Deepfake Benchmark
- Authors: Ioan-Paul Ciobanu, Andrei-Iulian Hiji, Nicolae-Catalin Ristea, Paul Irofti, Cristian Rusu, Radu Tudor Ionescu
- Abstract summary: XMAD-Bench is a large-scale cross-domain multilingual audio deepfake benchmark comprising 668.8 hours of real and deepfake speech. Our benchmark highlights the need for the development of robust audio deepfake detectors.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent advances in audio generation led to an increasing number of deepfakes, making the general public more vulnerable to financial scams, identity theft, and misinformation. Audio deepfake detectors promise to alleviate this issue, with many recent studies reporting accuracy rates close to 99%. However, these methods are typically tested in an in-domain setup, where the deepfake samples from the training and test sets are produced by the same generative models. To this end, we introduce XMAD-Bench, a large-scale cross-domain multilingual audio deepfake benchmark comprising 668.8 hours of real and deepfake speech. In our novel dataset, the speakers, the generative methods, and the real audio sources are distinct across training and test splits. This leads to a challenging cross-domain evaluation setup, where audio deepfake detectors can be tested ``in the wild''. Our in-domain and cross-domain experiments indicate a clear disparity between the in-domain performance of deepfake detectors, which is usually as high as 100%, and the cross-domain performance of the same models, which is sometimes similar to random chance. Our benchmark highlights the need for the development of robust audio deepfake detectors, which maintain their generalization capacity across different languages, speakers, generative methods, and data sources. Our benchmark is publicly released at https://github.com/ristea/xmad-bench/.
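The gap the abstract describes, near-perfect in-domain accuracy versus near-chance cross-domain accuracy, can be illustrated with a minimal, self-contained sketch. The example below is hypothetical and not drawn from the paper: it uses a synthetic one-dimensional "artifact score" in place of real audio features, and a simple threshold detector in place of a trained model. When the test fakes come from the same simulated generator as the training fakes, accuracy is high; when a new generator leaves a weaker artifact, the same detector falls toward random chance.

```python
import random

def make_samples(n, fake_mean, seed):
    # Synthetic 1-D feature: real speech ~ N(0, 1), deepfakes ~ N(fake_mean, 1).
    # fake_mean stands in for how strong a generator's artifact is.
    rng = random.Random(seed)
    real = [(rng.gauss(0.0, 1.0), 0) for _ in range(n)]
    fake = [(rng.gauss(fake_mean, 1.0), 1) for _ in range(n)]
    return real + fake

def fit_threshold(samples):
    # "Train" the detector: midpoint between the two class means.
    real = [x for x, y in samples if y == 0]
    fake = [x for x, y in samples if y == 1]
    return (sum(real) / len(real) + sum(fake) / len(fake)) / 2

def accuracy(samples, thr):
    correct = sum((x > thr) == bool(y) for x, y in samples)
    return correct / len(samples)

# In-domain: test fakes come from the same simulated generator (same shift).
train = make_samples(1000, fake_mean=3.0, seed=0)
thr = fit_threshold(train)
in_domain = accuracy(make_samples(1000, fake_mean=3.0, seed=1), thr)

# Cross-domain: an unseen generator leaves a much weaker artifact.
cross_domain = accuracy(make_samples(1000, fake_mean=0.3, seed=2), thr)

print(f"in-domain acc: {in_domain:.2f}, cross-domain acc: {cross_domain:.2f}")
```

This mirrors the benchmark's evaluation protocol only in spirit: XMAD-Bench separates speakers, generative methods, and real audio sources across splits, whereas the sketch varies a single synthetic parameter.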
Related papers
- MAVOS-DD: Multilingual Audio-Video Open-Set Deepfake Detection Benchmark [108.46287432944392]
We present the first large-scale open-set benchmark for multilingual audio-video deepfake detection. Our dataset comprises over 250 hours of real and fake videos across eight languages. For each language, the fake videos are generated with seven distinct deepfake generation models.
arXiv Detail & Related papers (2025-05-16T10:42:30Z) - End-to-end Audio Deepfake Detection from RAW Waveforms: a RawNet-Based Approach with Cross-Dataset Evaluation [8.11594945165255]
We propose an end-to-end deep learning framework for audio deepfake detection that operates directly on raw waveforms. Our model, RawNetLite, is a lightweight convolutional-recurrent architecture designed to capture both spectral and temporal features without handcrafted preprocessing.
arXiv Detail & Related papers (2025-04-29T16:38:23Z) - Deepfake Media Generation and Detection in the Generative AI Era: A Survey and Outlook [101.30779332427217]
We survey deepfake generation and detection techniques, including the most recent developments in the field. We identify various kinds of deepfakes, according to the procedure used to alter or generate the fake content. We develop a novel multimodal benchmark to evaluate deepfake detectors on out-of-distribution content.
arXiv Detail & Related papers (2024-11-29T08:29:25Z) - DF40: Toward Next-Generation Deepfake Detection [62.073997142001424]
Existing works identify top-notch detection algorithms and models by adhering to the common practice of training detectors on one specific dataset and testing them on other prevalent deepfake datasets.
But can these stand-out "winners" be truly applied to tackle the myriad of realistic and diverse deepfakes lurking in the real world?
We construct a highly diverse deepfake detection dataset called DF40, which comprises 40 distinct deepfake techniques.
arXiv Detail & Related papers (2024-06-19T12:35:02Z) - Benchmarking Cross-Domain Audio-Visual Deception Detection [45.342156006617394]
We present the first cross-domain audio-visual deception detection benchmark.
We compare single-to-single and multi-to-single domain generalization performance.
We propose an algorithm to enhance the generalization performance.
arXiv Detail & Related papers (2024-05-11T12:06:31Z) - Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models [52.04189118767758]
Generalization is a main issue for current audio deepfake detectors.
In this paper we study the potential of large-scale pre-trained models for audio deepfake detection.
arXiv Detail & Related papers (2024-05-03T15:27:11Z) - SceneFake: An Initial Dataset and Benchmarks for Scene Fake Audio Detection [54.74467470358476]
This paper proposes a dataset for scene fake audio detection named SceneFake.
A manipulated audio is generated by only tampering with the acoustic scene of an original audio.
Some scene fake audio detection benchmark results on the SceneFake dataset are reported in this paper.
arXiv Detail & Related papers (2022-11-11T09:05:50Z) - Does Audio Deepfake Detection Generalize? [6.415366195115544]
We systematize audio spoofing detection by re-implementing and uniformly evaluating architectures from related work.
We publish a new dataset consisting of 37.9 hours of found audio recordings of celebrities and politicians, of which 17.2 hours are deepfakes.
These results may suggest that the community has tailored its solutions too closely to the prevailing ASVspoof benchmark and that deepfakes are much harder to detect outside the lab than previously thought.
arXiv Detail & Related papers (2022-03-30T12:48:22Z) - Voice-Face Homogeneity Tells Deepfake [56.334968246631725]
Existing detection approaches contribute to exploring the specific artifacts in deepfake videos.
We propose to perform the deepfake detection from an unexplored voice-face matching view.
Our model obtains significantly improved performance as compared to other state-of-the-art competitors.
arXiv Detail & Related papers (2022-03-04T09:08:50Z) - Evaluation of an Audio-Video Multimodal Deepfake Dataset using Unimodal and Multimodal Detectors [18.862258543488355]
Deepfakes can cause security and privacy issues.
The cloning of human voices using deep-learning technologies is also an emerging domain.
To develop a good deepfake detector, we need a detector that can detect deepfakes of multiple modalities.
arXiv Detail & Related papers (2021-09-07T11:00:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content above (including all information) and is not responsible for any consequences.