EnvSDD: Benchmarking Environmental Sound Deepfake Detection
- URL: http://arxiv.org/abs/2505.19203v1
- Date: Sun, 25 May 2025 16:02:56 GMT
- Title: EnvSDD: Benchmarking Environmental Sound Deepfake Detection
- Authors: Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Haohe Liu, Wenwu Wang, Mark D Plumbley
- Abstract summary: Environmental sounds have different characteristics, which may make methods for detecting speech and singing deepfakes less effective for real-world sounds. Existing datasets for environmental sound deepfake detection are limited in scale and audio types. We introduce EnvSDD, the first large-scale curated dataset designed for this task, consisting of 45.25 hours of real and 316.74 hours of fake audio.
- Score: 32.52097731108311
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio generation systems now create very realistic soundscapes that can enhance media production, but also pose potential risks. Several studies have examined deepfakes in speech or singing voice. However, environmental sounds have different characteristics, which may make methods for detecting speech and singing deepfakes less effective for real-world sounds. In addition, existing datasets for environmental sound deepfake detection are limited in scale and audio types. To address this gap, we introduce EnvSDD, the first large-scale curated dataset designed for this task, consisting of 45.25 hours of real and 316.74 hours of fake audio. The test set includes diverse conditions to evaluate the generalizability, such as unseen generation models and unseen datasets. We also propose an audio deepfake detection system, based on a pre-trained audio foundation model. Results on EnvSDD show that our proposed system outperforms the state-of-the-art systems from speech and singing domains.
Related papers
- MAVOS-DD: Multilingual Audio-Video Open-Set Deepfake Detection Benchmark [108.46287432944392]
We present the first large-scale open-set benchmark for multilingual audio-video deepfake detection. Our dataset comprises over 250 hours of real and fake videos across eight languages. For each language, the fake videos are generated with seven distinct deepfake generation models.
arXiv Detail & Related papers (2025-05-16T10:42:30Z)
- End-to-end Audio Deepfake Detection from RAW Waveforms: a RawNet-Based Approach with Cross-Dataset Evaluation [8.11594945165255]
We propose an end-to-end deep learning framework for audio deepfake detection that operates directly on raw waveforms. Our model, RawNetLite, is a lightweight convolutional-recurrent architecture designed to capture both spectral and temporal features without handcrafted preprocessing.
arXiv Detail & Related papers (2025-04-29T16:38:23Z)
- SafeEar: Content Privacy-Preserving Audio Deepfake Detection [17.859275594843965]
We propose SafeEar, a novel framework that aims to detect deepfake audios without relying on accessing the speech content within.
Our key idea is to build a neural audio codec into a novel decoupling model that cleanly separates the semantic and acoustic information in audio samples.
In this way, no semantic content will be exposed to the detector.
arXiv Detail & Related papers (2024-09-14T02:45:09Z)
- An RFP dataset for Real, Fake, and Partially fake audio detection [0.36832029288386137]
The paper presents the RFP dataset, which comprises five distinct audio types: partial fake (PF), audio with noise, voice conversion (VC), text-to-speech (TTS), and real.
The data are then used to evaluate several detection models, revealing that the available models incur a markedly higher equal error rate (EER) on PF audio than on entirely fake audio.
arXiv Detail & Related papers (2024-04-26T23:00:56Z)
- Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark [65.79402756995084]
Real Acoustic Fields (RAF) is a new dataset that captures real acoustic room data from multiple modalities.
RAF is the first dataset to provide densely captured room acoustic data.
arXiv Detail & Related papers (2024-03-27T17:59:56Z)
- AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z)
- Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
- SceneFake: An Initial Dataset and Benchmarks for Scene Fake Audio Detection [54.74467470358476]
This paper proposes a dataset for scene fake audio detection named SceneFake.
A manipulated audio is generated by only tampering with the acoustic scene of an original audio.
Some scene fake audio detection benchmark results on the SceneFake dataset are reported in this paper.
arXiv Detail & Related papers (2022-11-11T09:05:50Z)
- On Out-of-Distribution Detection for Audio with Deep Nearest Neighbors [3.591566487849146]
Out-of-distribution (OOD) detection is concerned with identifying data points that do not belong to the same distribution as the model's training data.
We show that this simple and flexible method effectively detects OOD inputs across a broad category of audio (and speech) datasets.
arXiv Detail & Related papers (2022-10-27T09:35:33Z)
- Audio Deepfake Attribution: An Initial Dataset and Investigation [41.62487394875349]
We design the first deepfake audio dataset for the attribution of audio generation tools, called Audio Deepfake Attribution (ADA).
We propose the Class-Representation Multi-Center Learning (CRML) method for open-set audio deepfake attribution (OSADA).
Experimental results demonstrate that the CRML method effectively addresses open-set risks in real-world scenarios.
arXiv Detail & Related papers (2022-08-21T05:15:40Z)
- Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use a pre-trained wav2vec model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
arXiv Detail & Related papers (2022-08-20T06:46:55Z)
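The RFP entry above reports results as an equal error rate (EER), the operating point where the false-alarm rate equals the miss rate. The sketch below is a minimal illustration of that metric, not code from any of the listed papers. The function name `equal_error_rate`, the convention that a higher score means "more likely fake", and the labels (1 for fake, 0 for real) are all assumptions made for this example.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER and the threshold at which it occurs.

    Assumed conventions (not taken from any listed paper):
    scores: higher means "more likely fake".
    labels: 1 for fake, 0 for real.
    """
    n_fake = sum(labels)
    n_real = len(labels) - n_fake
    if n_fake == 0 or n_real == 0:
        raise ValueError("need at least one real and one fake sample")

    best_gap, eer, best_thr = None, None, None
    # Try every observed score as a threshold: predict fake if score >= thr.
    for thr in sorted(set(scores)):
        false_alarms = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= thr)
        misses = sum(1 for s, y in zip(scores, labels) if y == 1 and s < thr)
        far = false_alarms / n_real  # real clips flagged as fake
        frr = misses / n_fake        # fake clips accepted as real
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer, best_thr = gap, (far + frr) / 2, thr
    return eer, best_thr


if __name__ == "__main__":
    # Toy scores: two real clips, then two fake clips.
    eer, thr = equal_error_rate([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
    print(f"EER = {eer:.2f} at threshold {thr}")
```

With a finite set of scores the two error rates rarely cross exactly, so this sketch reports their average at the closest threshold. Evaluation toolkits often interpolate along the ROC curve instead, which can give slightly different values.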