Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning
- URL: http://arxiv.org/abs/2412.00175v1
- Date: Fri, 29 Nov 2024 18:58:20 GMT
- Title: Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning
- Authors: Dragos-Alexandru Boldisor, Stefan Smeu, Dan Oneata, Elisabeta Oneata
- Abstract summary: We show that two of the most widely used audio-video deepfake datasets suffer from a previously unidentified spurious feature: the leading silence.
Fake videos start with a very brief moment of silence, and based on this feature alone we can separate the real and fake samples almost perfectly.
We propose a shift from supervised to unsupervised learning by training models exclusively on real data.
- Score: 3.453303606167197
- Abstract: Good datasets are essential for developing and benchmarking any machine learning system. Their importance is even greater for safety-critical applications such as deepfake detection, the focus of this paper. Here we reveal that two of the most widely used audio-video deepfake datasets suffer from a previously unidentified spurious feature: the leading silence. Fake videos start with a very brief moment of silence, and based on this feature alone we can separate the real and fake samples almost perfectly. As such, previous audio-only and audio-video models exploit the presence of silence in the fake videos and consequently perform worse when the leading silence is removed. To avoid latching onto this unwanted artifact, and possibly onto other unrevealed ones, we propose a shift from supervised to unsupervised learning by training models exclusively on real data. We show that by aligning self-supervised audio-video representations we remove the risk of relying on dataset-specific biases and improve robustness in deepfake detection.
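The leading-silence shortcut described in the abstract can be illustrated with a minimal sketch. The code below is not the authors' procedure: the function names, the amplitude threshold, and the toy waveform are all assumptions made for illustration. It measures how long a clip stays below a small amplitude threshold at its start, and shows how that prefix could be trimmed before a clip is passed to a detector.

```python
import math


def leading_silence_samples(samples, threshold=1e-3):
    """Count the samples in the initial run whose magnitude is <= threshold.

    `samples` is a sequence of floats in [-1, 1]. The threshold is an
    assumed cutoff chosen for this sketch, not a value from the paper.
    """
    for index, value in enumerate(samples):
        if abs(value) > threshold:
            return index
    # Every sample is below the threshold: the whole clip counts as silence.
    return len(samples)


def leading_silence_seconds(samples, sample_rate, threshold=1e-3):
    """Duration of the leading silent run, in seconds."""
    return leading_silence_samples(samples, threshold) / sample_rate


def trim_leading_silence(samples, threshold=1e-3):
    """Drop the leading silent run so a model cannot key on its presence."""
    return samples[leading_silence_samples(samples, threshold):]


# Toy data: a 220 Hz tone at 16 kHz, with and without 400 zero samples
# (25 ms of digital silence) prepended to mimic the artifact.
SAMPLE_RATE = 16_000
tone = [0.5 * math.sin(2 * math.pi * 220 * n / SAMPLE_RATE) for n in range(1, 1601)]
padded = [0.0] * 400 + tone

print(leading_silence_seconds(tone, SAMPLE_RATE))
print(leading_silence_seconds(padded, SAMPLE_RATE))
```

A single scalar such as this duration is the kind of feature that, according to the abstract, suffices to separate real from fake samples in the affected datasets. Trimming it is one simple way to check whether a detector's performance depends on the artifact.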
Related papers
- SafeEar: Content Privacy-Preserving Audio Deepfake Detection [17.859275594843965]
We propose SafeEar, a novel framework that aims to detect deepfake audios without relying on accessing the speech content within.
Our key idea is to turn a neural audio codec into a novel decoupling model that cleanly separates the semantic and acoustic information in audio samples.
In this way, no semantic content will be exposed to the detector.
arXiv Detail & Related papers (2024-09-14T02:45:09Z)
- A Multi-Stream Fusion Approach with One-Class Learning for Audio-Visual Deepfake Detection [17.285669984798975]
This paper addresses the challenge of developing a robust audio-visual deepfake detection model.
New generation algorithms are continually emerging, and these algorithms are not encountered during the development of detection methods.
We propose a multi-stream fusion approach with one-class learning as a representation-level regularization technique.
arXiv Detail & Related papers (2024-06-20T10:33:15Z)
- AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection [2.985620880452743]
We present Audio-Visual Feature Fusion (AVFF), a two-stage cross-modal learning method for improved deepfake detection.
To extract rich cross-modal representations, we use contrastive learning and autoencoding objectives, and introduce a novel audio-visual masking and feature fusion strategy.
We report 98.6% accuracy and 99.1% AUC on the FakeAVCeleb dataset, outperforming the current audio-visual state-of-the-art by 14.9% and 9.9%, respectively.
arXiv Detail & Related papers (2024-06-05T05:20:12Z)
- Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models [52.04189118767758]
Generalization is a main issue for current audio deepfake detectors.
In this paper we study the potential of large-scale pre-trained models for audio deepfake detection.
arXiv Detail & Related papers (2024-05-03T15:27:11Z)
- AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection [53.448283629898214]
The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries.
Most previous work on detecting AI-generated fake videos uses only the visual modality or only the audio modality.
We propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic manipulation and visual manipulation.
arXiv Detail & Related papers (2023-10-19T19:01:26Z)
- SceneFake: An Initial Dataset and Benchmarks for Scene Fake Audio Detection [54.74467470358476]
This paper proposes a dataset for scene fake audio detection named SceneFake.
A manipulated audio sample is generated by tampering only with the acoustic scene of an original recording.
Some scene fake audio detection benchmark results on the SceneFake dataset are reported in this paper.
arXiv Detail & Related papers (2022-11-11T09:05:50Z)
- Audio Deepfake Attribution: An Initial Dataset and Investigation [41.62487394875349]
We design the first deepfake audio dataset for the attribution of audio generation tools, called Audio Deepfake Attribution (ADA).
We propose the Class-Representation Multi-Center Learning (CRML) method for open-set audio deepfake attribution (OSADA).
Experimental results demonstrate that the CRML method effectively addresses open-set risks in real-world scenarios.
arXiv Detail & Related papers (2022-08-21T05:15:40Z)
- Audio-Visual Person-of-Interest DeepFake Detection [77.04789677645682]
The aim of this work is to propose a deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world.
We leverage a contrastive learning paradigm to learn the moving-face and audio segment embeddings that are most discriminative for each identity.
Our method can detect both single-modality (audio-only, video-only) and multi-modality (audio-video) attacks, and is robust to low-quality or corrupted videos.
arXiv Detail & Related papers (2022-04-06T20:51:40Z)
- Voice-Face Homogeneity Tells Deepfake [56.334968246631725]
Existing detection approaches contribute to exploring the specific artifacts in deepfake videos.
We propose to perform the deepfake detection from an unexplored voice-face matching view.
Our model obtains significantly improved performance as compared to other state-of-the-art competitors.
arXiv Detail & Related papers (2022-03-04T09:08:50Z)
- Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation.
We demonstrate that our approach finds videos with high audio-visual correspondence, and show that self-supervised models trained on our automatically constructed data achieve downstream performance similar to that of models trained on existing video datasets of similar scale.
arXiv Detail & Related papers (2021-01-26T14:27:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.