Context and Transcripts Improve Detection of Deepfake Audios of Public Figures
- URL: http://arxiv.org/abs/2601.13464v1
- Date: Mon, 19 Jan 2026 23:40:05 GMT
- Title: Context and Transcripts Improve Detection of Deepfake Audios of Public Figures
- Authors: Chongyang Gao, Marco Postiglione, Julian Baldwin, Natalia Denisenko, Isabel Gortner, Luke Fosdick, Chiara Pulice, Sarit Kraus, V. S. Subrahmanian,
- Abstract summary: Current audio deepfake detectors only analyze the audio file without considering either context or transcripts. We show that sufficient context and/or the transcript can significantly improve the efficacy of audio deepfake detectors. We additionally show that CADD, via its use of context and/or transcripts, is more robust to 5 adversarial evasion strategies.
- Score: 24.44957433526574
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Humans use context to assess the veracity of information. However, current audio deepfake detectors only analyze the audio file without considering either context or transcripts. We create and analyze a Journalist-provided Deepfake Dataset (JDD) of 255 public deepfakes which were primarily contributed by over 70 journalists since early 2024. We also generate a synthetic audio dataset (SYN) of dead public figures and propose a novel Context-based Audio Deepfake Detector (CADD) architecture. In addition, we evaluate performance on two large-scale datasets: ITW and P$^2$V. We show that sufficient context and/or the transcript can significantly improve the efficacy of audio deepfake detectors. Performance (measured via F1 score, AUC, and EER) of multiple baseline audio deepfake detectors and traditional classifiers can be improved by 5%-37.58% in F1-score, 3.77%-42.79% in AUC, and 6.17%-47.83% in EER. We additionally show that CADD, via its use of context and/or transcripts, is more robust to 5 adversarial evasion strategies, limiting performance degradation to an average of just -0.71% across all experiments. Code, models, and datasets are available at our project page: https://sites.northwestern.edu/nsail/cadd-context-based-audio-deepfake-detection (access restricted during review).
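The abstract reports improvements in F1 score, AUC, and EER. Of these, the equal error rate (EER) is the least familiar outside speaker verification: it is the operating point at which the false-accept rate equals the false-reject rate. As a generic illustration (not code from the paper), a minimal EER computation sweeps candidate thresholds over detector scores until the two error rates meet:

```python
import numpy as np

def equal_error_rate(labels, scores):
    """Illustrative EER: find the threshold where the false-accept rate
    (real audio flagged as fake) meets the false-reject rate (fake audio
    passed as real), and return the averaged error at that point.

    labels: 1 = fake, 0 = real; scores: higher = more likely fake.
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    best_gap, eer = 2.0, None
    for t in np.sort(np.unique(scores)):
        preds = scores >= t  # predict "fake" at or above threshold t
        far = preds[labels == 0].mean() if np.any(labels == 0) else 0.0
        frr = (~preds)[labels == 1].mean() if np.any(labels == 1) else 0.0
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer
```

On a perfectly separable toy example, e.g. `equal_error_rate([0, 0, 0, 1, 1, 1], [0.1, 0.2, 0.3, 0.7, 0.8, 0.9])`, the EER is 0.0; production evaluations typically interpolate the ROC curve rather than sweeping raw score thresholds.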
Related papers
- AUDDT: Audio Unified Deepfake Detection Benchmark Toolkit [7.279026980203529]
We systematically review 28 existing audio deepfake datasets and present an open-source benchmarking toolkit called AUDDT. The goal of this toolkit is to automate the evaluation of pretrained detectors across these 28 datasets, giving users direct feedback on the advantages and shortcomings of their deepfake detectors.
arXiv Detail & Related papers (2025-09-25T21:09:40Z) - AUDETER: A Large-scale Dataset for Deepfake Audio Detection in Open Worlds [38.75029700407531]
AUDETER is a large-scale, highly diverse deepfake audio dataset. It consists of over 4,500 hours of synthetic audio generated by 11 recent TTS models and 10 vocoders with a broad range of TTS/vocoder patterns. It is the largest deepfake audio dataset by scale.
arXiv Detail & Related papers (2025-09-04T16:03:44Z) - MAVOS-DD: Multilingual Audio-Video Open-Set Deepfake Detection Benchmark [108.46287432944392]
We present the first large-scale open-set benchmark for multilingual audio-video deepfake detection. Our dataset comprises over 250 hours of real and fake videos across eight languages. For each language, the fake videos are generated with seven distinct deepfake generation models.
arXiv Detail & Related papers (2025-05-16T10:42:30Z) - End-to-end Audio Deepfake Detection from RAW Waveforms: a RawNet-Based Approach with Cross-Dataset Evaluation [8.11594945165255]
We propose an end-to-end deep learning framework for audio deepfake detection that operates directly on raw waveforms. Our model, RawNetLite, is a lightweight convolutional-recurrent architecture designed to capture both spectral and temporal features without handcrafted preprocessing.
arXiv Detail & Related papers (2025-04-29T16:38:23Z) - CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors. We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models. In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z) - SafeEar: Content Privacy-Preserving Audio Deepfake Detection [17.859275594843965]
We propose SafeEar, a novel framework that aims to detect deepfake audio without accessing the speech content within.
Our key idea is to devise a neural audio codec into a decoupling model that cleanly separates the semantic and acoustic information in audio samples.
In this way, no semantic content is exposed to the detector.
arXiv Detail & Related papers (2024-09-14T02:45:09Z) - Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models [52.04189118767758]
Generalization is a central challenge for current audio deepfake detectors.
In this paper we study the potential of large-scale pre-trained models for audio deepfake detection.
arXiv Detail & Related papers (2024-05-03T15:27:11Z) - An RFP dataset for Real, Fake, and Partially fake audio detection [0.36832029288386137]
The paper presents the RFP dataset, which comprises five distinct audio types: partial fake (PF), audio with noise, voice conversion (VC), text-to-speech (TTS), and real.
The data are then used to evaluate several detection models, revealing that the available models incur a markedly higher equal error rate (EER) when detecting PF audio instead of entirely fake audio.
arXiv Detail & Related papers (2024-04-26T23:00:56Z) - SceneFake: An Initial Dataset and Benchmarks for Scene Fake Audio Detection [54.74467470358476]
This paper proposes a dataset for scene fake audio detection named SceneFake.
A manipulated audio is generated by only tampering with the acoustic scene of an original audio.
Some scene fake audio detection benchmark results on the SceneFake dataset are reported in this paper.
arXiv Detail & Related papers (2022-11-11T09:05:50Z) - Voice-Face Homogeneity Tells Deepfake [56.334968246631725]
Existing detection approaches contribute to exploring the specific artifacts in deepfake videos.
We propose to perform the deepfake detection from an unexplored voice-face matching view.
Our model obtains significantly improved performance as compared to other state-of-the-art competitors.
arXiv Detail & Related papers (2022-03-04T09:08:50Z) - Half-Truth: A Partially Fake Audio Detection Dataset [60.08010668752466]
This paper develops a dataset for half-truth audio detection (HAD).
Partially fake audio in the HAD dataset involves only changing a few words in an utterance.
Using this dataset, we can not only detect fake utterances but also localize the manipulated regions within speech.
arXiv Detail & Related papers (2021-04-08T08:57:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.