Related papers: Pitch Imperfect: Detecting Audio Deepfakes Through Acoustic Prosodic Analysis

Pitch Imperfect: Detecting Audio Deepfakes Through Acoustic Prosodic Analysis

URL: http://arxiv.org/abs/2502.14726v1
Date: Thu, 20 Feb 2025 16:52:55 GMT
Title: Pitch Imperfect: Detecting Audio Deepfakes Through Acoustic Prosodic Analysis
Authors: Kevin Warren, Daniel Olszewski, Seth Layton, Kevin Butler, Carrie Gates, Patrick Traynor,
Abstract summary: We explore the use of prosody, or the high-level linguistic features of human speech, as a more foundational means of detecting audio deepfakes.<n>We develop a detector based on six classical prosodic features and demonstrate that our model performs as well as other baseline models.<n>We show that we can explain the prosodic features that have highest impact on the model's decision.
Score: 6.858439600092057
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Audio deepfakes are increasingly in-differentiable from organic speech, often fooling both authentication systems and human listeners. While many techniques use low-level audio features or optimization black-box model training, focusing on the features that humans use to recognize speech will likely be a more long-term robust approach to detection. We explore the use of prosody, or the high-level linguistic features of human speech (e.g., pitch, intonation, jitter) as a more foundational means of detecting audio deepfakes. We develop a detector based on six classical prosodic features and demonstrate that our model performs as well as other baseline models used by the community to detect audio deepfakes with an accuracy of 93% and an EER of 24.7%. More importantly, we demonstrate the benefits of using a linguistic features-based approach over existing models by applying an adaptive adversary using an $L_{\infty}$ norm attack against the detectors and using attention mechanisms in our training for explainability. We show that we can explain the prosodic features that have highest impact on the model's decision (Jitter, Shimmer and Mean Fundamental Frequency) and that other models are extremely susceptible to simple $L_{\infty}$ norm attacks (99.3% relative degradation in accuracy). While overall performance may be similar, we illustrate the robustness and explainability benefits to a prosody feature approach to audio deepfake detection.

Related papers

AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation [55.607230723223346]
This work presents a systematic study of Large Audio Model (LAM) as a Judge, AudioJudge, investigating whether it can provide a unified evaluation framework that addresses both challenges.<n>We explore AudioJudge across audio characteristic detection tasks, including pronunciation, speaking rate, speaker identification and speech quality, and system-level human preference simulation for automated benchmarking.<n>We introduce a multi-aspect ensemble AudioJudge to enable general-purpose multi-aspect audio evaluation. This method decomposes speech assessment into specialized judges for lexical content, speech quality, and paralinguistic features, achieving up to 0.91 Spearman correlation with human preferences on
arXiv Detail & Related papers (2025-07-17T00:39:18Z)
Anomaly Detection and Localization for Speech Deepfakes via Feature Pyramid Matching [8.466707742593078]
Speech deepfakes are synthetic audio signals that can imitate target speakers' voices. Existing methods for detecting speech deepfakes rely on supervised learning. We introduce a novel interpretable one-class detection framework, which reframes speech deepfake detection as an anomaly detection task.
arXiv Detail & Related papers (2025-03-23T11:15:22Z)
Measuring the Robustness of Audio Deepfake Detectors [59.09338266364506]
This work systematically evaluates the robustness of 10 audio deepfake detection models against 16 common corruptions. Using both traditional deep learning models and state-of-the-art foundation models, we make four unique observations.
arXiv Detail & Related papers (2025-03-21T23:21:17Z)
Where are we in audio deepfake detection? A systematic analysis over generative and detection models [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark. It provides a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content. It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
Investigating Causal Cues: Strengthening Spoofed Audio Detection with Human-Discernible Linguistic Features [0.353122873734926]
Several types of spoofed audio, such as mimicry, replay attacks, and deepfakes, have created societal challenges to information integrity. Recently, researchers have worked with sociolinguistics experts to label spoofed audio samples with Expert Defined Linguistic Features (EDLFs) It is established that there is an improvement in several deepfake detection algorithms when they augmented the traditional and common features of audio data with EDLFs.
arXiv Detail & Related papers (2024-09-09T19:47:57Z)
Statistics-aware Audio-visual Deepfake Detector [11.671275975119089]
Methods in audio-visualfake detection mostly assess the synchronization between audio and visual features. We propose a statistical feature loss to enhance the discrimination capability of the model. Experiments on the DFDC and FakeAVCeleb datasets demonstrate the relevance of the proposed method.
arXiv Detail & Related papers (2024-07-16T12:15:41Z)
Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models [52.04189118767758]
Generalization is a main issue for current audio deepfake detectors. In this paper we study the potential of large-scale pre-trained models for audio deepfake detection.
arXiv Detail & Related papers (2024-05-03T15:27:11Z)
Deepfake audio detection by speaker verification [79.99653758293277]
We propose a new detection approach that leverages only the biometric characteristics of the speaker, with no reference to specific manipulations. The proposed approach can be implemented based on off-the-shelf speaker verification tools. We test several such solutions on three popular test sets, obtaining good performance, high generalization ability, and high robustness to audio impairment.
arXiv Detail & Related papers (2022-09-28T13:46:29Z)
A Lightweight Speaker Recognition System Using Timbre Properties [0.5708902722746041]
We propose a lightweight text-independent speaker recognition model based on random forest classifier. It also introduces new features that are used for both speaker verification and identification tasks. The prototype uses seven most actively searched properties, boominess, brightness, depth, hardness, timbre, sharpness, and warmth.
arXiv Detail & Related papers (2020-10-12T07:56:03Z)
Audio Impairment Recognition Using a Correlation-Based Feature Representation [85.08880949780894]
We propose a new representation of hand-crafted features that is based on the correlation of feature pairs. We show superior performance in terms of compact feature dimensionality and improved computational speed in the test stage.
arXiv Detail & Related papers (2020-03-22T13:34:37Z)
Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features. We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.