Audio Deepfake Detection in the Age of Advanced Text-to-Speech models
- URL: http://arxiv.org/abs/2601.20510v1
- Date: Wed, 28 Jan 2026 11:39:40 GMT
- Title: Audio Deepfake Detection in the Age of Advanced Text-to-Speech models
- Authors: Robin Singh, Aditya Yogesh Nair, Fabio Palumbo, Florian Barbaro, Anna Dyka, Lohith Rachakonda
- Abstract summary: Recent advances in Text-to-Speech (TTS) systems have substantially increased the realism of synthetic speech.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in Text-to-Speech (TTS) systems have substantially increased the realism of synthetic speech, raising new challenges for audio deepfake detection. This work presents a comparative evaluation of three state-of-the-art TTS models--Dia2, Maya1, and MeloTTS--representing streaming, LLM-based, and non-autoregressive architectures. A corpus of 12,000 synthetic audio samples was generated using the Daily-Dialog dataset and evaluated against four detection frameworks, including semantic, structural, and signal-level approaches. The results reveal significant variability in detector performance across generative mechanisms: models effective against one TTS architecture may fail against others, particularly LLM-based synthesis. In contrast, a multi-view detection approach combining complementary analysis levels demonstrates robust performance across all evaluated models. These findings highlight the limitations of single-paradigm detectors and emphasize the necessity of integrated detection strategies to address the evolving landscape of audio deepfake threats.
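The multi-view strategy can be made concrete with a short sketch. The snippet below is a minimal, hypothetical fusion of three per-view "probability of fake" scores (semantic, structural, signal-level); the weights, threshold, and score interface are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def multi_view_score(semantic: float, structural: float, signal: float,
                     weights=(0.4, 0.3, 0.3)) -> float:
    """Fuse per-view 'probability of fake' scores, each assumed in [0, 1].

    The weights are illustrative; a real system would learn or calibrate them.
    """
    return float(np.dot(np.asarray(weights), [semantic, structural, signal]))

def is_fake(scores: dict, threshold: float = 0.5) -> bool:
    # Hypothetical view names; any set of complementary detectors could stand in.
    fused = multi_view_score(scores["semantic"], scores["structural"], scores["signal"])
    return fused >= threshold

# A clip the signal-level detector misses but the semantic view flags:
clip = {"semantic": 0.9, "structural": 0.6, "signal": 0.2}
print(is_fake(clip))  # True: complementary views compensate for one weak detector
```

The point mirrors the abstract's finding: when a single-paradigm detector fails against one generative mechanism, views that analyze the audio at other levels can compensate.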
Related papers
- A SUPERB-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection [2.432576583937997]
Spoof-SUPERB is a benchmark for audio deepfake detection.
We evaluate 20 SSL models spanning generative, discriminative, and spectrogram-based architectures.
arXiv Detail & Related papers (2026-03-02T05:45:55Z)
- T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation [41.03487954415606]
Text-to-Audio-Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language.
We present T2AV-Compass, a unified benchmark for comprehensive evaluation of T2AV systems.
Even the strongest models fall substantially short of human-level realism and cross-modal consistency.
arXiv Detail & Related papers (2025-12-24T10:30:35Z)
- Who Can Withstand Chat-Audio Attacks? An Evaluation Benchmark for Large Audio-Language Models [60.72029578488467]
Adversarial audio attacks pose a significant threat to the growing use of large audio-language models (LALMs) in human-machine interactions.
We introduce the Chat-Audio Attacks benchmark, which includes four distinct types of audio attacks.
We evaluate six state-of-the-art LALMs with voice interaction capabilities, including Gemini-1.5-Pro, GPT-4o, and others.
arXiv Detail & Related papers (2024-11-22T10:30:48Z)
- Exposing Synthetic Speech: Model Attribution and Detection of AI-generated Speech via Audio Fingerprints [11.703509488782345]
We introduce a training-free, yet effective approach for detecting AI-generated speech.
We tackle three key tasks: (1) single-model attribution in an open-world setting, (2) multi-model attribution in a closed-world setting, and (3) detection of synthetic versus real speech.
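As a rough illustration of fingerprint-style attribution (a sketch of the general idea, not this paper's exact procedure), one can define a per-model fingerprint as the mean log-magnitude spectrum of known samples from each generator and attribute a new clip to the closest fingerprint by cosine similarity. The feature choice and framing parameters below are assumptions.

```python
import numpy as np

def log_spectrum(wave: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Frame the waveform and average its log-magnitude spectrum over time."""
    frames = np.lib.stride_tricks.sliding_window_view(wave, n_fft)[::n_fft // 2]
    mags = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=-1))
    return np.log(mags + 1e-8).mean(axis=0)

def build_fingerprints(samples_by_model: dict[str, list[np.ndarray]]) -> dict:
    """Fingerprint = mean log-spectrum per TTS model (an assumed definition)."""
    return {name: np.mean([log_spectrum(w) for w in waves], axis=0)
            for name, waves in samples_by_model.items()}

def attribute(wave: np.ndarray, fingerprints: dict) -> str:
    """Closed-world attribution: pick the most cosine-similar fingerprint."""
    feat = log_spectrum(wave)
    cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return max(fingerprints, key=lambda name: cos(feat, fingerprints[name]))
```

Being training-free, a scheme like this needs only a handful of reference samples per generator rather than a learned classifier.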
arXiv Detail & Related papers (2024-11-21T10:55:49Z)
- Where are we in audio deepfake detection? A systematic analysis over generative and detection models [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark.
It provides a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content.
It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding [57.42429912884543]
We propose Diff-LM-Speech, Tetra-Diff-Speech and Tri-Diff-Speech to solve high dimensionality and waveform distortion problems.
We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability.
Experimental results show that our proposed methods outperform baseline methods.
arXiv Detail & Related papers (2023-07-28T11:20:23Z)
- NPVForensics: Jointing Non-critical Phonemes and Visemes for Deepfake Detection [50.33525966541906]
Existing multimodal detection methods capture audio-visual inconsistencies to expose Deepfake videos.
We propose a novel Deepfake detection method to mine the correlation between Non-critical Phonemes and Visemes, termed NPVForensics.
Our model can be easily adapted to the downstream Deepfake datasets with fine-tuning.
arXiv Detail & Related papers (2023-06-12T06:06:05Z)
- Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use a pre-trained wav2vec model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
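To make the first stage concrete, here is a minimal sketch of extracting wav2vec 2.0 representations with the Hugging Face transformers library and pooling them for a binary fake/real head. The checkpoint name and the plain linear head (standing in for the light-DARTS searched classifier) are assumptions for illustration.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Checkpoint is an assumption; any wav2vec 2.0 checkpoint works the same way.
ckpt = "facebook/wav2vec2-base"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(ckpt)
encoder = Wav2Vec2Model.from_pretrained(ckpt).eval()

# A plain linear head stands in for the light-DARTS searched architecture.
head = torch.nn.Linear(encoder.config.hidden_size, 2)

def classify(wave_16k: torch.Tensor) -> torch.Tensor:
    """Return fake/real logits for one 16 kHz mono waveform."""
    inputs = extractor(wave_16k.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, frames, hidden_size)
    return head(hidden.mean(dim=1))                   # mean-pool over time
```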
arXiv Detail & Related papers (2022-08-20T06:46:55Z)
- Advances in Speech Vocoding for Text-to-Speech with Continuous Parameters [2.6572330982240935]
This paper presents new techniques for a continuous vocoder, in which all features are continuous, yielding a flexible speech synthesis system.
A new continuous noise masking scheme based on phase distortion is proposed to eliminate the perceptual impact of residual noise.
Bidirectional long short-term memory (LSTM) and gated recurrent unit (GRU) networks are studied and applied to model the continuous parameters, yielding more natural, human-like speech.
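As a minimal PyTorch sketch of that sequence-modeling idea, the model below maps frame-level input features to continuous vocoder parameter trajectories with a bidirectional LSTM; all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContinuousParamModel(nn.Module):
    """BiLSTM regressor from input features to continuous vocoder parameters.

    Dimensions are illustrative, e.g. F0 and spectral coefficients
    concatenated into `n_params` continuous per-frame targets.
    """
    def __init__(self, n_inputs: int = 80, n_params: int = 42, hidden: int = 256):
        super().__init__()
        self.rnn = nn.LSTM(n_inputs, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_params)  # 2x: forward + backward states

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.rnn(x)   # (batch, frames, 2 * hidden)
        return self.proj(out)  # (batch, frames, n_params)

model = ContinuousParamModel()
feats = torch.randn(1, 200, 80)  # one utterance, 200 frames of input features
params = model(feats)            # continuous parameter trajectories
```

Swapping `nn.LSTM` for `nn.GRU` gives the GRU variant the paper compares against.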
arXiv Detail & Related papers (2021-06-19T12:05:01Z)