Exposing Synthetic Speech: Model Attribution and Detection of AI-generated Speech via Audio Fingerprints
- URL: http://arxiv.org/abs/2411.14013v3
- Date: Thu, 04 Sep 2025 12:43:52 GMT
- Title: Exposing Synthetic Speech: Model Attribution and Detection of AI-generated Speech via Audio Fingerprints
- Authors: MatÃas Pizarro, Mike Laszkiewicz, Shawkat Hesso, Dorothea Kolossa, Asja Fischer,
- Abstract summary: We introduce a training-free, yet effective approach for detecting AI-generated speech.<n>We tackle three key tasks: (1) single-model attribution in an open-world setting, (2) multi-model attribution in a closed-world setting, and (3) detection of synthetic versus real speech.
- Score: 11.703509488782345
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As speech generation technologies continue to advance in quality and accessibility, the risk of malicious use cases, including impersonation, misinformation, and spoofing, increases rapidly. This work addresses this threat by introducing a simple, training-free, yet effective approach for detecting AI-generated speech and attributing it to its source model. Specifically, we tackle three key tasks: (1) single-model attribution in an open-world setting, where the goal is to determine whether a given audio sample was generated by a specific target neural speech synthesis system (with access only to data from that system); (2) multi-model attribution in a closed-world setting, where the objective is to identify the generating system from a known pool of candidates; and last but not least (3) detection of synthetic versus real speech. Our approach leverages standardized average residuals-the difference between an input audio signal and its filtered version using either a low-pass filter or the EnCodec audio autoencoder. We demonstrate that these residuals consistently capture artifacts introduced by diverse speech synthesis systems, serving as distinctive, model-agnostic fingerprints for attribution. Across extensive experiments, our approach achieves AUROC scores exceeding 99% in most scenarios, evaluated on augmented benchmark datasets that pair real speech with synthetic audio generated by multiple synthesis systems. In addition, our robustness analysis underscores the method's ability to maintain high performance even in the presence of moderate additive noise. Due to its simplicity, efficiency, and strong generalization across speech synthesis systems and languages, this technique offers a practical tool for digital forensics and security applications.
Related papers
- Hybrid Audio Detection Using Fine-Tuned Audio Spectrogram Transformers: A Dataset-Driven Evaluation of Mixed AI-Human Speech [3.195044561824979]
We construct a novel hybrid audio dataset incorporating human, AI-generated, cloned, and mixed audio samples.<n>Our approach significantly outperforms existing baselines in mixed-audio detection, achieving 97% classification accuracy.<n>Our findings highlight the importance of hybrid datasets and tailored models in advancing the robustness of speech-based authentication systems.
arXiv Detail & Related papers (2025-05-21T05:43:41Z) - Where are we in audio deepfake detection? A systematic analysis over generative and detection models [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark.<n>It provides a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content.<n>It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z) - End-to-end streaming model for low-latency speech anonymization [11.098498920630782]
We propose a streaming model that achieves speaker anonymization with low latency.
The system is trained in an end-to-end autoencoder fashion using a lightweight content encoder.
We present evaluation results from two implementations of our system.
arXiv Detail & Related papers (2024-06-13T16:15:53Z) - VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z) - Fill in the Gap! Combining Self-supervised Representation Learning with Neural Audio Synthesis for Speech Inpainting [14.402357651227003]
We investigate the use of a speech SSL model for speech inpainting, that is reconstructing a missing portion of a speech signal from its surrounding context.
To that purpose, we combine an SSL encoder, namely HuBERT, with a neural vocoder, namely HiFiGAN, playing the role of a decoder.
arXiv Detail & Related papers (2024-05-30T14:41:39Z) - Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models [84.8919069953397]
Self-TAught Recognizer (STAR) is an unsupervised adaptation framework for speech recognition systems.
We show that STAR achieves an average of 13.5% relative reduction in word error rate across 14 target domains.
STAR exhibits high data efficiency that only requires less than one-hour unlabeled data.
arXiv Detail & Related papers (2024-05-23T04:27:11Z) - Multimodal Data and Resource Efficient Device-Directed Speech Detection
with Large Foundation Models [43.155061160275196]
We explore the possibility of making interactions with virtual assistants more natural by eliminating the need for a trigger phrase.
Our goal is to determine whether a user addressed the virtual assistant based on signals obtained from the streaming audio recorded by the device microphone.
We address this task by combining 1-best hypotheses and decoder signals from an automatic speech recognition system with acoustic representations from an audio encoder.
arXiv Detail & Related papers (2023-12-06T17:29:03Z) - TokenSplit: Using Discrete Speech Representations for Direct, Refined,
and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture.
We show that our model achieves excellent performance in terms of separation, both with or without transcript conditioning.
We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
arXiv Detail & Related papers (2023-08-21T01:52:01Z) - All-for-One and One-For-All: Deep learning-based feature fusion for
Synthetic Speech Detection [18.429817510387473]
Recent advances in deep learning and computer vision have made the synthesis and counterfeiting of multimedia content more accessible than ever.
In this paper, we consider three different feature sets proposed in the literature for the synthetic speech detection task and present a model that fuses them.
The system was tested on different scenarios and datasets to prove its robustness to anti-forensic attacks and its generalization capabilities.
arXiv Detail & Related papers (2023-07-28T13:50:25Z) - Topic Identification For Spontaneous Speech: Enriching Audio Features
With Embedded Linguistic Information [10.698093106994804]
Traditional topic identification solutions from audio rely on an automatic speech recognition system (ASR) to produce transcripts.
We compare audio-only and hybrid techniques of jointly utilising text and audio features.
The models evaluated on spontaneous Finnish speech demonstrate that purely audio-based solutions are a viable option when ASR components are not available.
arXiv Detail & Related papers (2023-07-21T09:30:46Z) - Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive
Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z) - Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z) - Wave-U-Net Discriminator: Fast and Lightweight Discriminator for
Generative Adversarial Network-Based Speech Synthesis [38.27153023145183]
In speech synthesis, a generative adversarial network (GAN) is used to train a generator (speech synthesizer) and a discriminator in a min-max game.
An ensemble of discriminators is commonly used in recent neural vocoders (HiFi-GAN) and end-to-end text-to-speech (TTS) systems.
This study proposes a Wave-U-Net discriminator, which is a single but expressive discriminator with Wave-U-Net architecture.
arXiv Detail & Related papers (2023-03-24T10:46:40Z) - A Vector Quantized Approach for Text to Speech Synthesis on Real-World
Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
Recent Text-to-Speech architecture is designed for multiple code generation and monotonic alignment.
We show thatRecent Text-to-Speech architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z) - Self-Supervised Learning for Speech Enhancement through Synthesis [5.924928860260821]
We propose a denoising vocoder (DeVo) approach, where a vocoder accepts noisy representations and learns to directly synthesize clean speech.
We demonstrate a causal version capable of running on streaming audio with 10ms latency and minimal performance degradation.
arXiv Detail & Related papers (2022-11-04T16:06:56Z) - Combining Automatic Speaker Verification and Prosody Analysis for
Synthetic Speech Detection [15.884911752869437]
We present a novel approach for synthetic speech detection, exploiting the combination of two high-level semantic properties of the human voice.
On one side, we focus on speaker identity cues and represent them as speaker embeddings extracted using a state-of-the-art method for the automatic speaker verification task.
On the other side, voice prosody, intended as variations in rhythm, pitch or accent in speech, is extracted through a specialized encoder.
arXiv Detail & Related papers (2022-10-31T11:03:03Z) - Adaptive re-calibration of channel-wise features for Adversarial Audio
Classification [0.0]
We propose a recalibration of features using attention feature fusion for synthetic speech detection.
We compare its performance against different detection methods including End2End models and Resnet-based models.
We also demonstrate that the combination of Linear frequency cepstral coefficients (LFCC) and Mel Frequency cepstral coefficients (MFCC) using the attentional feature fusion technique creates better input features representations.
arXiv Detail & Related papers (2022-10-21T04:21:56Z) - Deepfake audio detection by speaker verification [79.99653758293277]
We propose a new detection approach that leverages only the biometric characteristics of the speaker, with no reference to specific manipulations.
The proposed approach can be implemented based on off-the-shelf speaker verification tools.
We test several such solutions on three popular test sets, obtaining good performance, high generalization ability, and high robustness to audio impairment.
arXiv Detail & Related papers (2022-09-28T13:46:29Z) - Discretization and Re-synthesis: an alternative method to solve the
Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
By utilizing the synthesis model with the input of discrete symbols, after the prediction of discrete symbol sequence, each target speech could be re-synthesized.
arXiv Detail & Related papers (2021-12-17T08:35:40Z) - Speech Pattern based Black-box Model Watermarking for Automatic Speech
Recognition [83.2274907780273]
How to design a black-box watermarking scheme for automatic speech recognition models is still an unsolved problem.
We propose the first black-box model watermarking framework for protecting the IP of ASR models.
Experiments on the state-of-the-art open-source ASR system DeepSpeech demonstrate the feasibility of the proposed watermarking scheme.
arXiv Detail & Related papers (2021-10-19T09:01:41Z) - Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs
for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z) - Using multiple reference audios and style embedding constraints for
speech synthesis [68.62945852651383]
The proposed model can improve the speech naturalness and content quality with multiple reference audios.
The model can also outperform the baseline model in ABX preference tests of style similarity.
arXiv Detail & Related papers (2021-10-09T04:24:29Z) - End-to-End Video-To-Speech Synthesis using Generative Adversarial
Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs)
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z) - Knowledge Transfer for Efficient On-device False Trigger Mitigation [17.53768388104929]
An undirected utterance is termed as a "false trigger" and false trigger mitigation (FTM) is essential for designing a privacy-centric smart assistant.
We propose an LSTM-based FTM architecture which determines the user intent from acoustic features directly without explicitly generating ASR transcripts.
arXiv Detail & Related papers (2020-10-20T20:01:44Z) - VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device
Speech Recognition [60.462770498366524]
We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user.
We show that such a model can be quantized as a 8-bit integer model and run in realtime.
arXiv Detail & Related papers (2020-09-09T14:26:56Z) - Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence
Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.