Cross-Technology Generalization in Synthesized Speech Detection: Evaluating AST Models with Modern Voice Generators
- URL: http://arxiv.org/abs/2503.22503v1
- Date: Fri, 28 Mar 2025 15:07:26 GMT
- Title: Cross-Technology Generalization in Synthesized Speech Detection: Evaluating AST Models with Modern Voice Generators
- Authors: Andrew Ustinov, Matey Yordanov, Andrei Kuchma, Mikhail Bychkov
- Abstract summary: This paper evaluates the Audio Spectrogram Transformer (AST) architecture for synthesized speech detection. Using differentiated augmentation strategies, the model achieves 0.91% EER overall when tested against ElevenLabs, NotebookLM, and Minimax AI voice generators.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper evaluates the Audio Spectrogram Transformer (AST) architecture for synthesized speech detection, with focus on generalization across modern voice generation technologies. Using differentiated augmentation strategies, the model achieves 0.91% EER overall when tested against ElevenLabs, NotebookLM, and Minimax AI voice generators. Notably, after training with only 102 samples from a single technology, the model demonstrates strong cross-technology generalization, achieving 3.3% EER on completely unseen voice generators. This work establishes benchmarks for rapid adaptation to emerging synthesis technologies and provides evidence that transformer-based architectures can identify common artifacts across different neural voice synthesis methods, contributing to more robust speech verification systems.
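As a concrete illustration of the setup described in the abstract, the sketch below adapts an AST checkpoint from the Hugging Face transformers library with a fresh two-class bona fide/synthesized head and shows how detection scores would be converted to the equal error rate (EER) the paper reports. The checkpoint name and the label convention (1 = synthesized) are assumptions for illustration, not the authors' released code.

```python
import numpy as np
import torch
from sklearn.metrics import roc_curve
from transformers import ASTFeatureExtractor, ASTForAudioClassification

# Assumed checkpoint; the paper does not name the exact AST weights used.
CKPT = "MIT/ast-finetuned-audioset-10-10-0.4593"

extractor = ASTFeatureExtractor.from_pretrained(CKPT)
model = ASTForAudioClassification.from_pretrained(
    CKPT, num_labels=2, ignore_mismatched_sizes=True  # fresh 2-class head
)
model.eval()

def synthetic_score(waveform_16k: np.ndarray) -> float:
    """Score a 16 kHz mono waveform; higher means 'more likely synthesized'
    under the assumed label convention (0 = bona fide, 1 = synthesized)."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

def compute_eer(labels: np.ndarray, scores: np.ndarray) -> float:
    """Equal error rate: the operating point where FPR equals FNR."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    i = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[i] + fnr[i]) / 2.0)
```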
Related papers
- Hybrid Audio Detection Using Fine-Tuned Audio Spectrogram Transformers: A Dataset-Driven Evaluation of Mixed AI-Human Speech [3.195044561824979]
We construct a novel hybrid audio dataset incorporating human, AI-generated, cloned, and mixed audio samples.
Our approach significantly outperforms existing baselines in mixed-audio detection, achieving 97% classification accuracy.
Our findings highlight the importance of hybrid datasets and tailored models in advancing the robustness of speech-based authentication systems.
arXiv Detail & Related papers (2025-05-21T05:43:41Z)
- VoiceGRPO: Modern MoE Transformers with Group Relative Policy Optimization GRPO for AI Voice Health Care Applications on Voice Pathology Detection [0.07673339435080444]
This research introduces novel AI techniques: Mixture-of-Experts Transformers with Group Relative Policy Optimization (GRPO).
We adopt advanced training paradigms inspired by reinforcement learning to enhance model stability and performance.
Experiments conducted on a synthetically generated voice pathology dataset demonstrate that our proposed models significantly improve diagnostic accuracy, F1 score, and ROC-AUC.
arXiv Detail & Related papers (2025-03-05T14:52:57Z)
- Speech-Forensics: Towards Comprehensive Synthetic Speech Dataset Establishment and Analysis [21.245160899212774]
We propose the Speech-Forensics dataset, which extensively covers authentic, synthetic, and partially forged speech samples.
We also propose a TEmporal Speech LocalizaTion network, called TEST, which simultaneously performs authenticity detection, localization of multiple fake segments, and synthesis algorithm recognition.
Our model achieves an average mAP of 83.55% and an EER of 5.25% at the utterance level.
arXiv Detail & Related papers (2024-12-12T07:48:17Z)
- Where are we in audio deepfake detection? A systematic analysis over generative and detection models [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark.
It provides a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content.
It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z)
- Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use a wav2vec pre-trained model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
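As a hedged sketch of that first stage (not the authors' code), the snippet below extracts frame-level wav2vec 2.0 representations with the Hugging Face transformers library; the specific checkpoint is an assumption, since the abstract only says "wav2vec pre-trained model".

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Illustrative checkpoint standing in for the unspecified wav2vec weights.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
encoder.eval()

def high_level_features(waveform_16k):
    """Frame-level speech representations to feed a downstream fake-audio detector."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        return encoder(**inputs).last_hidden_state  # (1, frames, 768)
```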
arXiv Detail & Related papers (2022-08-20T06:46:55Z)
- SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis [50.236929707024245]
The SOMOS dataset is the first large-scale mean opinion scores (MOS) dataset consisting of solely neural text-to-speech (TTS) samples.
It consists of 20K synthetic utterances of the LJ Speech voice, a public domain speech dataset.
arXiv Detail & Related papers (2022-04-06T18:45:20Z)
- ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis [13.676243543864347]
We propose an end-to-end method that is able to generate high-quality speech and better similarity for both seen and unseen speakers.
The method consists of three separately trained components: a speaker encoder based on the state-of-the-art ECAPA-TDNN, a FastSpeech2-based synthesizer, and a HiFi-GAN vocoder.
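For the speaker-encoder component, a minimal sketch using SpeechBrain's publicly released ECAPA-TDNN model is shown below; this checkpoint is an assumed stand-in, not necessarily the encoder trained by the authors.

```python
import torch
from speechbrain.pretrained import EncoderClassifier

# Public ECAPA-TDNN speaker encoder trained on VoxCeleb (assumed stand-in).
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

wav = torch.randn(1, 16000)            # one second of dummy 16 kHz audio
embedding = encoder.encode_batch(wav)  # (1, 1, 192) speaker embedding
```

In a multi-speaker pipeline of this kind, the embedding would condition the FastSpeech2 synthesizer, whose mel-spectrogram output is then vocoded by HiFi-GAN.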
arXiv Detail & Related papers (2022-03-20T07:04:26Z)
- Advances in Speech Vocoding for Text-to-Speech with Continuous Parameters [2.6572330982240935]
This paper presents new techniques for a continuous vocoder, in which all features are continuous, providing a flexible speech synthesis system.
A new continuous noise masking method based on phase distortion is proposed to eliminate the perceptual impact of residual noise.
Bidirectional long short-term memory (LSTM) and gated recurrent unit (GRU) networks are studied and applied to model the continuous parameters for more natural, human-sounding speech.
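A minimal sketch of such a recurrent parameter model is given below; the feature dimensions are hypothetical, and nn.GRU is a drop-in replacement for the LSTM variant.

```python
import torch
import torch.nn as nn

class ContinuousParamModel(nn.Module):
    """Bidirectional LSTM mapping linguistic input features to continuous
    vocoder parameters; all dimensions here are hypothetical."""
    def __init__(self, in_dim=425, hidden=256, out_dim=82):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, out_dim)

    def forward(self, x):        # x: (batch, frames, in_dim)
        h, _ = self.rnn(x)
        return self.proj(h)      # (batch, frames, out_dim)
```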
arXiv Detail & Related papers (2021-06-19T12:05:01Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
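The transfer idea can be sketched as follows (an illustrative toy, not the paper's implementation): initialize the VC model's encoder from a model pretrained on a large TTS or ASR corpus, then fine-tune on the conversion data.

```python
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Toy seq2seq skeleton standing in for a VC/TTS model."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.GRU(80, 256, batch_first=True)
        self.decoder = nn.GRU(256, 256, batch_first=True)
        self.out = nn.Linear(256, 80)

tts = Seq2Seq()  # pretend: pretrained on a large-scale TTS corpus
vc = Seq2Seq()
vc.encoder.load_state_dict(tts.encoder.state_dict())  # transfer encoder weights
# ...then fine-tune `vc` on the (much smaller) voice-conversion corpus.
```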
arXiv Detail & Related papers (2020-08-07T11:02:07Z)