Cross-Technology Generalization in Synthesized Speech Detection: Evaluating AST Models with Modern Voice Generators
- URL: http://arxiv.org/abs/2503.22503v1
- Date: Fri, 28 Mar 2025 15:07:26 GMT
- Title: Cross-Technology Generalization in Synthesized Speech Detection: Evaluating AST Models with Modern Voice Generators
- Authors: Andrew Ustinov, Matey Yordanov, Andrei Kuchma, Mikhail Bychkov
- Abstract summary: This paper evaluates the Audio Spectrogram Transformer (AST) architecture for synthesized speech detection. Using differentiated augmentation strategies, the model achieves 0.91% EER overall when tested against ElevenLabs, NotebookLM, and Minimax AI voice generators.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper evaluates the Audio Spectrogram Transformer (AST) architecture for synthesized speech detection, with focus on generalization across modern voice generation technologies. Using differentiated augmentation strategies, the model achieves 0.91% EER overall when tested against ElevenLabs, NotebookLM, and Minimax AI voice generators. Notably, after training with only 102 samples from a single technology, the model demonstrates strong cross-technology generalization, achieving 3.3% EER on completely unseen voice generators. This work establishes benchmarks for rapid adaptation to emerging synthesis technologies and provides evidence that transformer-based architectures can identify common artifacts across different neural voice synthesis methods, contributing to more robust speech verification systems.
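As a concrete illustration of the setup described in the abstract, the sketch below adapts an AST checkpoint from the Hugging Face transformers library with a fresh two-class bona fide/synthesized head and shows how detection scores would be converted to the equal error rate (EER) the paper reports. The checkpoint name and the label convention (1 = synthesized) are assumptions for illustration, not the authors' released code.

```python
import numpy as np
import torch
from sklearn.metrics import roc_curve
from transformers import ASTFeatureExtractor, ASTForAudioClassification

# Assumed checkpoint; the paper does not name the exact AST weights used.
CKPT = "MIT/ast-finetuned-audioset-10-10-0.4593"

extractor = ASTFeatureExtractor.from_pretrained(CKPT)
model = ASTForAudioClassification.from_pretrained(
    CKPT, num_labels=2, ignore_mismatched_sizes=True  # fresh 2-class head
)
model.eval()

def synthetic_score(waveform_16k: np.ndarray) -> float:
    """Score a 16 kHz mono waveform; higher means 'more likely synthesized'
    under the assumed label convention (0 = bona fide, 1 = synthesized)."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

def compute_eer(labels: np.ndarray, scores: np.ndarray) -> float:
    """Equal error rate: the operating point where FPR equals FNR."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    i = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[i] + fnr[i]) / 2.0)
```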
Related papers
- Hybrid Audio Detection Using Fine-Tuned Audio Spectrogram Transformers: A Dataset-Driven Evaluation of Mixed AI-Human Speech [3.195044561824979]
We construct a novel hybrid audio dataset incorporating human, AI-generated, cloned, and mixed audio samples.
Our approach significantly outperforms existing baselines in mixed-audio detection, achieving 97% classification accuracy.
Our findings highlight the importance of hybrid datasets and tailored models in advancing the robustness of speech-based authentication systems.
arXiv Detail & Related papers (2025-05-21T05:43:41Z)
- VoiceGRPO: Modern MoE Transformers with Group Relative Policy Optimization GRPO for AI Voice Health Care Applications on Voice Pathology Detection [0.07673339435080444]
This research introduces novel AI techniques: Mixture-of-Experts Transformers with Group Relative Policy Optimization (GRPO).
We adopt advanced training paradigms inspired by reinforcement learning to enhance model stability and performance.
Experiments conducted on a synthetically generated voice pathology dataset demonstrate that our proposed models significantly improve diagnostic accuracy, F1 score, and ROC-AUC.
arXiv Detail & Related papers (2025-03-05T14:52:57Z)
- Speech-Forensics: Towards Comprehensive Synthetic Speech Dataset Establishment and Analysis [21.245160899212774]
We propose the Speech-Forensics dataset, which extensively covers authentic, synthetic, and partially forged speech samples.
We also propose a TEmporal Speech LocalizaTion network, called TEST, which simultaneously performs authenticity detection, localization of multiple fake segments, and synthesis algorithm recognition.
Our model achieves an average mAP of 83.55% and an EER of 5.25% at the utterance level.
arXiv Detail & Related papers (2024-12-12T07:48:17Z)
- Where are we in audio deepfake detection? A systematic analysis over generative and detection models [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark.
It provides a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content.
It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z)
- Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use a wav2vec pre-trained model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
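As a hedged sketch of that first stage (not the authors' code), the snippet below extracts frame-level wav2vec 2.0 representations with the Hugging Face transformers library; the specific checkpoint is an assumption, since the abstract only says "wav2vec pre-trained model".

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Illustrative checkpoint standing in for the unspecified wav2vec weights.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
encoder.eval()

def high_level_features(waveform_16k):
    """Frame-level speech representations to feed a downstream fake-audio detector."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        return encoder(**inputs).last_hidden_state  # (1, frames, 768)
```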
arXiv Detail & Related papers (2022-08-20T06:46:55Z)
- SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis [50.236929707024245]
The SOMOS dataset is the first large-scale mean opinion scores (MOS) dataset consisting of solely neural text-to-speech (TTS) samples.
It consists of 20K synthetic utterances of the LJ Speech voice, a public domain speech dataset.
arXiv Detail & Related papers (2022-04-06T18:45:20Z)
- ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis [13.676243543864347]
We propose an end-to-end method that is able to generate high-quality speech and better similarity for both seen and unseen speakers.
The method consists of three separately trained components: a speaker encoder based on the state-of-the-art ECAPA-TDNN, a FastSpeech2-based synthesizer, and a HiFi-GAN vocoder.
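For the speaker-encoder component, a minimal sketch using SpeechBrain's publicly released ECAPA-TDNN model is shown below; this checkpoint is an assumed stand-in, not necessarily the encoder trained by the authors.

```python
import torch
from speechbrain.pretrained import EncoderClassifier

# Public ECAPA-TDNN speaker encoder trained on VoxCeleb (assumed stand-in).
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

wav = torch.randn(1, 16000)            # one second of dummy 16 kHz audio
embedding = encoder.encode_batch(wav)  # (1, 1, 192) speaker embedding
```

In a multi-speaker pipeline of this kind, the embedding would condition the FastSpeech2 synthesizer, whose mel-spectrogram output is then vocoded by HiFi-GAN.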
arXiv Detail & Related papers (2022-03-20T07:04:26Z)
- Advances in Speech Vocoding for Text-to-Speech with Continuous Parameters [2.6572330982240935]
This paper presents new techniques for a continuous vocoder, in which all features are continuous, providing a flexible speech synthesis system.
A new continuous noise masking method based on phase distortion is proposed to eliminate the perceptual impact of residual noise.
Bidirectional long short-term memory (LSTM) and gated recurrent unit (GRU) networks are studied and applied to model the continuous parameters for more natural, human-sounding speech.
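A minimal sketch of such a recurrent parameter model is given below; the feature dimensions are hypothetical, and nn.GRU is a drop-in replacement for the LSTM variant.

```python
import torch
import torch.nn as nn

class ContinuousParamModel(nn.Module):
    """Bidirectional LSTM mapping linguistic input features to continuous
    vocoder parameters; all dimensions here are hypothetical."""
    def __init__(self, in_dim=425, hidden=256, out_dim=82):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, out_dim)

    def forward(self, x):        # x: (batch, frames, in_dim)
        h, _ = self.rnn(x)
        return self.proj(h)      # (batch, frames, out_dim)
```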
arXiv Detail & Related papers (2021-06-19T12:05:01Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
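The transfer idea can be sketched as follows (an illustrative toy, not the paper's implementation): initialize the VC model's encoder from a model pretrained on a large TTS or ASR corpus, then fine-tune on the conversion data.

```python
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Toy seq2seq skeleton standing in for a VC/TTS model."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.GRU(80, 256, batch_first=True)
        self.decoder = nn.GRU(256, 256, batch_first=True)
        self.out = nn.Linear(256, 80)

tts = Seq2Seq()  # pretend: pretrained on a large-scale TTS corpus
vc = Seq2Seq()
vc.encoder.load_state_dict(tts.encoder.state_dict())  # transfer encoder weights
# ...then fine-tune `vc` on the (much smaller) voice-conversion corpus.
```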
arXiv Detail & Related papers (2020-08-07T11:02:07Z)