Combining Automatic Speaker Verification and Prosody Analysis for
Synthetic Speech Detection
- URL: http://arxiv.org/abs/2210.17222v1
- Date: Mon, 31 Oct 2022 11:03:03 GMT
- Title: Combining Automatic Speaker Verification and Prosody Analysis for
Synthetic Speech Detection
- Authors: Luigi Attorresi, Davide Salvi, Clara Borrelli, Paolo Bestagini,
Stefano Tubaro
- Abstract summary: We present a novel approach for synthetic speech detection, exploiting the combination of two high-level semantic properties of the human voice.
On one side, we focus on speaker identity cues and represent them as speaker embeddings extracted using a state-of-the-art method for the automatic speaker verification task.
On the other side, voice prosody, understood as variations in rhythm, pitch, or accent in speech, is extracted through a specialized encoder.
- Score: 15.884911752869437
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid spread of media content synthesis technology and the potentially
damaging impact of audio and video deepfakes on people's lives have raised the
need to implement systems able to detect these forgeries automatically. In this
work we present a novel approach for synthetic speech detection, exploiting the
combination of two high-level semantic properties of the human voice. On one
side, we focus on speaker identity cues and represent them as speaker
embeddings extracted using a state-of-the-art method for the automatic speaker
verification task. On the other side, voice prosody, understood as variations in
rhythm, pitch, or accent in speech, is extracted through a specialized encoder.
We show that the combination of these two embeddings fed to a supervised binary
classifier allows the detection of deepfake speech generated with both
Text-to-Speech and Voice Conversion techniques. Our results show improvements
over the considered baselines, good generalization properties over multiple
datasets and robustness to audio compression.
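To make the described pipeline concrete, here is a minimal sketch (not the authors' code; the embedding sizes and classifier shape are assumptions) of concatenating a speaker embedding with a prosody embedding and feeding the result to a supervised binary classifier:

```python
# Minimal sketch of the described pipeline: concatenate a speaker
# embedding and a prosody embedding, then classify real vs. synthetic.
# Embedding sizes and the classifier shape are illustrative assumptions.
import torch
import torch.nn as nn

SPEAKER_DIM, PROSODY_DIM = 512, 128  # assumed embedding sizes

class SyntheticSpeechDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(SPEAKER_DIM + PROSODY_DIM, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, 1),  # logit: >0 -> synthetic, <0 -> bona fide
        )

    def forward(self, speaker_emb, prosody_emb):
        joint = torch.cat([speaker_emb, prosody_emb], dim=-1)
        return self.classifier(joint).squeeze(-1)

# Usage with placeholder embeddings (in practice these would come from
# a pretrained speaker-verification model and a prosody encoder).
detector = SyntheticSpeechDetector()
spk = torch.randn(4, SPEAKER_DIM)
pro = torch.randn(4, PROSODY_DIM)
logits = detector(spk, pro)
loss = nn.BCEWithLogitsLoss()(logits, torch.tensor([0., 1., 0., 1.]))
```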
Related papers
- Character-aware audio-visual subtitling in context [58.95580154761008]
This paper presents an improved framework for character-aware audio-visual subtitling in TV shows.
Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues.
We validate the method on a dataset with 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches.
arXiv Detail & Related papers (2024-10-14T20:27:34Z)
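As a hedged illustration of one plausible step in such a framework (the function and dimensions below are hypothetical, not from the paper), diarised segments can be assigned to named characters by comparing their speaker embeddings against per-character exemplars:

```python
# Hypothetical sketch: assign diarised speech segments to named
# characters via cosine similarity to per-character exemplar embeddings.
import numpy as np

def assign_characters(segment_embs, exemplar_embs, names):
    """segment_embs: (S, D); exemplar_embs: (C, D); names: list of C strings."""
    seg = segment_embs / np.linalg.norm(segment_embs, axis=1, keepdims=True)
    exe = exemplar_embs / np.linalg.norm(exemplar_embs, axis=1, keepdims=True)
    sim = seg @ exe.T                      # cosine similarity, shape (S, C)
    return [names[i] for i in sim.argmax(axis=1)]

rng = np.random.default_rng(0)
labels = assign_characters(rng.normal(size=(5, 192)),
                           rng.normal(size=(3, 192)),
                           ["Alice", "Bob", "Carol"])
```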
- DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding [52.84475402151201]
We present a vision-guided speaker embedding extractor that uses a self-supervised pre-trained model and a prompt tuning technique.
We further develop a diffusion-based video-to-speech synthesis model, called DiffV2S, conditioned on those speaker embeddings and the visual representation extracted from the input video.
Our experimental results show that DiffV2S achieves state-of-the-art performance compared to previous video-to-speech synthesis techniques.
arXiv Detail & Related papers (2023-08-15T14:07:41Z)
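A minimal sketch, assuming a generic diffusion denoiser (the module below is illustrative, not the DiffV2S architecture), of conditioning noise prediction on a speaker embedding and a visual representation:

```python
# Illustrative sketch of a conditioned diffusion denoiser: the speaker
# embedding and visual representation are projected and concatenated
# with the noisy mel input at every denoising step. Sizes are assumed.
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    def __init__(self, mel_dim=80, spk_dim=256, vis_dim=512, hidden=256):
        super().__init__()
        self.cond = nn.Linear(spk_dim + vis_dim, hidden)
        self.net = nn.Sequential(
            nn.Linear(mel_dim + hidden + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, mel_dim),  # predicts the added noise
        )

    def forward(self, noisy_mel, t, spk_emb, vis_emb):
        # t: diffusion timestep in [0, 1], broadcast per frame
        c = self.cond(torch.cat([spk_emb, vis_emb], dim=-1))
        c = c.unsqueeze(1).expand(-1, noisy_mel.size(1), -1)
        ts = t.view(-1, 1, 1).expand(-1, noisy_mel.size(1), 1)
        return self.net(torch.cat([noisy_mel, c, ts], dim=-1))

model = ConditionedDenoiser()
eps_hat = model(torch.randn(2, 100, 80), torch.rand(2),
                torch.randn(2, 256), torch.randn(2, 512))
```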
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- Speaker Anonymization with Phonetic Intermediate Representations [22.84840887071428]
We propose a speaker anonymization pipeline that generates speech conditioned on phonetic transcriptions and anonymized speaker embeddings.
Using phones as the intermediate representation ensures near-complete elimination of speaker identity information from the input.
arXiv Detail & Related papers (2022-07-11T13:02:08Z)
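For illustration, one common anonymization recipe (an assumption here, not necessarily this paper's exact method) replaces the source speaker embedding with an average of distant embeddings drawn from an external pool:

```python
# Sketch of a common anonymization strategy (assumed, not necessarily
# this paper's recipe): average the k embeddings from an external pool
# that are least similar to the source speaker's embedding.
import numpy as np

def anonymize_embedding(src_emb, pool, k=10):
    """src_emb: (D,); pool: (N, D) external speaker embeddings."""
    src = src_emb / np.linalg.norm(src_emb)
    pl = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    dist = 1.0 - pl @ src                   # cosine distance to source
    farthest = pool[np.argsort(dist)[-k:]]  # k least similar speakers
    anon = farthest.mean(axis=0)
    return anon / np.linalg.norm(anon)

rng = np.random.default_rng(1)
anon_emb = anonymize_embedding(rng.normal(size=256),
                               rng.normal(size=(200, 256)))
# The synthesis stage then conditions on (phone sequence, anon_emb).
```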
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
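A small sketch of inducing a pseudo language as summarized above, assuming k-means over self-supervised frame features (the cluster count and feature size are illustrative):

```python
# Sketch: cluster self-supervised frame features with k-means, then
# collapse consecutive repeats into a compact discrete token sequence
# that serves as the target of a pseudo speech recognition task.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
frame_feats = rng.normal(size=(500, 768))   # stand-in for wav2vec features

kmeans = KMeans(n_clusters=25, n_init=4, random_state=0).fit(frame_feats)
ids = kmeans.predict(frame_feats)           # one pseudo token per frame

# Deduplicate runs to obtain the shorter "pseudo text" sequence.
pseudo_text = [int(ids[0])] + [int(t) for a, t in zip(ids, ids[1:]) if t != a]
```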
- ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis [13.676243543864347]
We propose an end-to-end method that generates high-quality speech with better speaker similarity for both seen and unseen speakers.
The method consists of three separately trained components: a speaker encoder based on the state-of-the-art ECAPA-TDNN architecture, a FastSpeech2-based synthesizer, and a HiFi-GAN vocoder.
arXiv Detail & Related papers (2022-03-20T07:04:26Z)
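The three-stage flow can be sketched with placeholder callables standing in for the real, separately trained models (all function bodies below are stubs, not the actual networks):

```python
# Pipeline sketch: speaker encoder -> synthesizer -> vocoder.
# Every stage is a stub showing only the data flow and shapes.
import numpy as np

def speaker_encoder(reference_wav):          # ECAPA-TDNN stand-in
    # Ignores its input; a real encoder would embed the reference audio.
    return np.random.default_rng(0).normal(size=192)

def synthesizer(text, spk_emb):              # FastSpeech2 stand-in
    n_frames = 10 * len(text)
    return np.zeros((n_frames, 80))          # mel spectrogram

def vocoder(mel):                            # HiFi-GAN stand-in
    return np.zeros(mel.shape[0] * 256)      # waveform (hop size 256)

emb = speaker_encoder("reference.wav")
mel = synthesizer("hello world", emb)
wav = vocoder(mel)
```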
- Speech Resynthesis from Discrete Disentangled Self-Supervised Representations [49.48053138928408]
We propose using self-supervised discrete representations for the task of speech resynthesis.
We extract low-bitrate representations for speech content, prosodic information, and speaker identity.
Using the obtained representations, we reach a rate of 365 bits per second while providing better speech quality than the baseline methods.
arXiv Detail & Related papers (2021-04-01T09:20:33Z)
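As a back-of-the-envelope check on how a total in the few-hundred-bps range arises, here is an illustrative bitrate accounting (the token rates and codebook sizes are assumptions chosen only to land near the reported range, not the paper's settings):

```python
# Illustrative bitrate accounting for a discrete resynthesis stack.
# All rates and codebook sizes below are assumptions.
import math

def stream_bps(tokens_per_second, codebook_size):
    return tokens_per_second * math.log2(codebook_size)

content_bps = stream_bps(50, 100)   # content units: e.g. 50 Hz, 100 codes
prosody_bps = stream_bps(6.25, 32)  # coarse pitch/energy stream
speaker_bps = 0.0                   # one embedding per utterance -> ~0 bps

total = content_bps + prosody_bps + speaker_bps
print(f"{total:.0f} bits per second")   # ~363 bps with these assumptions
```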
- FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention [66.77490220410249]
We propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0.
FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance.
This approach is trained with a reconstruction loss only, without any explicit disentanglement of content and speaker information.
arXiv Detail & Related papers (2020-10-27T09:21:03Z)
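A minimal sketch of the fusion idea, assuming standard cross-attention (dimensions and module choices are illustrative, not FragmentVC's exact layers): source content features act as queries over target-speaker frames:

```python
# Sketch of cross-attention fusion: source-utterance content features
# attend over target-speaker frames, retrieving fine-grained fragments.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)

src_content = torch.randn(1, 120, 768)   # e.g. Wav2Vec 2.0 features (source)
tgt_frames = torch.randn(1, 300, 768)    # target-speaker utterance features

fused, weights = attn(query=src_content, key=tgt_frames, value=tgt_frames)
# `fused` follows the source's phonetic structure but is assembled from
# target-speaker fragments; a decoder would map it to a mel spectrogram.
```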
- From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint [11.982748481062542]
This paper presents a system with a feedback constraint for multispeaker speech synthesis.
We enhance knowledge transfer from speaker verification to speech synthesis by engaging the speaker verification network as a feedback constraint during training.
The model is trained and evaluated on publicly available datasets.
arXiv Detail & Related papers (2020-05-10T06:11:37Z)
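One way such a feedback constraint can look, sketched with placeholder networks (the pooling and loss weighting below are assumptions): a speaker-verification embedding of the synthesized speech is pulled toward the reference speaker's embedding:

```python
# Sketch of a speaker-verification feedback loss. The SV network here
# is a tiny stand-in; a real system would use a full pretrained model.
import torch
import torch.nn as nn

sv_net = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 192))

def feedback_loss(synth_mel, ref_mel, weight=1.0):
    # Mean-pool frames, then embed both the synthesized and reference mels.
    e_synth = sv_net(synth_mel.mean(dim=1))
    e_ref = sv_net(ref_mel.mean(dim=1)).detach()
    cos = nn.functional.cosine_similarity(e_synth, e_ref, dim=-1)
    return weight * (1.0 - cos).mean()

loss = feedback_loss(torch.randn(2, 100, 80), torch.randn(2, 120, 80))
# Total training loss would be: reconstruction_loss + feedback_loss.
```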
- Robust Speaker Recognition Using Speech Enhancement And Attention Model [37.33388614967888]
Instead of processing speech enhancement and speaker recognition separately, the two modules are integrated into one framework and jointly optimised using deep neural networks.
To increase robustness against noise, a multi-stage attention mechanism is employed to highlight speaker-related features learned from contextual information in the time and frequency domains.
The obtained results show that the proposed approach, using speech enhancement and multi-stage attention, outperforms two strong baselines that do not use them under most acoustic conditions in our experiments.
arXiv Detail & Related papers (2020-01-14T20:03:07Z)
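A hedged sketch of attention over both time and frequency (the exact staging below is assumed, not taken from the paper):

```python
# Sketch of two-stage attention: per-bin frequency weighting followed
# by per-frame time weighting, pooled into an utterance embedding.
import torch
import torch.nn as nn

class TimeFreqAttention(nn.Module):
    def __init__(self, n_freq=64):
        super().__init__()
        self.freq_score = nn.Linear(n_freq, n_freq)  # per-bin weights
        self.time_score = nn.Linear(n_freq, 1)       # per-frame weights

    def forward(self, spec):                          # spec: (B, T, F)
        freq_w = torch.sigmoid(self.freq_score(spec)) # stage 1: frequency
        x = spec * freq_w
        time_w = torch.softmax(self.time_score(x), dim=1)  # stage 2: time
        return (x * time_w).sum(dim=1)                # utterance embedding

emb = TimeFreqAttention()(torch.randn(3, 200, 64))    # -> (3, 64)
```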