Speech to Speech Synthesis for Voice Impersonation
- URL: http://arxiv.org/abs/2602.16721v1
- Date: Fri, 13 Feb 2026 01:22:25 GMT
- Title: Speech to Speech Synthesis for Voice Impersonation
- Authors: Bjorn Johnson, Jared Levy
- Abstract summary: We propose the Speech to Speech Synthesis Network (STSSN), a model based on current state-of-the-art systems. We show that our proposed model is quite powerful and succeeds in generating realistic audio samples.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Numerous models have shown great success in the fields of speech recognition and speech synthesis, but models for speech-to-speech processing have not been heavily explored. We propose the Speech to Speech Synthesis Network (STSSN), a model based on current state-of-the-art systems that fuses the two disciplines in order to perform effective speech-to-speech style transfer for the purpose of voice impersonation. We show that our proposed model is quite powerful and succeeds in generating realistic audio samples despite a number of limitations in its capacity. We benchmark our proposed model against a generative adversarial model that accomplishes a similar task, and show that ours produces more convincing results.
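The abstract describes a cascade that fuses a recognition-style content encoder with a synthesis stage conditioned on a target voice. As a rough illustration of that idea only (this is not the paper's actual STSSN architecture; every name, dimension, and operation below is a hypothetical stand-in), a content/speaker factorization pipeline might be sketched as:

```python
# Hypothetical sketch of a cascaded speech-to-speech style-transfer pipeline.
# Learned networks are replaced by fixed random linear maps for illustration.
import numpy as np

rng = np.random.default_rng(0)

CONTENT_DIM, SPEAKER_DIM, FEAT_DIM = 64, 16, 80

# Stand-ins for trained encoder/decoder weights.
W_content = rng.standard_normal((FEAT_DIM, CONTENT_DIM)) * 0.1
W_speaker = rng.standard_normal((FEAT_DIM, SPEAKER_DIM)) * 0.1
W_decoder = rng.standard_normal((CONTENT_DIM + SPEAKER_DIM, FEAT_DIM)) * 0.1

def encode_content(frames):
    """Map source spectrogram frames (T, FEAT_DIM) to 'what was said' features."""
    return np.tanh(frames @ W_content)

def encode_speaker(frames):
    """Pool target-speaker frames into a single 'who is speaking' embedding."""
    return np.tanh(frames @ W_speaker).mean(axis=0)

def decode(content, speaker):
    """Recombine source content with the target speaker embedding into frames."""
    speaker_tiled = np.broadcast_to(speaker, (content.shape[0], SPEAKER_DIM))
    return np.concatenate([content, speaker_tiled], axis=1) @ W_decoder

source = rng.standard_normal((120, FEAT_DIM))   # utterance to convert
target = rng.standard_normal((200, FEAT_DIM))   # reference audio from target speaker

converted = decode(encode_content(source), encode_speaker(target))
print(converted.shape)  # (120, 80): source timing preserved, voice identity swapped
```

The key design point this sketch captures is the factorization: linguistic content comes only from the source utterance, while the speaker embedding comes only from the target reference, so decoding their combination yields the source words in the target voice.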
Related papers
- Alternating Approach-Putt Models for Multi-Stage Speech Enhancement [2.5016653845378722]
We propose a post-processing neural network designed to mitigate artifacts introduced by speech enhancement models.
We demonstrate that alternating between a speech enhancement model and the proposed Putt model leads to improved speech quality.
arXiv Detail & Related papers (2025-08-14T08:18:42Z)
- Incorporating Talker Identity Aids With Improving Speech Recognition in Adversarial Environments [0.2916558661202724]
We develop a transformer-based model that jointly performs speech recognition and speaker identification.
We show that the joint model performs comparably to Whisper under clean conditions.
Our results suggest that integrating voice representations with speech recognition can lead to more robust models under adversarial conditions.
arXiv Detail & Related papers (2024-10-07T18:39:59Z)
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- Leveraging the Interplay Between Syntactic and Acoustic Cues for Optimizing Korean TTS Pause Formation [6.225927189801006]
We propose a novel framework that incorporates comprehensive modeling of both syntactic and acoustic cues that are associated with pausing patterns.
Remarkably, our framework can consistently generate natural speech even for considerably longer and more intricate out-of-domain (OOD) sentences.
arXiv Detail & Related papers (2024-04-03T09:17:38Z)
- uSee: Unified Speech Enhancement and Editing with Conditional Diffusion Models [57.71199494492223]
We propose a Unified Speech Enhancement and Editing (uSee) model with conditional diffusion models to handle various tasks at the same time in a generative manner.
Our experiments show that our proposed uSee model can achieve superior performance in both speech denoising and dereverberation compared to other related generative speech enhancement models.
arXiv Detail & Related papers (2023-10-02T04:36:39Z)
- TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture.
We show that our model achieves excellent separation performance, both with and without transcript conditioning.
We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
arXiv Detail & Related papers (2023-08-21T01:52:01Z)
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- The Ability of Self-Supervised Speech Models for Audio Representations [53.19715501273934]
Self-supervised learning (SSL) speech models have achieved unprecedented success in speech representation learning.
We conduct extensive experiments on abundant speech and non-speech audio datasets to evaluate the representation ability of state-of-the-art SSL speech models.
Results show that SSL speech models can extract meaningful features from a wide range of non-speech audio, though they may also fail on certain types of datasets.
arXiv Detail & Related papers (2022-09-26T15:21:06Z)
- Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling [13.956691231452336]
FaST-VGS is a Transformer-based model that learns to associate raw speech waveforms with semantically related images.
FaST-VGS+ is learned in a multi-task fashion with a masked language modeling objective.
We show that our models perform competitively on the ABX task, outperform all other concurrent submissions on the Syntactic and Semantic tasks, and nearly match the best system on the Lexical task.
arXiv Detail & Related papers (2022-02-07T22:09:54Z)
- Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With Style-Adaptive Layer Normalization (SALN), our model effectively synthesizes speech in the style of the target speaker even from a single speech audio sample.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
arXiv Detail & Related papers (2021-06-06T15:34:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.