ELF: Encoding Speaker-Specific Latent Speech Feature for Speech Synthesis
- URL: http://arxiv.org/abs/2311.11745v2
- Date: Fri, 31 May 2024 07:57:13 GMT
- Title: ELF: Encoding Speaker-Specific Latent Speech Feature for Speech Synthesis
- Authors: Jungil Kong, Junmo Lee, Jeongmin Kim, Beomjeong Kim, Jihoon Park, Dohee Kong, Changheon Lee, Sangjin Kim
- Abstract summary: We propose a novel method for modeling numerous speakers.
It expresses the overall characteristics of speakers in as much detail as a trained multi-speaker model, without additional training on the target speaker's dataset.
- Score: 5.824018496599849
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we propose a novel method for modeling numerous speakers that expresses the overall characteristics of speakers in as much detail as a trained multi-speaker model, without additional training on the target speaker's dataset. Although various works with similar goals have been actively studied, their performance has not yet reached that of trained multi-speaker models due to fundamental limitations. To overcome these limitations, we propose effective methods for learning features and for representing target speakers' speech characteristics by discretizing the features and conditioning a speech synthesis model on them. In subjective similarity evaluation, our method obtained a significantly higher similarity mean opinion score (SMOS) for unseen speakers than a high-performance multi-speaker model achieved for its seen speakers. The proposed method also outperforms a zero-shot method by significant margins. Furthermore, it shows remarkable performance in generating new artificial speakers. In addition, we demonstrate that the encoded latent features are sufficiently informative to reconstruct an original speaker's speech completely, implying that our method can serve as a general methodology for encoding and reconstructing speakers' characteristics in various tasks.
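As a rough illustration only (not the authors' implementation; the module names, dimensions, and nearest-neighbour codebook below are assumptions), a pipeline that encodes speaker features, discretizes them against a codebook, and conditions a synthesis model on the result might look like this:

```python
# Hypothetical sketch of "discretize speaker features, then condition the synthesizer".
import torch
import torch.nn as nn


class SpeakerFeatureEncoder(nn.Module):
    """Maps mel frames to frame-level latent speaker features (placeholder network)."""
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_mels, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, mels):                                  # (B, T, n_mels)
        return self.net(mels)                                 # (B, T, dim)


class FeatureDiscretizer(nn.Module):
    """Vector-quantizes latent features with a nearest-neighbour codebook."""
    def __init__(self, codebook_size=512, dim=256):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(codebook_size, dim))

    def forward(self, z):                                     # (B, T, dim)
        book = self.codebook.unsqueeze(0).expand(z.size(0), -1, -1)
        idx = torch.cdist(z, book).argmin(dim=-1)             # (B, T) discrete codes
        return self.codebook[idx], idx                        # quantized features + indices


class ConditionedSynthesizer(nn.Module):
    """Toy decoder conditioning text features on pooled discrete speaker codes."""
    def __init__(self, text_dim=192, cond_dim=256, n_mels=80):
        super().__init__()
        self.proj = nn.Linear(text_dim + cond_dim, n_mels)

    def forward(self, text_feats, speaker_codes):             # (B, L, text_dim), (B, T, cond_dim)
        cond = speaker_codes.mean(dim=1, keepdim=True)        # utterance-level condition
        cond = cond.expand(-1, text_feats.size(1), -1)
        return self.proj(torch.cat([text_feats, cond], dim=-1))


if __name__ == "__main__":
    mels, text = torch.randn(2, 120, 80), torch.randn(2, 40, 192)
    q, codes = FeatureDiscretizer()(SpeakerFeatureEncoder()(mels))
    print(ConditionedSynthesizer()(text, q).shape, codes.shape)
```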
Related papers
- Integrating Audio, Visual, and Semantic Information for Enhanced Multimodal Speaker Diarization [25.213694510527436]
Most existing speaker diarization systems rely exclusively on unimodal acoustic information.
We propose a novel multimodal approach that jointly utilizes audio, visual, and semantic cues to enhance speaker diarization.
Our approach consistently outperforms state-of-the-art speaker diarization methods.
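A minimal sketch of one way to combine the three modalities (not the paper's pipeline; the fusion weights, affinity inputs, and clustering threshold are assumptions): per-modality pairwise speaker affinities are fused and then clustered into speaker labels.

```python
# Hypothetical late-fusion of audio/visual/semantic affinities for diarization.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def fuse_and_cluster(aff_audio, aff_visual, aff_semantic,
                     weights=(0.6, 0.25, 0.15), threshold=0.5):
    """Fuse per-modality affinities in [0, 1] and cluster segments into speakers."""
    fused = (weights[0] * aff_audio + weights[1] * aff_visual + weights[2] * aff_semantic)
    dist = 1.0 - fused                                    # similarity -> distance
    np.fill_diagonal(dist, 0.0)
    z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(z, t=threshold, criterion="distance")  # speaker label per segment


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.uniform(size=(6, 6)); a = (a + a.T) / 2       # toy symmetric affinities
    print(fuse_and_cluster(a, a, a))
```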
arXiv Detail & Related papers (2024-08-22T03:34:03Z)
- We Need Variations in Speech Synthesis: Sub-center Modelling for Speaker Embeddings [47.2515056854372]
In speech synthesis, modeling the rich emotions and prosodic variations present in the human voice is crucial for synthesizing natural speech.
We propose a novel speaker embedding network which utilizes multiple class centers per speaker in the speaker classification training, rather than the single class center used by traditional embeddings.
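A hedged sketch of a sub-center classification head in this spirit (the number of sub-centers, scaling, and plain-softmax loss are assumptions; the paper's exact objective may differ): each speaker owns K centers and the logit is the best-matching sub-center's cosine similarity.

```python
# Hypothetical sub-center speaker-classification head.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SubCenterHead(nn.Module):
    def __init__(self, emb_dim=192, n_speakers=1000, k=3, scale=30.0):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_speakers, k, emb_dim))
        self.scale = scale

    def forward(self, emb):                                   # (B, emb_dim)
        emb = F.normalize(emb, dim=-1)
        centers = F.normalize(self.centers, dim=-1)           # (S, K, D)
        cos = torch.einsum("bd,skd->bsk", emb, centers)       # (B, S, K)
        return cos.max(dim=-1).values * self.scale            # best sub-center per speaker


if __name__ == "__main__":
    head = SubCenterHead()
    emb, labels = torch.randn(8, 192), torch.randint(0, 1000, (8,))
    print(float(F.cross_entropy(head(emb), labels)))
```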
arXiv Detail & Related papers (2024-07-05T06:54:24Z) - Speaker Verification in Agent-Generated Conversations [47.6291644653831]
The recent success of large language models (LLMs) has attracted widespread interest in developing role-playing conversational agents personalized to the characteristics and styles of different speakers, so as to enhance their ability to perform both general- and special-purpose dialogue tasks.
This study introduces a novel evaluation challenge: speaker verification in agent-generated conversations, which aims to verify whether two sets of utterances originate from the same speaker.
arXiv Detail & Related papers (2024-05-16T14:46:18Z)
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate the resulting pairwise constraints into the speaker diarization pipeline.
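As a rough illustration only (this shows direct injection of semantic must-link/cannot-link constraints into an affinity matrix, not the propagation scheme the paper describes; function and parameter names are hypothetical):

```python
# Hypothetical injection of pairwise constraints (e.g. from SLU cues) before clustering.
import numpy as np


def apply_pairwise_constraints(affinity, must_link, cannot_link, strength=1.0):
    """Push constrained segment pairs toward similarity 1 (must-link) or 0 (cannot-link)."""
    a = affinity.copy()
    for i, j in must_link:
        a[i, j] = a[j, i] = (1 - strength) * a[i, j] + strength * 1.0
    for i, j in cannot_link:
        a[i, j] = a[j, i] = (1 - strength) * a[i, j] + strength * 0.0
    return a


if __name__ == "__main__":
    aff = np.full((4, 4), 0.5)
    print(apply_pairwise_constraints(aff, must_link=[(0, 1)], cannot_link=[(2, 3)]))
```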
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- Self supervised learning for robust voice cloning [3.7989740031754806]
We use features learned in a self-supervised framework to produce high-quality speech representations.
The learned features are used as pre-trained utterance-level embeddings and as inputs to a Non-Attentive Tacotron-based architecture.
This method enables us to train our model on an unlabeled multi-speaker dataset and to use unseen speaker embeddings to copy a speaker's voice.
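One hedged way to obtain such an utterance-level embedding (the choice of wav2vec 2.0 via torchaudio and simple mean pooling are assumptions for illustration, not the paper's exact setup):

```python
# Hypothetical utterance embedding from a self-supervised speech model.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE          # assumed SSL model choice
model = bundle.get_model().eval()


def utterance_embedding(waveform: torch.Tensor, sample_rate: int) -> torch.Tensor:
    """Mean-pool frame-level SSL features into a fixed-size embedding.

    waveform: tensor of shape (1, num_samples).
    """
    if sample_rate != bundle.sample_rate:
        waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)
    with torch.inference_mode():
        features, _ = model.extract_features(waveform)  # list of per-layer (1, T, D)
    return features[-1].mean(dim=1).squeeze(0)          # (D,) utterance-level embedding
```

Such an embedding would then be fed to the TTS decoder (here, a Non-Attentive Tacotron-style model) as the speaker condition.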
arXiv Detail & Related papers (2022-04-07T13:05:24Z)
- Speaker Adaption with Intuitive Prosodic Features for Statistical Parametric Speech Synthesis [50.5027550591763]
We propose a method of speaker adaption with intuitive prosodic features for statistical parametric speech synthesis.
The intuitive prosodic features are extracted at the utterance level or the speaker level, and are further integrated into existing speaker-encoding-based and speaker-embedding-based adaptation frameworks, respectively.
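An illustrative sketch of utterance-level "intuitive" prosodic features (the specific feature set and the librosa-based extraction are assumptions, not the paper's recipe):

```python
# Hypothetical utterance-level prosodic feature vector (pitch, energy, voicing).
import numpy as np
import librosa


def utterance_prosody_features(wav: np.ndarray, sr: int) -> np.ndarray:
    f0, voiced, _ = librosa.pyin(wav, fmin=librosa.note_to_hz("C2"),
                                 fmax=librosa.note_to_hz("C7"), sr=sr)
    f0 = f0[~np.isnan(f0)]                       # keep voiced frames only
    rms = librosa.feature.rms(y=wav)[0]
    return np.array([
        f0.mean() if f0.size else 0.0,           # average pitch
        f0.std() if f0.size else 0.0,            # pitch variability
        rms.mean(),                              # average energy
        voiced.mean(),                           # rough voicing ratio
    ], dtype=np.float32)
```

Such a vector could, for example, be concatenated with a learned speaker embedding before being fed to the acoustic model, which is one way of integrating it into an embedding-based adaptation framework.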
arXiv Detail & Related papers (2022-03-02T09:00:31Z)
- GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis [6.632254395574993]
GANSpeech is a high-fidelity multi-speaker TTS model that applies adversarial training to a non-autoregressive multi-speaker TTS model.
In the subjective listening tests, GANSpeech significantly outperformed the baseline multi-speaker FastSpeech and FastSpeech2 models.
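A hedged sketch of adding adversarial training to a non-autoregressive acoustic model in this spirit (the discriminator architecture, LSGAN-style losses, and weights are placeholders, not GANSpeech's):

```python
# Hypothetical mel-spectrogram discriminator and GAN losses for TTS training.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MelDiscriminator(nn.Module):
    def __init__(self, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 128, 5, padding=2), nn.LeakyReLU(0.2),
            nn.Conv1d(128, 128, 5, padding=2), nn.LeakyReLU(0.2),
            nn.Conv1d(128, 1, 3, padding=1),
        )

    def forward(self, mel):                        # (B, T, n_mels)
        return self.net(mel.transpose(1, 2))       # (B, 1, T) realness scores


def generator_step(pred_mel, real_mel, disc, adv_weight=0.1):
    recon = F.l1_loss(pred_mel, real_mel)                       # usual reconstruction loss
    score_fake = disc(pred_mel)
    adv = F.mse_loss(score_fake, torch.ones_like(score_fake))   # LSGAN-style adversarial term
    return recon + adv_weight * adv


def discriminator_step(pred_mel, real_mel, disc):
    score_real, score_fake = disc(real_mel), disc(pred_mel.detach())
    return (F.mse_loss(score_real, torch.ones_like(score_real))
            + F.mse_loss(score_fake, torch.zeros_like(score_fake)))
```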
arXiv Detail & Related papers (2021-06-29T08:15:30Z)
- Filling the Gap of Utterance-aware and Speaker-aware Representation for Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In existing retrieval-based multi-turn dialogue modeling, pre-trained language models (PrLMs) used as encoders represent the dialogues only coarsely.
We propose a novel model that fills this gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
arXiv Detail & Related papers (2020-09-14T15:07:19Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from untranscribed audio via the proposed encoder-decoder framework with a discrete speech representation.
We found that the model benefits from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
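A conceptual sketch of this kind of setup (assumptions throughout; length regulation, the text encoder, and the exact vector-quantization scheme are omitted or simplified): a decoder is shared between a paired text-to-mel branch and an unpaired branch that learns through a discrete speech representation, so untranscribed audio still provides a training signal.

```python
# Hypothetical shared-decoder setup for semi-supervised multi-speaker TTS.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DiscreteSpeechAutoencoder(nn.Module):
    def __init__(self, n_mels=80, dim=256, n_codes=256):
        super().__init__()
        self.encoder = nn.Linear(n_mels, dim)
        self.codebook = nn.Embedding(n_codes, dim)
        self.decoder = nn.Linear(dim, n_mels)             # shared with the paired TTS branch

    def quantize(self, h):                                # toy nearest-neighbour VQ
        book = self.codebook.weight.unsqueeze(0).expand(h.size(0), -1, -1)
        return self.codebook(torch.cdist(h, book).argmin(dim=-1))

    def unpaired_loss(self, mel):                         # untranscribed-audio branch
        q = self.quantize(self.encoder(mel))
        return F.l1_loss(self.decoder(q), mel)

    def paired_loss(self, text_hidden, mel):              # transcribed branch
        # assumes text_hidden is already aligned to the mel frames (length regulation omitted)
        return F.l1_loss(self.decoder(text_hidden), mel)
```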
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
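A minimal sketch of the self-adaptation idea (layer sizes, the pooling choice, and the masking head are assumptions, not the paper's network): a speaker/utterance summary is pooled from the noisy test input itself and used to condition a multi-head self-attention mask estimator.

```python
# Hypothetical self-adaptive masking network for speech enhancement.
import torch
import torch.nn as nn


class SelfAdaptiveEnhancer(nn.Module):
    def __init__(self, n_freq=257, dim=256, heads=4):
        super().__init__()
        self.inp = nn.Linear(n_freq, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cond = nn.Linear(dim, dim)
        self.mask = nn.Sequential(nn.Linear(dim, n_freq), nn.Sigmoid())

    def forward(self, noisy_spec):                        # (B, T, n_freq) magnitudes
        h = self.inp(noisy_spec)
        spk = self.cond(h.mean(dim=1, keepdim=True))      # summary pooled from the test utterance
        h = h + spk                                       # self-adaptation via conditioning
        h, _ = self.attn(h, h, h)                         # multi-head self-attention
        return noisy_spec * self.mask(h)                  # masked (enhanced) spectrogram


if __name__ == "__main__":
    print(SelfAdaptiveEnhancer()(torch.randn(2, 100, 257).abs()).shape)
```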
arXiv Detail & Related papers (2020-02-14T05:05:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.