An analysis on the effects of speaker embedding choice in non auto-regressive TTS
- URL: http://arxiv.org/abs/2307.09898v1
- Date: Wed, 19 Jul 2023 10:57:54 GMT
- Title: An analysis on the effects of speaker embedding choice in non auto-regressive TTS
- Authors: Adriana Stan and Johannah O'Mahony
- Abstract summary: We introduce a first attempt at understanding how a non-autoregressive factorised multi-speaker speech synthesis architecture exploits the information present in different speaker embedding sets.
We show that, regardless of the used set of embeddings and learning strategy, the network can handle various speaker identities equally well.
- Score: 4.619541348328938
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this paper we introduce a first attempt at understanding how a
non-autoregressive factorised multi-speaker speech synthesis architecture
exploits the information present in different speaker embedding sets. We
analyse whether jointly learning the representations, or initialising them from
pretrained models, yields any quality improvements for the target speaker
identities. In a separate analysis, we investigate how the different sets of
embeddings impact the network's core speech abstraction (i.e. the
zero-conditioned output) in terms of speaker identity and representation
learning. We show that, regardless of the embedding set and learning strategy
used, the network handles all speaker identities equally well, with barely
noticeable variations in speech output quality, and that speaker leakage into
the core structure of the synthesis system is inevitable under the standard
training procedures adopted thus far.
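The conditioning scheme the abstract describes can be sketched with a toy example. This is a minimal illustration, not the authors' architecture: the additive injection, the projection matrix `W_spk`, and all dimensions are assumptions chosen for clarity. It shows how a speaker embedding modulates the decoder's frame-level hidden states, and why a zero embedding ("zero conditioning") exposes the network's core, speaker-agnostic speech abstraction.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 16   # speaker embedding size (hypothetical)
HID_DIM = 32   # decoder hidden size (hypothetical)

# Toy projection that injects the speaker embedding into the
# decoder's hidden states (one common additive conditioning scheme).
W_spk = rng.normal(size=(EMB_DIM, HID_DIM)) * 0.1

def condition(hidden, spk_emb):
    """Add a projected speaker embedding to every decoder frame."""
    return hidden + spk_emb @ W_spk

T = 5  # number of frames
hidden = rng.normal(size=(T, HID_DIM))   # core speech abstraction

spk_emb = rng.normal(size=EMB_DIM)       # embedding of a real speaker
zero_emb = np.zeros(EMB_DIM)             # "zero conditioning"

out_spk = condition(hidden, spk_emb)     # speaker-specific output
out_core = condition(hidden, zero_emb)   # core (speaker-agnostic) output

# With a zero embedding the additive conditioning is a no-op,
# so the output equals the core representation.
assert np.allclose(out_core, hidden)
```

Under this additive scheme, probing the zero-conditioned output is exactly probing `hidden`, which is why any speaker information still recoverable there counts as leakage into the core structure.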
Related papers
- Self-Supervised Disentangled Representation Learning for Robust Target
Speech Extraction [18.63245027392657]
Speech signals are inherently complex as they encompass both global acoustic characteristics and local semantic information.
In the task of target speech extraction, certain elements of global and local semantic information in the reference speech can lead to speaker confusion.
We propose a self-supervised disentangled representation learning method to overcome this challenge.
arXiv Detail & Related papers (2023-12-16T03:48:24Z)
- Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading [73.59525356467574]
A speaker's own characteristics can be portrayed well by a few of his/her facial images, or even a single image, using shallow networks, whereas the fine-grained dynamic features associated with the speech content expressed by the talking face require deep sequential networks.
Our approach consistently outperforms existing methods.
arXiv Detail & Related papers (2023-10-08T07:48:25Z) - Improving Speaker Diarization using Semantic Information: Joint Pairwise
Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z) - Revisiting Conversation Discourse for Dialogue Disentanglement [88.3386821205896]
We propose enhancing dialogue disentanglement by taking full advantage of the dialogue discourse characteristics.
We develop a structure-aware framework to integrate the rich structural features for better modeling the conversational semantic context.
Our work has great potential to facilitate broader multi-party multi-thread dialogue applications.
arXiv Detail & Related papers (2023-06-06T19:17:47Z) - Exploring Speaker-Related Information in Spoken Language Understanding
for Better Speaker Diarization [7.673971221635779]
We propose methods to extract speaker-related information from semantic content in multi-party meetings.
Experiments on both AISHELL-4 and AliMeeting datasets show that our method achieves consistent improvements over acoustic-only speaker diarization systems.
arXiv Detail & Related papers (2023-05-22T11:14:19Z) - Self-supervised Fine-tuning for Improved Content Representations by
Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
- Pre-Finetuning for Few-Shot Emotional Speech Recognition [61.463533069294414]
We view speaker adaptation as a few-shot learning problem.
We propose pre-finetuning speech models on difficult tasks to distill knowledge into few-shot downstream classification objectives.
arXiv Detail & Related papers (2023-02-24T22:38:54Z)
- Residual Information in Deep Speaker Embedding Architectures [4.619541348328938]
This paper presents an analysis of six sets of speaker embeddings extracted with some of the most recent and high-performing DNN architectures.
The dataset includes 46 speakers uttering the same set of prompts, recorded in either a professional studio or their home environments.
The results show that the discriminative power of the analyzed embeddings is very high, yet across all the analyzed architectures, residual information is still present in the representations.
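A residual-information finding of this kind is typically demonstrated with a linear probe: if a simple classifier can predict a non-speaker attribute (e.g. studio vs. home recording) from the embeddings, residual information is present. The sketch below is a toy illustration on synthetic data; the embeddings, the channel offset, and the probe are all assumptions, not the paper's actual experiment.

```python
import numpy as np

rng = np.random.default_rng(1)

N, D = 200, 32
# Synthetic "speaker embeddings": half simulate studio recordings,
# half home recordings, with a small channel offset leaking into one
# direction of the embedding space (the residual information).
labels = np.repeat([0.0, 1.0], N // 2)      # 0 = studio, 1 = home
channel_dir = rng.normal(size=D)
channel_dir /= np.linalg.norm(channel_dir)
emb = rng.normal(size=(N, D)) + 0.8 * labels[:, None] * channel_dir

# A linear probe: logistic regression trained by gradient descent.
w, b = np.zeros(D), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(emb @ w + b)))
    w -= 0.5 * (emb.T @ (p - labels) / N)
    b -= 0.5 * np.mean(p - labels)

pred = (1.0 / (1.0 + np.exp(-(emb @ w + b)))) > 0.5
acc = np.mean(pred == labels)
# Above-chance accuracy indicates residual (non-speaker) information.
print(f"probe accuracy: {acc:.2f}")
```

If the embeddings encoded only speaker identity, the probe would stay near chance (0.5); accuracy well above that is the signature of residual information.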
arXiv Detail & Related papers (2023-02-06T12:37:57Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences arising from its use.