Learning utterance-level representations through token-level acoustic
latents prediction for Expressive Speech Synthesis
- URL: http://arxiv.org/abs/2211.00523v1
- Date: Tue, 1 Nov 2022 15:17:25 GMT
- Title: Learning utterance-level representations through token-level acoustic
latents prediction for Expressive Speech Synthesis
- Authors: Karolos Nikitaras, Konstantinos Klapsas, Nikolaos Ellinas, Georgia
Maniati, June Sig Sung, Inchul Hwang, Spyros Raptis, Aimilios Chalamandaris,
Pirros Tsiakoulis
- Abstract summary: We show that the fine-grained latent space also captures coarse-grained information, which becomes more evident as the dimension of the latent space increases in order to capture diverse prosodic representations.
We alleviate this issue by first capturing rich speech attributes in a token-level latent space and then separately training a prior network that, given the input text, learns utterance-level representations to predict the phoneme-level posterior latents extracted in the first step.
- Score: 3.691712391306624
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes an Expressive Speech Synthesis model that utilizes
token-level latent prosodic variables in order to capture and control
utterance-level attributes, such as character acting voice and speaking style.
Current works aim to explicitly factorize such fine-grained and
utterance-level speech attributes into different representations extracted by
modules that operate at the corresponding level. We show that the fine-grained
latent space also captures coarse-grained information, which becomes more
evident as the dimension of the latent space increases in order to capture
diverse prosodic representations. A trade-off therefore arises between the
diversity of the token-level and utterance-level representations and their
disentanglement. We alleviate this issue by first capturing rich speech
attributes in a token-level latent space and then separately training a prior
network that, given the input text, learns utterance-level representations in
order to predict the phoneme-level posterior latents extracted in the first
step. Both qualitative and quantitative evaluations demonstrate the
effectiveness of the proposed approach. Audio samples are available on our
demo page.
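The two-stage recipe above lends itself to a compact sketch. The following PyTorch code is a minimal illustration under assumed module names, dimensions, and pooling choices, not the authors' implementation: stage 1 compresses reference audio into one latent per phoneme token, and stage 2, trained separately, pools text embeddings into an utterance-level vector and regresses the stage-1 posteriors.

```python
import torch
import torch.nn as nn

class TokenLevelPosterior(nn.Module):
    """Stage 1 (sketch): extract one latent vector per phoneme token from
    reference mel frames, capturing fine-grained prosodic attributes."""
    def __init__(self, mel_dim=80, hidden=64, latent_dim=8):
        super().__init__()
        self.encoder = nn.GRU(mel_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, latent_dim)

    def forward(self, mel_per_token):
        # mel_per_token: (batch, tokens, frames, mel_dim)
        b, t, f, d = mel_per_token.shape
        out, _ = self.encoder(mel_per_token.reshape(b * t, f, d))
        return self.proj(out[:, -1]).reshape(b, t, -1)  # (batch, tokens, latent_dim)

class UtterancePrior(nn.Module):
    """Stage 2 (sketch): from text alone, pool an utterance-level vector and
    use it to predict the phoneme-level posterior latents from stage 1."""
    def __init__(self, text_dim=256, utt_dim=32, hidden=64, latent_dim=8):
        super().__init__()
        self.utt = nn.Linear(text_dim, utt_dim)
        self.decoder = nn.GRU(text_dim + utt_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, latent_dim)

    def forward(self, text_emb):
        # text_emb: (batch, tokens, text_dim)
        u = torch.tanh(self.utt(text_emb.mean(dim=1)))    # utterance-level vector
        u_rep = u.unsqueeze(1).expand(-1, text_emb.size(1), -1)
        h, _ = self.decoder(torch.cat([text_emb, u_rep], dim=-1))
        return self.out(h)                                # predicted token-level latents

# Stage-2 training signal (sketch): regress onto the frozen stage-1 posteriors.
# loss = nn.functional.mse_loss(prior(text_emb), posterior_latents.detach())
```

Because stage 2 is fit against fixed posteriors, the utterance-level space can absorb coarse attributes without forcing the token-level space to trade diversity for disentanglement.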
Related papers
- STAB: Speech Tokenizer Assessment Benchmark [57.45234921100835]
Representing speech as discrete tokens provides a framework for transforming speech into a format that closely resembles text.
We present STAB (Speech Tokenizer Assessment Benchmark), a systematic evaluation framework designed to assess speech tokenizers comprehensively.
We evaluate the STAB metrics and correlate them with downstream task performance across a range of speech tasks and tokenizer choices.
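The correlation step that STAB describes reduces to a few lines; a minimal sketch, with invented placeholder scores rather than numbers from the paper:

```python
from scipy.stats import spearmanr

# Hypothetical STAB-style metric and downstream scores for five tokenizers
# (placeholder values, not results from the paper).
stab_metric = [0.61, 0.72, 0.55, 0.80, 0.68]
downstream  = [0.58, 0.70, 0.51, 0.77, 0.69]

rho, p = spearmanr(stab_metric, downstream)
print(f"rank correlation = {rho:.2f} (p = {p:.3f})")  # how predictive the metric is
```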
arXiv Detail & Related papers (2024-09-04T02:20:59Z)
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
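A rough sketch of the token-extraction idea, with placeholder feature and codebook sizes: frame-level ASR-encoder outputs are snapped to their nearest codebook entries, and the resulting ids are the discrete semantic tokens consumed by the text-to-token LLM.

```python
import torch

def quantize_to_tokens(encoder_feats: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Nearest-neighbour lookup of frame features in a learned codebook (sketch).
    encoder_feats: (frames, dim); codebook: (num_codes, dim)."""
    dists = torch.cdist(encoder_feats, codebook)  # (frames, num_codes)
    return dists.argmin(dim=-1)                   # one semantic token id per frame

# Hypothetical shapes: 200 encoder frames, a 4096-entry codebook.
tokens = quantize_to_tokens(torch.randn(200, 512), torch.randn(4096, 512))
```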
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
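One way to picture the integration step is to bias a segment-similarity matrix with semantic must-link and cannot-link pairs before the clustering stage. The sketch below is a generic stand-in for this idea, not the paper's propagation algorithm:

```python
import numpy as np

def apply_pairwise_constraints(sim, must_link, cannot_link):
    """Bias an (n, n) affinity matrix with semantic constraints (sketch).
    must_link / cannot_link: lists of (i, j) segment index pairs."""
    sim = sim.copy()
    for i, j in must_link:
        sim[i, j] = sim[j, i] = 1.0  # semantic cue: same speaker
    for i, j in cannot_link:
        sim[i, j] = sim[j, i] = 0.0  # semantic cue: different speakers
    return sim

# The constrained matrix then feeds the usual spectral/agglomerative clustering.
```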
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model [13.572330725278066]
A novel point of the proposed method is the direct use of an SSL model, trained on a large amount of data, to obtain embedding vectors from speech representations.
The disentangled embeddings enable better reproduction performance for unseen speakers and rhythm transfer conditioned on different utterances.
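Generically, the conditioning step can be sketched as pooling frame-level SSL features into a single utterance embedding for the synthesizer; `ssl_model` below is a placeholder for any pretrained encoder, not the specific model used in the paper.

```python
import torch
import torch.nn as nn

def utterance_embedding(ssl_model: nn.Module, wav: torch.Tensor) -> torch.Tensor:
    """Pool frame-level SSL features into one conditioning vector (sketch).
    Assumes ssl_model maps a waveform batch to (batch, frames, dim) features."""
    with torch.no_grad():          # SSL encoder stays frozen
        feats = ssl_model(wav)     # (batch, frames, dim)
    return feats.mean(dim=1)       # (batch, dim) utterance embedding
```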
arXiv Detail & Related papers (2023-04-24T10:15:58Z)
- SPADE: Self-supervised Pretraining for Acoustic DisEntanglement [2.294014185517203]
We introduce a self-supervised approach to disentangle room acoustics from speech.
Our results demonstrate that our proposed approach significantly improves performance over a baseline when labeled training data is scarce.
arXiv Detail & Related papers (2023-02-03T01:36:38Z)
- Controllable speech synthesis by learning discrete phoneme-level prosodic representations [53.926969174260705]
We present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels.
We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset.
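A minimal sketch of such a prosodic clustering step, assuming k-means over per-phoneme F0/duration features; the feature layout and cluster count are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# One row per phoneme, e.g. [mean log-F0, duration in frames], gathered
# across a multispeaker corpus (random placeholders here).
feats = np.random.rand(10000, 2)

kmeans = KMeans(n_clusters=15, n_init=10, random_state=0).fit(feats)
labels = kmeans.labels_  # a discrete prosodic label per phoneme
# At synthesis time, picking a label selects its cluster centroid as the
# F0/duration target, which is what makes the control intuitive.
```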
arXiv Detail & Related papers (2022-11-29T15:43:36Z)
- Cognitive Coding of Speech [6.396288020763143]
We propose an approach for cognitive coding of speech by unsupervised extraction of contextual representations in two hierarchical levels of abstraction.
This decomposition is achieved by a two-stage neural network, with a lower and an upper stage operating at different time scales.
With an application in speech compression in mind, we investigate the effect of dimensionality reduction and low-bitrate quantization on the extracted representations.
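The two-time-scale decomposition might be sketched as follows, with the upper stage consuming a downsampled version of the lower stage's output; the layer types and the 16x factor are assumptions, not the paper's network:

```python
import torch.nn as nn

class TwoScaleEncoder(nn.Module):
    """Two levels of abstraction at different time scales (sketch): the
    lower stage keeps the frame rate, the upper stage sees a 16x coarser
    sequence, so its representations capture slower contextual structure."""
    def __init__(self, in_dim=80, low_dim=64, high_dim=16):
        super().__init__()
        self.lower = nn.GRU(in_dim, low_dim, batch_first=True)
        self.pool = nn.AvgPool1d(kernel_size=16, stride=16)
        self.upper = nn.GRU(low_dim, high_dim, batch_first=True)

    def forward(self, x):                     # x: (batch, frames, in_dim)
        low, _ = self.lower(x)                # frame-rate representations
        coarse = self.pool(low.transpose(1, 2)).transpose(1, 2)
        high, _ = self.upper(coarse)          # slower, more abstract codes
        return low, high
```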
arXiv Detail & Related papers (2021-10-08T16:49:16Z)
- Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent spaces during training, we are able to obtain a latent linguistic embedding.
Our experiments show that the voice cloning system built with vector quantization suffers only a small degradation in perceptual evaluations.
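A minimal VQ-VAE-style layer illustrates the kind of quantized latent linguistic embedding investigated here; the straight-through estimator and commitment loss are the standard recipe, and all sizes are illustrative:

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal VQ layer (sketch): snap each latent to its nearest codebook
    entry, with a straight-through gradient and a commitment loss."""
    def __init__(self, num_codes=256, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):                                  # z: (batch, steps, dim)
        flat = z.reshape(-1, z.size(-1))
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=-1)
        q = self.codebook(idx).reshape(z.shape)            # quantized latents
        loss = ((q - z.detach()) ** 2).mean() \
             + self.beta * ((z - q.detach()) ** 2).mean()
        q = z + (q - z).detach()                           # straight-through estimator
        return q, idx.reshape(z.shape[:-1]), loss
```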
arXiv Detail & Related papers (2021-06-25T07:51:35Z)
- Deep Learning for Prominence Detection in Children's Read Speech [13.041607703862724]
We consider a labeled dataset of children's reading recordings for the speaker-independent detection of prominent words.
A previous well-tuned random forest ensemble predictor is replaced by an RNN sequence model to exploit potential context dependency.
Deep learning is applied to obtain word-level features from low-level acoustic contours of fundamental frequency, intensity and spectral shape.
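The substitution can be sketched as a bidirectional recurrent tagger over word-level acoustic features, so each word's prediction can draw on its neighbours; the feature dimension below is a placeholder:

```python
import torch
import torch.nn as nn

class ProminenceTagger(nn.Module):
    """Sketch: a bidirectional GRU scores each word as prominent or not,
    exploiting the context a per-word random forest cannot see."""
    def __init__(self, feat_dim=24, hidden=32):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, word_feats):                        # (batch, words, feat_dim)
        h, _ = self.rnn(word_feats)
        return torch.sigmoid(self.head(h)).squeeze(-1)    # prominence prob. per word
```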
arXiv Detail & Related papers (2021-04-12T14:15:08Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
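The two-stream idea can be sketched as a shared trunk that splits into two heads, giving an explicit point at which identity and content factors separate; the layer sizes are illustrative and the real model operates on faces and audio:

```python
import torch.nn as nn

class TwoStream(nn.Module):
    """Sketch: shared low-level features feed two heads so one stream can
    specialize in speaker identity and the other in linguistic content."""
    def __init__(self, in_dim=80, trunk_dim=128, emb_dim=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, trunk_dim), nn.ReLU())
        self.identity_head = nn.Linear(trunk_dim, emb_dim)  # speaker factors
        self.content_head = nn.Linear(trunk_dim, emb_dim)   # linguistic factors

    def forward(self, x):
        h = self.trunk(x)
        return self.identity_head(h), self.content_head(h)
```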
arXiv Detail & Related papers (2020-02-20T14:13:12Z)