BERT, can HE predict contrastive focus? Predicting and controlling
prominence in neural TTS using a language model
- URL: http://arxiv.org/abs/2207.01718v1
- Date: Mon, 4 Jul 2022 20:43:41 GMT
- Title: BERT, can HE predict contrastive focus? Predicting and controlling
prominence in neural TTS using a language model
- Authors: Brooke Stephenson, Laurent Besacier, Laurent Girin, Thomas Hueber
- Abstract summary: We evaluate the accuracy of a BERT model, finetuned to predict quantized acoustic prominence features, on utterances containing contrastive focus.
We also evaluate the controllability of pronoun prominence in a TTS model conditioned on acoustic prominence features.
- Score: 29.188684861193092
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Several recent studies have tested the use of transformer language model
representations to infer prosodic features for text-to-speech synthesis (TTS).
While these studies have explored prosody in general, in this work, we look
specifically at the prediction of contrastive focus on personal pronouns. This
is a particularly challenging task as it often requires semantic, discursive
and/or pragmatic knowledge to predict correctly. We collect a corpus of
utterances containing contrastive focus and we evaluate the accuracy of a BERT
model, finetuned to predict quantized acoustic prominence features, on these
samples. We also investigate how past utterances can provide relevant
information for this prediction. Furthermore, we evaluate the controllability
of pronoun prominence in a TTS model conditioned on acoustic prominence
features.
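To make the prediction setup concrete, below is a minimal sketch (not the authors' code) of fine-tuning BERT to predict quantized word-level prominence, framed as token classification with HuggingFace Transformers. The number of prominence bins, the toy sentence and labels, and the sub-token label alignment are assumptions for illustration only.

```python
# Hedged sketch: BERT fine-tuning for quantized word-level prominence
# prediction, treated as token classification. Not the paper's implementation.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

N_BINS = 4  # hypothetical number of quantized prominence levels

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=N_BINS
)

words = ["I", "didn't", "say", "HE", "stole", "it"]   # toy example
word_labels = [0, 0, 0, 3, 1, 0]                      # toy prominence bins

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

# Align word-level labels with sub-tokens: label the first sub-token of each
# word, mask the rest (and special tokens) with -100 so the loss ignores them.
labels, prev_word = [], None
for word_id in enc.word_ids(batch_index=0):
    if word_id is None or word_id == prev_word:
        labels.append(-100)
    else:
        labels.append(word_labels[word_id])
    prev_word = word_id

out = model(**enc, labels=torch.tensor([labels]))
out.loss.backward()  # one fine-tuning step (optimizer and loop omitted)

# At inference, the argmax over logits gives a prominence bin per token, which
# a prominence-conditioned TTS model could consume as a control signal.
pred_bins = out.logits.argmax(-1)
```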
Related papers
- An investigation of speaker independent phrase break models in End-to-End TTS systems [0.0]
We evaluate the utility and effectiveness of phrase break prediction models in an end-to-end TTS system.
We show by means of perceptual listening evaluations that there is a clear preference for stories synthesized after predicting the location of phrase breaks.
arXiv Detail & Related papers (2023-04-09T04:26:58Z)
- Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech [40.65850332919397]
We propose more powerful pause insertion frameworks based on a pre-trained language model.
Our approach uses bidirectional encoder representations from transformers (BERT) pre-trained on a large-scale text corpus.
We also leverage duration-aware pause insertion for more natural multi-speaker TTS.
arXiv Detail & Related papers (2023-02-27T10:40:41Z)
- Revisiting End-to-End Speech-to-Text Translation From Scratch [48.203394370942505]
End-to-end (E2E) speech-to-text translation (ST) often depends on pretraining its encoder and/or decoder using source transcripts via speech recognition or text translation tasks.
In this paper, we explore the extent to which the quality of E2E ST trained on speech-translation pairs alone can be improved.
arXiv Detail & Related papers (2022-06-09T15:39:19Z)
- An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks [112.1942546460814]
We report the first exploration of the prompt tuning paradigm for speech processing tasks based on the Generative Spoken Language Model (GSLM).
Experiment results show that the prompt tuning technique achieves competitive performance in speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models.
arXiv Detail & Related papers (2022-03-31T03:26:55Z)
- ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech [96.0009517132463]
We introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody attributes into a latent prosody vector (LPV).
We then introduce an LPV predictor, which predicts the LPV given the word sequence, and fine-tune it on a high-quality TTS dataset.
Experimental results show that ProsoSpeech can generate speech with richer prosody compared with baseline methods.
arXiv Detail & Related papers (2022-02-16T01:42:32Z)
- A study on the efficacy of model pre-training in developing neural text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z)
- On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis [102.80458458550999]
We investigate the tradeoffs between sparsity and its subsequent effects on synthetic speech.
Our findings suggest that not only are end-to-end TTS models highly prunable, but also, perhaps surprisingly, pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility.
arXiv Detail & Related papers (2021-10-04T02:03:28Z)
- Did the Cat Drink the Coffee? Challenging Transformers with Generalized Event Knowledge [59.22170796793179]
Transformer Language Models (TLMs) were tested on a benchmark for the dynamic estimation of thematic fit.
Our results show that TLMs can reach performances that are comparable to those achieved by SDM.
However, additional analysis consistently suggests that TLMs do not capture important aspects of event knowledge.
arXiv Detail & Related papers (2021-07-22T20:52:26Z)
- Exploring BERT's Sensitivity to Lexical Cues using Tests from Semantic Priming [8.08493736237816]
We present a case study analyzing the pre-trained BERT model with tests informed by semantic priming.
We find that BERT too shows "priming," predicting a word with greater probability when the context includes a related word versus an unrelated one.
Follow-up analysis shows BERT to be increasingly distracted by related prime words as context becomes more informative.
arXiv Detail & Related papers (2020-10-06T20:30:59Z)