Controllable Speaking Styles Using a Large Language Model
- URL: http://arxiv.org/abs/2305.10321v2
- Date: Tue, 19 Sep 2023 16:35:57 GMT
- Title: Controllable Speaking Styles Using a Large Language Model
- Authors: Atli Thor Sigurgeirsson, Simon King
- Abstract summary: Text-to-Speech (TTS) models can generate multiple, prosodically-different renditions of the same target text.
Currently, controlling these models during inference typically requires finding an appropriate reference utterance.
Here, we give two demonstrations: control of speaking style, and prosody appropriate for a given dialogue context.
- Score: 13.642358232817342
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reference-based Text-to-Speech (TTS) models can generate multiple,
prosodically-different renditions of the same target text. Such models jointly
learn a latent acoustic space during training, which can be sampled from during
inference. Controlling these models during inference typically requires finding
an appropriate reference utterance, which is non-trivial.
Large generative language models (LLMs) have shown excellent performance in
various language-related tasks. Given only a natural language query text (the
prompt), such models can be used to solve specific, context-dependent tasks.
Recent work in TTS has attempted similar prompt-based control of novel speaking
style generation. Those methods do not require a reference utterance and can,
under ideal conditions, be controlled with only a prompt. But existing methods
typically require a prompt-labelled speech corpus for jointly training a
prompt-conditioned encoder.
In contrast, we instead employ an LLM to directly suggest prosodic
modifications for a controllable TTS model, using contextual information
provided in the prompt. The prompt can be designed for a multitude of tasks.
Here, we give two demonstrations: control of speaking style, and generation of
prosody appropriate for a given dialogue context. The proposed method is rated
most appropriate in 50% of cases, versus 31% for a baseline model.
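As described, the LLM is prompted with contextual information and directly suggests prosodic modifications for a controllable TTS model. A minimal sketch of how such LLM-suggested prosody values might be parsed and sanity-checked before being handed to a TTS model is shown below; the JSON schema, parameter names, and clamping ranges are illustrative assumptions, not the authors' implementation:

```python
import json

# Illustrative prosody controls an LLM might be asked to emit per utterance.
# The names and (min, max) ranges are assumptions for this sketch.
PROSODY_BOUNDS = {
    "pitch_scale": (0.5, 2.0),     # multiplier on F0
    "duration_scale": (0.5, 2.0),  # multiplier on phone durations
    "energy_scale": (0.5, 2.0),    # multiplier on frame energy
}

def parse_prosody_suggestion(llm_response: str) -> dict:
    """Parse an LLM's JSON reply into clamped prosody controls.

    Unknown keys are ignored; missing or non-numeric keys default
    to 1.0 (no change), and values are clamped to the allowed range.
    """
    try:
        raw = json.loads(llm_response)
    except json.JSONDecodeError:
        raw = {}  # fall back to neutral prosody on malformed output
    controls = {}
    for name, (lo, hi) in PROSODY_BOUNDS.items():
        value = raw.get(name, 1.0)
        if not isinstance(value, (int, float)):
            value = 1.0
        controls[name] = min(max(float(value), lo), hi)
    return controls

# Example: a reply suggesting a subdued, slower speaking style.
reply = '{"pitch_scale": 0.8, "duration_scale": 1.3, "energy_scale": 0.7}'
print(parse_prosody_suggestion(reply))
# → {'pitch_scale': 0.8, 'duration_scale': 1.3, 'energy_scale': 0.7}
```

Clamping and defaulting matter in this setting because a free-form LLM reply is not guaranteed to be valid JSON or to stay within ranges the TTS model was trained on.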
Related papers
- TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models [51.529485094900934]
We propose TextrolSpeech, the first large-scale speech emotion dataset annotated with rich text attributes.
We introduce a multi-stage prompt programming approach that effectively utilizes the GPT model for generating natural style descriptions in large volumes.
To address the need for generating audio with greater style diversity, we propose an efficient architecture called Salle.
arXiv Detail & Related papers (2023-08-28T09:06:32Z)
- Stabilized In-Context Learning with Pre-trained Language Models for Few Shot Dialogue State Tracking [57.92608483099916]
Large pre-trained language models (PLMs) have shown impressive unaided performance across many NLP tasks.
For more complex tasks such as dialogue state tracking (DST), designing prompts that reliably convey the desired intent is nontrivial.
We introduce a saliency model to limit dialogue text length, allowing us to include more exemplars per query.
arXiv Detail & Related papers (2023-02-12T15:05:10Z)
- Don't Prompt, Search! Mining-based Zero-Shot Learning with Language Models [37.8952605358518]
Masked language models like BERT can perform text classification in a zero-shot fashion.
We propose an alternative mining-based approach for zero-shot learning.
arXiv Detail & Related papers (2022-10-26T15:52:30Z)
- Few-shot Prompting Towards Controllable Response Generation [49.479958672988566]
We first explored the combination of prompting and reinforcement learning (RL) to steer models' generation without accessing any of the models' parameters.
We apply multi-task learning to make the model learn to generalize to new tasks better.
Experiment results show that our proposed method can successfully control several state-of-the-art (SOTA) dialogue models without accessing their parameters.
arXiv Detail & Related papers (2022-06-08T14:48:06Z)
- An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks [112.1942546460814]
We report the first exploration of the prompt tuning paradigm for speech processing tasks based on the Generative Spoken Language Model (GSLM).
Experiment results show that the prompt tuning technique achieves competitive performance in speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models.
arXiv Detail & Related papers (2022-03-31T03:26:55Z)
- Guided-TTS: Text-to-Speech with Untranscribed Speech [22.548875263927396]
We present Guided-TTS, a high-quality TTS model that learns to generate speech from untranscribed speech data.
For text-to-speech synthesis, we guide the generative process of the unconditional DDPM via phoneme classification to produce mel-spectrograms.
arXiv Detail & Related papers (2021-11-23T10:05:05Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Language Models as Few-Shot Learner for Task-Oriented Dialogue Systems [74.8759568242933]
Task-oriented dialogue systems use four connected modules: Natural Language Understanding (NLU), Dialogue State Tracking (DST), Dialogue Policy (DP), and Natural Language Generation (NLG).
A research challenge is to learn each module with the least amount of samples given the high cost related to the data collection.
We evaluate the priming few-shot ability of language models in the NLU, DP and NLG tasks.
arXiv Detail & Related papers (2020-08-14T08:23:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.