Text-Free Prosody-Aware Generative Spoken Language Modeling
- URL: http://arxiv.org/abs/2109.03264v1
- Date: Tue, 7 Sep 2021 18:03:21 GMT
- Title: Text-Free Prosody-Aware Generative Spoken Language Modeling
- Authors: Eugene Kharitonov, Ann Lee, Adam Polyak, Yossi Adi, Jade Copet, Kushal
Lakhotia, Tu-Anh Nguyen, Morgane Rivi\`ere, Abdelrahman Mohamed, Emmanuel
Dupoux, Wei-Ning Hsu
- Abstract summary: We present a prosody-aware generative spoken language model (pGSLM)
It is composed of a multi-stream transformer language model (MS-TLM) of speech, represented as discovered unit and prosodic feature streams, and an adapted HiFi-GAN model converting MS-TLM outputs to waveforms.
Experimental results show that the pGSLM can utilize prosody to improve both prosody and content modeling, and also generate natural, meaningful, and coherent speech given a spoken prompt.
- Score: 46.19240899818964
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech pre-training has primarily demonstrated efficacy on classification
tasks, while its capability of generating novel speech, similar to how GPT-2
can generate coherent paragraphs, has barely been explored. Generative Spoken
Language Modeling (GSLM) (Lakhotia et al., 2021) is the only prior work
addressing the generative aspects of speech pre-training, which replaces text
with discovered phone-like units for language modeling and shows the ability to
generate meaningful novel sentences. Unfortunately, despite eliminating the
need of text, the units used in GSLM discard most of the prosodic information.
Hence, GSLM fails to leverage prosody for better comprehension, and does not
generate expressive speech. In this work, we present a prosody-aware generative
spoken language model (pGSLM). It is composed of a multi-stream transformer
language model (MS-TLM) of speech, represented as discovered unit and prosodic
feature streams, and an adapted HiFi-GAN model converting MS-TLM outputs to
waveforms. We devise a series of metrics for prosody modeling and generation,
and re-use metrics from GSLM for content modeling. Experimental results show
that the pGSLM can utilize prosody to improve both prosody and content
modeling, and also generate natural, meaningful, and coherent speech given a
spoken prompt. Audio samples can be found at https://speechbot.github.io/pgslm.
Related papers
- SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks [94.10497337235083]
We are first to explore the potential of prompting speech LMs in the domain of speech processing.
We reformulate speech processing tasks into speech-to-unit generation tasks.
We show that the prompting method can achieve competitive performance compared to the strong fine-tuning method.
arXiv Detail & Related papers (2024-08-23T13:00:10Z) - Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer [39.31849739010572]
We introduce textbfGenerative textbfPre-trained textbfSpeech textbfTransformer (GPST)
GPST is a hierarchical transformer designed for efficient speech language modeling.
arXiv Detail & Related papers (2024-06-03T04:16:30Z) - Spirit LM: Interleaved Spoken and Written Language Model [43.8568445216866]
We introduce Spirit LM, a foundation multimodal language model that freely mixes text and speech.
Spirit LM comes in two versions: a Base version that uses speech phonetic units (HuBERT) and an Expressive version that models expressivity using pitch and style units.
We demonstrate that Spirit LM can learn new tasks in a few-shot fashion across modalities.
arXiv Detail & Related papers (2024-02-08T15:39:32Z) - Toward Joint Language Modeling for Speech Units and Text [89.32163954508489]
We explore joint language modeling for speech units and text.
We introduce automatic metrics to evaluate how well the joint LM mixes speech and text.
Our results show that by mixing speech units and text with our proposed mixing techniques, the joint LM improves over a speech-only baseline on SLU tasks.
arXiv Detail & Related papers (2023-10-12T20:53:39Z) - Generative Spoken Language Model based on continuous word-sized audio
tokens [52.081868603603844]
We introduce a Generative Spoken Language Model based on word-size continuous-valued audio embeddings.
The resulting model is the first generative language model based on word-size continuous embeddings.
arXiv Detail & Related papers (2023-10-08T16:46:14Z) - AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z) - Towards Language Modelling in the Speech Domain Using Sub-word
Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.