Generative Spoken Language Modeling from Raw Audio
- URL: http://arxiv.org/abs/2102.01192v1
- Date: Mon, 1 Feb 2021 21:41:40 GMT
- Title: Generative Spoken Language Modeling from Raw Audio
- Authors: Kushal Lakhotia, Evgeny Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam
Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman
Mohamed, Emmanuel Dupoux
- Abstract summary: Generative spoken language modeling involves learning jointly the acoustic and linguistic characteristics of a language from raw audio only (without text or labels).
We introduce metrics to automatically evaluate the generated output in terms of acoustic and linguistic quality in two associated end-to-end tasks.
We test baseline systems consisting of a discrete speech encoder (returning discrete, low-bitrate pseudo-text units), a generative language model (trained on pseudo-text units), and a speech decoder.
- Score: 42.153136032037175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative spoken language modeling involves learning jointly the acoustic
and linguistic characteristics of a language from raw audio only (without text
or labels). We introduce metrics to automatically evaluate the generated output
in terms of acoustic and linguistic quality in two associated end-to-end tasks,
respectively: speech resynthesis (repeating the speech input using the system's
own voice), and speech generation (producing novel speech outputs conditional
on a spoken prompt, or unconditionally), and validate these metrics with human
judgment. We test baseline systems consisting of a discrete speech encoder
(returning discrete, low bitrate, pseudo-text units), a generative language
model (trained on pseudo-text units), and a speech decoder (generating a
waveform from pseudo-text). By comparing three state-of-the-art unsupervised
speech encoders (Contrastive Predictive Coding (CPC), wav2vec 2.0, HuBERT), and
varying the number of discrete units (50, 100, 200), we investigate how the
generative performance depends on the quality of the learned units as measured
by unsupervised metrics (zero-shot probe tasks). We will open source our
evaluation stack and baseline models.
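To make the three-stage pipeline concrete, the following is a minimal, illustrative Python sketch of its shape: frame-level features are quantized into discrete units with k-means, a toy language model is fit over the unit sequence, and a unit-to-waveform decoder would map generated units back to audio. The random features, bigram model, and fixed unit count are stand-ins invented for illustration; the paper's actual systems use learned encoders (CPC, wav2vec 2.0, HuBERT), a neural language model, and a neural speech decoder.

```python
# Illustrative sketch of the encoder -> unit LM -> decoder pipeline.
# All components here are toy stand-ins, not the paper's models.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# 1. Discrete speech encoder: quantize frame features into pseudo-text units.
#    Random vectors stand in for CPC / wav2vec 2.0 / HuBERT frame features.
features = rng.normal(size=(1000, 64))
K = 100  # the paper sweeps 50, 100, and 200 units
units = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(features)

# 2. Generative language model over pseudo-text units.
#    A smoothed bigram model stands in for a neural unit LM.
counts = np.ones((K, K))  # add-one smoothing
for prev, nxt in zip(units[:-1], units[1:]):
    counts[prev, nxt] += 1
probs = counts / counts.sum(axis=1, keepdims=True)

def sample_units(prompt, length=50):
    """Continue a unit sequence from a spoken prompt (speech generation)."""
    seq = list(prompt)
    for _ in range(length):
        seq.append(int(rng.choice(K, p=probs[seq[-1]])))
    return seq

generated = sample_units(units[:10])

# 3. Speech decoder: a unit-to-waveform model would synthesize audio from
#    `generated`; omitted here, so we just inspect the sampled units.
print(generated[:20])
```

In these terms, speech resynthesis corresponds to running stages 1 and 3 on the same utterance, while speech generation inserts stage 2 to continue or sample unit sequences before decoding.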
Related papers
- Generative Spoken Language Model based on continuous word-sized audio tokens [52.081868603603844]
We introduce a Generative Spoken Language Model based on word-sized continuous-valued audio embeddings.
The resulting model is the first generative language model built on word-sized continuous embeddings.
arXiv Detail & Related papers (2023-10-08T16:46:14Z)
- Zero Resource Code-switched Speech Benchmark Using Speech Utterance Pairs For Multiple Spoken Languages [49.6922490267701]
We introduce a new zero resource code-switched speech benchmark designed to assess the code-switching capabilities of self-supervised speech encoders.
We showcase a baseline system of language modeling on discrete units to demonstrate how the code-switching abilities of speech encoders can be assessed.
arXiv Detail & Related papers (2023-10-04T17:58:11Z)
- Direct Text to Speech Translation System using Acoustic Units [12.36988942647101]
This paper proposes a direct text-to-speech translation system using discrete acoustic units.
The framework takes text in different source languages as input and generates speech in the target language without the need for text transcriptions in that language.
Results show a marked improvement when the proposed architecture is initialised from a model pre-trained on more languages.
arXiv Detail & Related papers (2023-09-14T07:35:14Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation and formulate a self-supervised pseudo speech recognition task (a toy sketch of this idea appears after this list).
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech [3.42658286826597]
We introduce an approach to multilingual speech synthesis which uses the meta-learning concept of contextual parameter generation.
Our model is shown to share information effectively across languages, and in a subjective evaluation it produces more natural and accurate code-switching speech than the baselines.
arXiv Detail & Related papers (2020-08-03T10:43:30Z)
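Picking up the Wav2Seq entry above: the sketch below shows, in toy form, how a pseudo language can be induced from discrete units by collapsing consecutive repeats and then merging frequent adjacent pairs (a miniature BPE). The unit sequence and merge count are invented for illustration and are not Wav2Seq's actual tokenizer or settings.

```python
# Toy pseudo-language induction: dedup consecutive unit ids, then merge
# frequent adjacent pairs. Illustrative only; not Wav2Seq's implementation.
from collections import Counter
from itertools import groupby

units = [3, 3, 3, 7, 7, 1, 3, 3, 7, 7, 1, 1, 5]  # stand-in cluster ids

# 1. Collapse consecutive repeats, discarding duration information.
deduped = [k for k, _ in groupby(units)]  # -> [3, 7, 1, 3, 7, 1, 5]

# 2. Miniature BPE: repeatedly fuse the most frequent adjacent pair.
def merge_pairs(seq, num_merges=2):
    seq = [str(t) for t in seq]
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(a + "_" + b)  # fuse the pair into one token
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return seq

print(merge_pairs(deduped))  # -> ['3_7_1', '3_7_1', '5']
```

The fused tokens act as compact "pseudo words" that an encoder-decoder model can be pre-trained to recognize, in the spirit of the pseudo speech recognition task described above.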