Generative Spoken Language Model based on continuous word-sized audio
tokens
- URL: http://arxiv.org/abs/2310.05224v1
- Date: Sun, 8 Oct 2023 16:46:14 GMT
- Title: Generative Spoken Language Model based on continuous word-sized audio
tokens
- Authors: Robin Algayres, Yossi Adi, Tu Anh Nguyen, Jade Copet, Gabriel
Synnaeve, Benoit Sagot, Emmanuel Dupoux
- Abstract summary: We introduce a Generative Spoken Language Model based on word-size continuous-valued audio embeddings.
The resulting model is the first generative language model based on word-size continuous embeddings.
- Score: 52.081868603603844
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In NLP, text language models based on words or subwords are known to
outperform their character-based counterparts. Yet, in the speech community,
the standard inputs of spoken LMs are 20ms- or 40ms-long discrete units (shorter
than a phoneme). Taking inspiration from word-based LM, we introduce a
Generative Spoken Language Model (GSLM) based on word-size continuous-valued
audio embeddings that can generate diverse and expressive language output. This
is obtained by replacing the lookup table for lexical types with a Lexical
Embedding function, the cross-entropy loss with a contrastive loss, and
multinomial sampling with k-NN sampling. The resulting model is the first
generative language model based on word-size continuous embeddings. Its
performance is on par with discrete unit GSLMs regarding generation quality as
measured by automatic metrics and subjective human judgements. Moreover, it is
five times more memory efficient thanks to its large 200ms units. In addition,
the embeddings before and after the Lexical Embedder are phonetically and
semantically interpretable.
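The abstract names k-NN sampling as the replacement for multinomial sampling but gives no details. The following is a minimal sketch of the general idea only, not the authors' implementation: a predicted continuous embedding is compared against a bank of stored word embeddings, and one of the k closest entries is drawn at random. The function names, the use of cosine similarity, the softmax weighting, the `temperature` parameter, and the toy two-dimensional bank are all assumptions made for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_sample(query, bank, k=3, temperature=1.0, rng=random):
    """Return the index of one bank entry drawn from the k entries nearest to `query`.

    Hypothetical helper: neighbours are weighted by a softmax over their
    cosine similarity to the query; this weighting is an assumption.
    """
    sims = [(cosine(query, vec), idx) for idx, vec in enumerate(bank)]
    top = sorted(sims, reverse=True)[:k]  # k most similar entries
    best = max(s for s, _ in top)
    weights = [math.exp((s - best) / temperature) for s, _ in top]
    indices = [idx for _, idx in top]
    return rng.choices(indices, weights=weights, k=1)[0]


# Toy bank of four 2-dimensional "word embeddings" (illustrative values only).
bank = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]

# The query is close to entries 0 and 1, so with k=2 one of those is returned.
picked = knn_sample([1.0, 0.05], bank, k=2, rng=random.Random(0))
```

With k=1 this reduces to a deterministic nearest-neighbour lookup; larger k trades fidelity to the prediction for diversity in the generated sequence.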
Related papers
- SyllableLM: Learning Coarse Semantic Units for Speech Language Models [21.762112843104028]
We introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units.
Our method produces controllable-rate semantic units at as low as 5Hz and 60bps and achieves SotA in segmentation and clustering.
SyllableLM achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
arXiv Detail & Related papers (2024-10-05T04:29:55Z)
- The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics [74.99898531299148]
This research examines vocabulary trimming (VT) inspired by restricting embedding entries to the language of interest to bolster time and memory efficiency.
We apply two heuristics to trim the full vocabulary - Unicode-based script filtering and corpus-based selection - to different language families and sizes.
It is found that VT reduces the memory usage of small models by nearly 50% and has an upper bound of 25% improvement in generation speed.
arXiv Detail & Related papers (2023-11-16T09:35:50Z)
- GanLM: Encoder-Decoder Pre-training with an Auxiliary Discriminator [114.8954615026781]
We propose a GAN-style model for encoder-decoder pre-training by introducing an auxiliary discriminator.
GanLM is trained with two pre-training objectives: replaced token detection and replaced token denoising.
Experiments on language generation benchmarks show that GanLM, with its strong language understanding capability, outperforms various strong pre-trained language models.
arXiv Detail & Related papers (2022-12-20T12:51:11Z)
- Quark: Controllable Text Generation with Reinforced Unlearning [68.07749519374089]
Large-scale language models often learn behaviors that are misaligned with user expectations.
We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property.
For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods.
arXiv Detail & Related papers (2022-05-26T21:11:51Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Text-Free Prosody-Aware Generative Spoken Language Modeling [46.19240899818964]
We present a prosody-aware generative spoken language model (pGSLM).
It is composed of a multi-stream transformer language model (MS-TLM) of speech, represented as discovered unit and prosodic feature streams, and an adapted HiFi-GAN model converting MS-TLM outputs to waveforms.
Experimental results show that the pGSLM can utilize prosody to improve both prosody and content modeling, and also generate natural, meaningful, and coherent speech given a spoken prompt.
arXiv Detail & Related papers (2021-09-07T18:03:21Z)
- Generative Spoken Language Modeling from Raw Audio [42.153136032037175]
Generative spoken language modeling involves jointly learning the acoustic and linguistic characteristics of a language from raw audio only (without text or labels).
We introduce metrics to automatically evaluate the generated output in terms of acoustic and linguistic quality in two associated end-to-end tasks.
We test baseline systems consisting of a discrete speech encoder (returning discrete, low, pseudo-text units), a generative language model (trained on pseudo-text units) and a speech decoder.
arXiv Detail & Related papers (2021-02-01T21:41:40Z)
- The Zero Resource Speech Benchmark 2021: Metrics and baselines for unsupervised spoken language modeling [23.517751578968344]
We introduce a new unsupervised task, spoken language modeling: the learning of linguistic representations from raw audio signals without any labels.
We present the results and analyses of a composite baseline made of self-supervised contrastive representation learning (CPC), clustering (k-means) and language modeling (LSTM or BERT).
This simple pipeline shows better-than-chance performance on all four metrics, demonstrating the feasibility of spoken language modeling from raw speech.
arXiv Detail & Related papers (2020-11-23T18:01:37Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.