Towards Language Modelling in the Speech Domain Using Sub-word
Linguistic Units
- URL: http://arxiv.org/abs/2111.00610v1
- Date: Sun, 31 Oct 2021 22:48:30 GMT
- Title: Towards Language Modelling in the Speech Domain Using Sub-word
Linguistic Units
- Authors: Anurag Katakkar, Alan W Black
- Abstract summary: We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
- Score: 56.52704348773307
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models (LMs) for text data have been studied extensively for their
usefulness in language generation and other downstream tasks. However, language
modelling purely in the speech domain is still a relatively unexplored topic,
with traditional speech LMs often depending on auxiliary text LMs for learning
distributional aspects of the language. For the English language, these LMs
treat words as atomic units, which presents inherent challenges to language
modelling in the speech domain. In this paper, we propose a novel LSTM-based
generative speech LM that is inspired by the CBOW model and built on linguistic
units including syllables and phonemes. These units offer better acoustic
consistency across utterances in the dataset than single mel-spectrogram frames
or whole words. With a limited dataset, orders of magnitude smaller than that
required by contemporary generative models, our model closely approximates
babbling speech. We show the effect of training with auxiliary text LMs,
multitask learning objectives, and auxiliary articulatory features. Through our
experiments, we also highlight some well-known but poorly documented
challenges in training generative speech LMs, including the mismatch between
the supervised learning objective with which these models are trained, such as
Mean Squared Error (MSE), and the true objective, which is speech quality. Our
experiments provide an early indication that while validation loss and Mel
Cepstral Distortion (MCD) are not strongly correlated with generated speech
quality, traditional text language modelling metrics like perplexity and
next-token-prediction accuracy might be.
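To make the setup described in the abstract concrete, the following is a minimal sketch of an LSTM that models sequences of sub-word unit features and is trained with an MSE regression objective. The class name, feature dimensions, and hyperparameters are illustrative assumptions (written with PyTorch), not the configuration reported in the paper.

```python
import torch
import torch.nn as nn

class SubwordSpeechLM(nn.Module):
    """Sketch of an LSTM generative speech LM over sub-word linguistic units
    (syllables or phonemes). Each unit is represented by a fixed-length
    acoustic feature vector; the model regresses the features of the next
    unit. Dimensions are illustrative assumptions, not the paper's values."""

    def __init__(self, feat_dim=80, hidden_dim=256, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, feat_dim)

    def forward(self, units):                  # units: (batch, seq, feat_dim)
        hidden, _ = self.lstm(units)
        return self.proj(hidden)               # predicted next-unit features


# One training step: predict unit t+1 from units 1..t and minimise MSE,
# the supervised objective whose mismatch with perceived speech quality
# the abstract highlights.
model = SubwordSpeechLM()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

units = torch.randn(4, 20, 80)                 # toy batch: 4 utterances x 20 units
pred = model(units[:, :-1])                    # inputs: units 1..T-1
loss = criterion(pred, units[:, 1:])           # targets: units 2..T
loss.backward()
optimiser.step()
```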
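The evaluation metrics named in the abstract can likewise be sketched. Below are self-contained implementations of the standard Mel Cepstral Distortion formula and of perplexity computed from per-token log-probabilities; the function names and toy inputs are assumptions for illustration, not the paper's evaluation code.

```python
import numpy as np

def mel_cepstral_distortion(ref, syn):
    """Frame-averaged Mel Cepstral Distortion (dB) between aligned sequences
    of mel-cepstral coefficients, each of shape (frames, coeffs). Uses the
    standard (10 / ln 10) * sqrt(2 * sum of squared differences) form; the
    energy coefficient c0 is assumed to be excluded by the caller."""
    diff = np.asarray(ref) - np.asarray(syn)
    return float(np.mean(10.0 / np.log(10.0)
                         * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))))

def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities assigned to the correct
    next unit, the text-LM-style metric the abstract refers to."""
    return float(np.exp(-np.mean(token_log_probs)))

# Toy usage, purely illustrative.
ref = np.random.randn(100, 24)
syn = ref + 0.1 * np.random.randn(100, 24)
print(mel_cepstral_distortion(ref, syn))
print(perplexity(np.log(np.full(50, 0.2))))    # uniform over 5 units -> 5.0
```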
Related papers
- Train & Constrain: Phonologically Informed Tongue-Twister Generation from Topics and Paraphrases [24.954896926774627]
We present a pipeline for generating phonologically informed tongue-twisters from Large Language Models (LLMs)
We also present the results of automatic and human evaluation of smaller models trained on our generated dataset.
We introduce a Phoneme-Aware Constrained Decoding module (PACD) that can be integrated into any causal language model.
arXiv Detail & Related papers (2024-03-20T18:13:17Z)
- Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process.
Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
arXiv Detail & Related papers (2024-02-08T16:55:21Z)
- Teach me with a Whisper: Enhancing Large Language Models for Analyzing Spoken Transcripts using Speech Embeddings [8.660203441911554]
We propose a methodology for training language models leveraging spoken language audio data.
This leads to an improved language model for analyzing spoken transcripts while avoiding an audio processing overhead at test time.
In our experiments, the student model achieves consistent improvement over traditional language models on tasks analyzing spoken transcripts.
arXiv Detail & Related papers (2023-11-13T01:53:12Z)
- Toward Joint Language Modeling for Speech Units and Text [89.32163954508489]
We explore joint language modeling for speech units and text.
We introduce automatic metrics to evaluate how well the joint LM mixes speech and text.
Our results show that by mixing speech units and text with our proposed mixing techniques, the joint LM improves over a speech-only baseline on SLU tasks.
arXiv Detail & Related papers (2023-10-12T20:53:39Z)
- On decoder-only architecture for speech-to-text and large language model integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models.
We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
arXiv Detail & Related papers (2023-07-08T06:47:58Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- The Interpreter Understands Your Meaning: End-to-end Spoken Language Understanding Aided by Speech Translation [13.352795145385645]
Speech translation (ST) is a good means of pretraining speech models for end-to-end spoken language understanding.
We show that our models reach higher performance over baselines on monolingual and multilingual intent classification.
We also create new benchmark datasets for speech summarization and low-resource/zero-shot transfer from English to French or Spanish.
arXiv Detail & Related papers (2023-05-16T17:53:03Z)
- Text-Free Prosody-Aware Generative Spoken Language Modeling [46.19240899818964]
We present a prosody-aware generative spoken language model (pGSLM)
It is composed of a multi-stream transformer language model (MS-TLM) of speech, represented as discovered unit and prosodic feature streams, and an adapted HiFi-GAN model converting MS-TLM outputs to waveforms.
Experimental results show that the pGSLM can utilize prosody to improve both prosody and content modeling, and also generate natural, meaningful, and coherent speech given a spoken prompt.
arXiv Detail & Related papers (2021-09-07T18:03:21Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.