Are discrete units necessary for Spoken Language Modeling?
- URL: http://arxiv.org/abs/2203.05936v1
- Date: Fri, 11 Mar 2022 14:14:35 GMT
- Title: Are discrete units necessary for Spoken Language Modeling?
- Authors: Tu Anh Nguyen, Benoit Sagot, Emmanuel Dupoux
- Abstract summary: Recent work in spoken language modeling shows the possibility of learning a language from raw audio without any text labels.
We show that discretization is indeed essential for good results in spoken language modeling.
We also show that an end-to-end model trained with a discrete target, like HuBERT, achieves results similar to the best language model trained on pseudo-text.
- Score: 10.374092717909603
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work in spoken language modeling shows the possibility of learning a language from raw audio without any text labels. The approach relies first on transforming the audio into a sequence of discrete units (or pseudo-text) and then training a language model directly on such pseudo-text. Is such a discrete bottleneck necessary, potentially introducing irreversible errors in the encoding of the speech signal, or could we learn a language model without discrete units at all? In this work, we show that discretization is indeed essential for good results in spoken language modeling, but that the discrete bottleneck can be omitted if we use discrete target features from a higher level than the input features. We also show that an end-to-end model trained with a discrete target, like HuBERT, achieves results similar to the best language model trained on pseudo-text on a set of zero-shot spoken language modeling metrics from the Zero Resource Speech Challenge 2021.
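To make the pipeline described in the abstract concrete, here is a minimal sketch of the discrete-unit route: continuous frame-level speech features are quantized by k-means into a sequence of unit IDs (the "pseudo-text"), and a language model is then trained on those IDs. Random vectors stand in for real CPC/HuBERT features, the 50-unit codebook is an arbitrary choice, and a toy bigram model replaces the Transformer LM used in practice; none of these specifics come from the paper.

```python
# Minimal sketch of the "discrete bottleneck" pipeline, under the
# assumptions stated above (random features, K=50 units, bigram LM).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 256))  # (frames, feature_dim) for one utterance

# Step 1: discretize -- each frame becomes one of K unit IDs (pseudo-text).
K = 50
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(features)
pseudo_text = kmeans.predict(features)  # e.g. [12, 12, 7, 33, ...]

# Step 2: train a language model on the unit sequence.
# A toy add-one-smoothed bigram model keeps the sketch self-contained.
counts = np.ones((K, K))
for prev, nxt in zip(pseudo_text[:-1], pseudo_text[1:]):
    counts[prev, nxt] += 1
bigram_probs = counts / counts.sum(axis=1, keepdims=True)

# Step 3: score unit sequences with the LM, which is how zero-shot
# metrics such as ZeroSpeech 2021's sWUGGY compare the probability
# of a real word against a matched non-word.
def log_prob(units):
    return sum(np.log(bigram_probs[p, n]) for p, n in zip(units[:-1], units[1:]))

print(log_prob(pseudo_text[:20]))
```

The paper's question is whether the quantization in step 1 can be dropped; its answer is that discreteness matters, either as an input bottleneck (pseudo-text) or as a prediction target, as in HuBERT.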
Related papers
- SpeechAlign: Aligning Speech Generation to Human Preferences [51.684183257809075]
We introduce SpeechAlign, an iterative self-improvement strategy that aligns speech language models to human preferences.
We show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech language model.
arXiv Detail & Related papers (2024-04-08T15:21:17Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently-proposed Raw Audio-Visual Speech (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that: (1) multilingual models trained with more data outperform monolingual ones, but, when the amount of data is kept fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM built on linguistic units, including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
- Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition [80.446770909975]
Linguistic knowledge is of great benefit to scene text recognition.
How to effectively model linguistic rules in end-to-end deep networks remains a research challenge.
We propose an autonomous, bidirectional and iterative ABINet for scene text recognition.
arXiv Detail & Related papers (2021-03-11T06:47:45Z)
- Learning Spoken Language Representations with Neural Lattice Language Modeling [39.50831917042577]
We propose a framework that trains neural lattice language models to provide contextualized representations for spoken language understanding tasks.
The proposed two-stage pre-training approach reduces the demand for speech data and has better efficiency.
arXiv Detail & Related papers (2020-07-06T10:38:03Z)
- Multilingual Jointly Trained Acoustic and Written Word Embeddings [22.63696520064212]
We extend this idea to multiple low-resource languages.
We jointly train an acoustic word embedding (AWE) model and an acoustically grounded word embedding (AGWE) model, using phonetically transcribed data from multiple languages.
The pre-trained models can then be used for unseen zero-resource languages, or fine-tuned on data from low-resource languages.
arXiv Detail & Related papers (2020-06-24T19:16:02Z)
- Towards Zero-shot Learning for Automatic Phonemic Transcription [82.9910512414173]
A more challenging problem is to build phonemic transcribers for languages with zero training data.
Our model is able to recognize unseen phonemes in the target language without any training data.
It achieves a 7.7% better phoneme error rate on average than a standard multilingual model.
arXiv Detail & Related papers (2020-02-26T20:38:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.