The Zero Resource Speech Benchmark 2021: Metrics and baselines for
unsupervised spoken language modeling
- URL: http://arxiv.org/abs/2011.11588v2
- Date: Tue, 1 Dec 2020 15:53:57 GMT
- Title: The Zero Resource Speech Benchmark 2021: Metrics and baselines for
unsupervised spoken language modeling
- Authors: Tu Anh Nguyen, Maureen de Seyssel, Patricia Rozé, Morgane Rivière,
Evgeny Kharitonov, Alexei Baevski, Ewan Dunbar, Emmanuel Dupoux
- Abstract summary: We introduce a new unsupervised task, spoken language modeling: the learning of linguistic representations from raw audio signals without any labels.
We present the results and analyses of a composite baseline made of self-supervised contrastive representation learning (CPC), clustering (k-means), and language modeling (LSTM or BERT).
This simple pipeline shows better than chance performance on all four metrics, demonstrating the feasibility of spoken language modeling from raw speech.
- Score: 23.517751578968344
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a new unsupervised task, spoken language modeling: the learning
of linguistic representations from raw audio signals without any labels, along
with the Zero Resource Speech Benchmark 2021: a suite of 4 black-box, zero-shot
metrics probing for the quality of the learned models at 4 linguistic levels:
phonetics, lexicon, syntax and semantics. We present the results and analyses
of a composite baseline made of the concatenation of three unsupervised
systems: self-supervised contrastive representation learning (CPC), clustering
(k-means) and language modeling (LSTM or BERT). The language models learn on
the basis of the pseudo-text derived from clustering the learned
representations. This simple pipeline shows better than chance performance on
all four metrics, demonstrating the feasibility of spoken language modeling
from raw speech. It also yields worse performance compared to text-based
'topline' systems trained on the same data, delineating the space to be
explored by more sophisticated end-to-end models.
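As a rough illustration of this baseline (a sketch, not the authors' code), the snippet below quantizes frame-level features with k-means and converts each utterance into a pseudo-text of discrete unit IDs. The random features stand in for pretrained CPC representations; the 50-unit inventory and the repeat-collapsing step are shown as common choices, not necessarily the paper's exact configuration.

```python
# Sketch of the CPC -> k-means -> language-model pipeline.
# Random arrays stand in for pretrained CPC frame embeddings.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Placeholder features: one (n_frames, 256) array per utterance.
utterances = [rng.normal(size=(rng.integers(50, 100), 256)) for _ in range(20)]

# Step 1: fit k-means over pooled frames to define a discrete unit
# inventory (50 pseudo-phone clusters, an assumed setting).
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0)
kmeans.fit(np.concatenate(utterances, axis=0))

# Step 2: transcribe each utterance into a pseudo-text of unit IDs,
# collapsing consecutive repeats (a common optional step).
def to_pseudo_text(frames):
    units = kmeans.predict(frames)
    return [int(u) for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

pseudo_corpus = [to_pseudo_text(u) for u in utterances]

# Step 3: the pseudo-texts can feed a standard sequence model
# (LSTM or BERT in the paper); here we just show one "sentence".
print(" ".join(map(str, pseudo_corpus[0])))
```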
Related papers
- SyllableLM: Learning Coarse Semantic Units for Speech Language Models [21.762112843104028]
We introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units.
Our method produces controllable-rate semantic units at rates as low as 5Hz and 60bps, and achieves state-of-the-art (SotA) syllabic segmentation and clustering.
SyllableLM achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
arXiv Detail & Related papers (2024-10-05T04:29:55Z)
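For intuition only, here is a hypothetical merging rule in the spirit of the entry above: adjacent frames are greedily pooled into one unit while they stay cosine-similar to the unit's running mean. This is an illustrative heuristic, not SyllableLM's actual algorithm; all shapes, rates, and thresholds are assumptions.

```python
# Hypothetical frame-merging heuristic (not SyllableLM's method):
# pool adjacent frames into coarser, syllable-like units.
import numpy as np

def merge_frames(frames, threshold=0.9):
    """Greedily extend a unit while the next frame's cosine similarity
    to the unit's running mean stays above `threshold`."""
    units, current = [], [frames[0]]
    for frame in frames[1:]:
        mean = np.mean(current, axis=0)
        cos = float(frame @ mean) / (np.linalg.norm(frame) * np.linalg.norm(mean) + 1e-8)
        if cos >= threshold:
            current.append(frame)                    # same unit: keep pooling
        else:
            units.append(np.mean(current, axis=0))   # close the unit
            current = [frame]
    units.append(np.mean(current, axis=0))
    return np.stack(units)                           # (n_units, dim)

rng = np.random.default_rng(0)
base = rng.normal(size=(10, 64))                 # 10 underlying "syllables"
frames = np.repeat(base, 20, axis=0)             # 200 frames of toy features
frames = frames + 0.05 * rng.normal(size=frames.shape)  # frame-level noise
units = merge_frames(frames)
print(f"{len(frames)} frames -> {len(units)} units")
```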
- Generative Spoken Language Model based on continuous word-sized audio tokens [52.081868603603844]
We introduce a Generative Spoken Language Model based on word-size continuous-valued audio embeddings.
The resulting model is the first generative language model based on word-size continuous embeddings.
arXiv Detail & Related papers (2023-10-08T16:46:14Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multilingual models trained on more data outperform monolingual ones, but, with the amount of data held fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- Bidirectional Representations for Low Resource Spoken Language Understanding [39.208462511430554]
We propose a representation model to encode speech in bidirectional rich encodings.
The approach uses a masked language modelling objective to learn the representations.
We show that the performance of the resulting encodings is better than comparable models on multiple datasets.
arXiv Detail & Related papers (2022-11-24T17:05:16Z)
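A minimal sketch of the masked-modelling recipe described in the entry above, assuming the speech has already been discretized into unit IDs; the tiny Transformer, vocabulary size, and 15% masking rate are placeholder choices, not the paper's configuration.

```python
# Illustrative masked-prediction objective over discrete speech units
# (an assumed setup; not the paper's exact model).
import torch
import torch.nn as nn

VOCAB, MASK_ID = 100, 100  # 100 unit IDs plus one [MASK] token

class TinyBidirectionalEncoder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.embed = nn.Embedding(VOCAB + 1, dim)  # units + [MASK]
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, tokens):
        # Positional encodings are omitted for brevity.
        return self.head(self.encoder(self.embed(tokens)))

model = TinyBidirectionalEncoder()
tokens = torch.randint(0, VOCAB, (8, 50))  # batch of pseudo-unit sequences
mask = torch.rand(tokens.shape) < 0.15     # mask ~15% of positions
inputs = tokens.masked_fill(mask, MASK_ID)

logits = model(inputs)
# The loss is computed only at the masked positions.
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
loss.backward()
print(float(loss))
```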
- BenchCLAMP: A Benchmark for Evaluating Language Models on Syntactic and Semantic Parsing [55.058258437125524]
We introduce BenchCLAMP, a Benchmark to evaluate Constrained LAnguage Model Parsing.
We benchmark eight language models, including two GPT-3 variants available only through an API.
Our experiments show that encoder-decoder pretrained language models can achieve similar performance or surpass state-of-the-art methods for syntactic and semantic parsing when the model output is constrained to be valid.
arXiv Detail & Related papers (2022-06-21T18:34:11Z)
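To show the constrained-decoding idea from the entry above in miniature: at each step, tokens that a validity function rules out are removed before selection, so every completed output is well-formed. The toy grammar and scoring function below are invented for illustration and are unrelated to BenchCLAMP's actual grammars.

```python
# Toy constrained decoding: only grammar-legal tokens may be chosen.
import random

VOCAB = ["(", ")", "NP", "VP", "<eos>"]

def allowed(prefix):
    """Hypothetical grammar: well-nested brackets around NP/VP symbols."""
    depth = prefix.count("(") - prefix.count(")")
    ok = {"("}
    if depth > 0:
        ok |= {"NP", "VP", ")"}
    if depth == 0 and prefix:
        ok.add("<eos>")
    return ok

def constrained_decode(score, max_len=12):
    prefix = []
    while len(prefix) < max_len:
        legal = [t for t in VOCAB if t in allowed(prefix)]
        best = max(legal, key=lambda t: score(prefix, t))  # mask, then argmax
        if best == "<eos>":
            break
        prefix.append(best)
    # A real decoder would also force brackets closed at the length limit.
    return prefix

random.seed(0)
# Random scores stand in for language-model logits.
print(constrained_decode(lambda prefix, tok: random.random()))
```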
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- The Interspeech Zero Resource Speech Challenge 2021: Spoken language modelling [19.525392906001624]
We present the Zero Resource Speech Challenge 2021, which asks participants to learn a language model directly from audio, without any text or labels.
The challenge is based on the Libri-light dataset, which provides up to 60k hours of audio from English audio books without any associated text.
arXiv Detail & Related papers (2021-04-29T23:53:37Z)
- Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Translation [127.81351683335143]
Cross-lingual pretraining requires models to align the lexical- and high-level representations of the two languages.
Previous research has shown that unsupervised translation suffers when these representations are not sufficiently aligned.
In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings.
arXiv Detail & Related papers (2021-03-18T21:17:58Z)
- Cross-lingual Spoken Language Understanding with Regularized Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
arXiv Detail & Related papers (2020-09-30T08:56:53Z)
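As a hedged sketch of the general recipe in the entry above, the regularizer below pulls paired cross-lingual sentence encodings together with a simple normalized squared distance added to the task loss; the exact loss and pairing used in the paper may differ.

```python
# Generic representation-alignment regularizer (assumed form, not the
# paper's exact loss): penalize distance between paired encodings.
import torch
import torch.nn.functional as F

def alignment_loss(src_repr, tgt_repr):
    """Mean squared distance between L2-normalized paired representations."""
    src = F.normalize(src_repr, dim=-1)
    tgt = F.normalize(tgt_repr, dim=-1)
    return ((src - tgt) ** 2).sum(dim=-1).mean()

# Placeholder sentence encodings for paired utterances in two languages.
en = torch.randn(32, 256, requires_grad=True)
es = torch.randn(32, 256)
loss = alignment_loss(en, es)  # would be added to the task loss
loss.backward()
print(float(loss))
```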
- Learning Spoken Language Representations with Neural Lattice Language Modeling [39.50831917042577]
We propose a framework that trains neural lattice language models to provide contextualized representations for spoken language understanding tasks.
The proposed two-stage pre-training approach reduces the demand for speech data and improves efficiency.
arXiv Detail & Related papers (2020-07-06T10:38:03Z)