The Interspeech Zero Resource Speech Challenge 2021: Spoken language modelling
- URL: http://arxiv.org/abs/2104.14700v1
- Date: Thu, 29 Apr 2021 23:53:37 GMT
- Title: The Interspeech Zero Resource Speech Challenge 2021: Spoken language modelling
- Authors: Ewan Dunbar, Mathieu Bernard, Nicolas Hamilakis, Tu Anh Nguyen, Maureen de Seyssel, Patricia Rozé, Morgane Rivière, Eugene Kharitonov, Emmanuel Dupoux
- Abstract summary: We present the Zero Resource Speech Challenge 2021, which asks participants to learn a language model directly from audio, without any text or labels.
The challenge is based on the Libri-light dataset, which provides up to 60k hours of audio from English audio books without any associated text.
- Score: 19.525392906001624
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present the Zero Resource Speech Challenge 2021, which asks participants
to learn a language model directly from audio, without any text or labels. The
challenge is based on the Libri-light dataset, which provides up to 60k hours
of audio from English audio books without any associated text. We provide a
pipeline baseline system consisting of an encoder based on contrastive
predictive coding (CPC), a quantizer ($k$-means) and a standard language model
(BERT or LSTM). The metrics evaluate the learned representations at the
acoustic (ABX discrimination), lexical (spot-the-word), syntactic
(acceptability judgment) and semantic levels (similarity judgment). We present
an overview of the eight submitted systems from four groups and discuss the
main results.
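As a concrete illustration of the baseline pipeline, the sketch below renders the quantization step only. It is a minimal, hypothetical example: random arrays stand in for real CPC frame embeddings, and the codebook size of 50 units and 256-dimensional features are assumptions rather than the challenge's exact configuration.

```python
# Minimal sketch of the baseline's quantization step (assumed shapes;
# real CPC features would come from an encoder trained on Libri-light).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-ins for CPC encoder output: one 256-dim embedding per frame.
utterances = [rng.normal(size=(n, 256)) for n in (120, 95, 210)]

# Fit k-means over all frames to obtain a codebook of discrete units.
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0)
kmeans.fit(np.concatenate(utterances))

# Each utterance becomes a sequence of pseudo-text units, the input on
# which a standard language model (BERT or LSTM in the baseline) is trained.
unit_sequences = [kmeans.predict(u) for u in utterances]
print(unit_sequences[0][:20])
```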
Related papers
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
- Generative Spoken Language Model based on continuous word-sized audio tokens [52.081868603603844]
We introduce a Generative Spoken Language Model based on word-size continuous-valued audio embeddings.
The resulting model is the first generative language model based on word-size continuous embeddings.
arXiv Detail & Related papers (2023-10-08T16:46:14Z)
- Direct Text to Speech Translation System using Acoustic Units [12.36988942647101]
This paper proposes a direct text to speech translation system using discrete acoustic units.
This framework employs text in different source languages as input to generate speech in the target language without the need for text transcriptions in this language.
Results show a remarkable improvement when initialising our proposed architecture with a model pre-trained with more languages.
arXiv Detail & Related papers (2023-09-14T07:35:14Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition for 1909 languages by combining this pipeline with Crubadan, a large n-gram database of endangered languages.
arXiv Detail & Related papers (2022-09-06T22:48:29Z)
- Separate What You Describe: Language-Queried Audio Source Separation [53.65665794338574]
We introduce the task of language-queried audio source separation (LASS).
LASS aims to separate a target source from an audio mixture based on a natural language query of the target source.
We propose LASS-Net, an end-to-end neural network trained to jointly process acoustic and linguistic information.
arXiv Detail & Related papers (2022-03-28T23:47:57Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Jira: a Kurdish Speech Recognition System Designing and Building Speech Corpus and Pronunciation Lexicon [4.226093500082746]
We introduce the first large vocabulary speech recognition system (LVSR) for the Central Kurdish language, named Jira.
The Kurdish language is an Indo-European language spoken by more than 30 million people in several countries.
For the speech corpus, we designed a sentence collection in which the ratio of di-phones resembles that of real Central Kurdish data.
A test set including 11 different document topics is designed and recorded in two corresponding speech conditions.
arXiv Detail & Related papers (2021-02-15T09:27:54Z)
- Generative Spoken Language Modeling from Raw Audio [42.153136032037175]
Generative spoken language modeling involves learning jointly the acoustic and linguistic characteristics of a language from raw audio only (without text or labels).
We introduce metrics to automatically evaluate the generated output in terms of acoustic and linguistic quality in two associated end-to-end tasks.
We test baseline systems consisting of a discrete speech encoder (returning discrete, low-bitrate pseudo-text units), a generative language model (trained on pseudo-text units) and a speech decoder.
arXiv Detail & Related papers (2021-02-01T21:41:40Z)
- The Zero Resource Speech Benchmark 2021: Metrics and baselines for unsupervised spoken language modeling [23.517751578968344]
We introduce a new unsupervised task, spoken language modeling: the learning of linguistic representations from raw audio signals without any labels.
We present the results and analyses of a composite baseline made of self-supervised contrastive representation learning (CPC), clustering (k-means) and language modeling (LSTM or BERT).
This simple pipeline shows better than chance performance on all four metrics, demonstrating the feasibility of spoken language modeling from raw speech.
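To make one of the four metrics concrete, here is a minimal sketch of a single ABX discrimination decision: x should lie closer to a (same category) than to b (a different category). The random features, mean-pooling, and cosine distance are simplifying assumptions; the official evaluation aggregates many such triplets and compares frame-level representations with dynamic time warping.

```python
# Hedged sketch of one ABX comparison over toy, randomly generated
# features; real ABX uses DTW distances over frame-level features.
import numpy as np

def mean_pool(frames):
    # Collapse a (n_frames, dims) representation to a single vector.
    return frames.mean(axis=0)

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def abx_correct(a, b, x):
    # True when x is closer to a (same category) than to b.
    return (cosine_distance(mean_pool(x), mean_pool(a))
            < cosine_distance(mean_pool(x), mean_pool(b)))

rng = np.random.default_rng(1)
centroid_A = rng.normal(size=64)  # hypothetical phone-category centroid
centroid_B = rng.normal(size=64)

def sample_token(centroid, n_frames):
    # A token's frames scatter around its category centroid.
    return centroid + 0.3 * rng.normal(size=(n_frames, 64))

a = sample_token(centroid_A, 30)
x = sample_token(centroid_A, 25)
b = sample_token(centroid_B, 28)

print(abx_correct(a, b, x))  # expected: True for these toy features
```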
arXiv Detail & Related papers (2020-11-23T18:01:37Z)
- The Zero Resource Speech Challenge 2020: Discovering discrete subword and word units [40.41406551797358]
The Zero Resource Speech Challenge 2020 aims at learning speech representations from raw audio signals without any labels.
We present the results of the twenty submitted models and discuss the implications of the main findings for unsupervised speech learning.
arXiv Detail & Related papers (2020-10-12T18:56:48Z)