FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech
- URL: http://arxiv.org/abs/2205.12446v1
- Date: Wed, 25 May 2022 02:29:03 GMT
- Title: FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech
- Authors: Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod,
Siddharth Dalmia, Jason Riesa, Clara Rivera, Ankur Bapna
- Abstract summary: We introduce FLEURS, the Few-shot Learning Evaluation of Universal Representations of Speech benchmark.
FLEURS is an n-way parallel speech dataset in 102 languages built on top of the machine translation FLoRes-101 benchmark.
- Score: 33.71744518887916
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce FLEURS, the Few-shot Learning Evaluation of Universal
Representations of Speech benchmark. FLEURS is an n-way parallel speech dataset
in 102 languages built on top of the machine translation FLoRes-101 benchmark,
with approximately 12 hours of speech supervision per language. FLEURS can be
used for a variety of speech tasks, including Automatic Speech Recognition
(ASR), Speech Language Identification (Speech LangID), Translation and
Retrieval. In this paper, we provide baselines for the tasks based on
multilingual pre-trained models like mSLAM. The goal of FLEURS is to enable
speech technology in more languages and catalyze research in low-resource
speech understanding.
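To make the "n-way parallel" property concrete, here is a minimal sketch in Python. It assumes that each recording carries a sentence ID shared across languages, so aligned audio pairs can be formed for any language pair. The record layout, language codes, file paths, and the helper names `group_by_sentence` and `aligned_pairs` are all hypothetical and are not taken from the FLEURS release.

```python
from collections import defaultdict

# Hypothetical toy records: (sentence_id, language_code, audio_path).
# Field names and paths are illustrative only, not the FLEURS file format.
records = [
    (1, "en", "en/0001.wav"),
    (1, "fr", "fr/0001.wav"),
    (1, "sw", "sw/0001.wav"),
    (2, "en", "en/0002.wav"),
    (2, "fr", "fr/0002.wav"),
]


def group_by_sentence(rows):
    """Map each sentence ID to a {language: audio_path} dictionary."""
    grouped = defaultdict(dict)
    for sentence_id, lang, path in rows:
        grouped[sentence_id][lang] = path
    return dict(grouped)


def aligned_pairs(grouped, src, tgt):
    """Return (src_path, tgt_path) pairs for sentences recorded in both languages."""
    return [
        (langs[src], langs[tgt])
        for _, langs in sorted(grouped.items())
        if src in langs and tgt in langs
    ]


grouped = group_by_sentence(records)
en_fr = aligned_pairs(grouped, "en", "fr")  # both sentences are available
en_sw = aligned_pairs(grouped, "en", "sw")  # only sentence 1 has a Swahili recording
```

Under this assumption, the same grouping could serve several of the tasks listed in the abstract: pairs across languages for translation or retrieval, and the language code alone as a label for Speech LangID.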
Related papers
- Teaching a Multilingual Large Language Model to Understand Multilingual Speech via Multi-Instructional Training [29.47243668154796]
BLOOMZMMS is a novel model that integrates a multilingual LLM with a multilingual speech encoder.
We demonstrate the transferability of linguistic knowledge from the text to the speech modality.
Our zero-shot evaluation results confirm the robustness of our approach across multiple tasks.
arXiv Detail & Related papers (2024-04-16T21:45:59Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
- Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task.
A main ingredient is a new dataset based on readings of publicly available religious texts.
We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
arXiv Detail & Related papers (2023-05-22T22:09:41Z)
- Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages [76.95115818308918]
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages.
This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages.
We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks.
arXiv Detail & Related papers (2023-03-02T07:47:18Z)
- SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations [38.058120432870126]
SpeechMatrix is a large-scale multilingual corpus of speech-to-speech translations.
It contains speech alignments in 136 language pairs with a total of 418 thousand hours of speech.
arXiv Detail & Related papers (2022-11-08T19:09:27Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
- Multitask Training with Text Data for End-to-End Speech Recognition [45.35605825009208]
We propose a multitask training method for attention-based end-to-end speech recognition models.
We regularize the decoder in a listen, attend, and spell model by multitask training it on both audio-text and text-only data.
arXiv Detail & Related papers (2020-10-27T14:29:28Z)