Learning Robust and Multilingual Speech Representations
- URL: http://arxiv.org/abs/2001.11128v1
- Date: Wed, 29 Jan 2020 23:24:56 GMT
- Title: Learning Robust and Multilingual Speech Representations
- Authors: Kazuya Kawakami, Luyu Wang, Chris Dyer, Phil Blunsom, Aaron van den Oord
- Abstract summary: We learn representations from up to 8000 hours of diverse and noisy speech data.
We evaluate the representations by looking at their robustness to domain shifts and their ability to improve recognition performance in many languages.
- Score: 38.34632996576116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unsupervised speech representation learning has shown remarkable success at
finding representations that correlate with phonetic structures and improve
downstream speech recognition performance. However, most research has been
focused on evaluating the representations in terms of their ability to improve
the performance of speech recognition systems on read English (e.g. Wall Street
Journal and LibriSpeech). This evaluation methodology overlooks two important
desiderata that speech representations should have: robustness to domain shifts
and transferability to other languages. In this paper we learn representations
from up to 8000 hours of diverse and noisy speech data and evaluate the
representations by looking at their robustness to domain shifts and their
ability to improve recognition performance in many languages. We find that our
representations confer significant robustness advantages to the resulting
recognition systems: we see significant improvements in out-of-domain transfer
relative to baseline feature sets and the features likewise provide
improvements in 25 phonetically diverse languages including tonal languages and
low-resource languages.
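The representations above come from unsupervised pre-training on unlabeled audio. A common objective in this line of work is a contrastive (InfoNCE-style) loss, where a context vector must identify the true future frame among distractors. The sketch below is a minimal illustration of that idea, not the paper's exact model; the function name and the use of raw dot-product scores are assumptions for clarity.

```python
import numpy as np

def info_nce_loss(context, future, negatives):
    """InfoNCE-style contrastive loss (illustrative sketch).

    Scores the true future frame against K negative (distractor)
    frames given a context vector, using dot products as scores.
    """
    candidates = np.vstack([future[None, :], negatives])  # (1+K, d); true frame at index 0
    scores = candidates @ context                          # (1+K,) similarity scores
    scores = scores - scores.max()                         # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())      # log-softmax over candidates
    return -log_probs[0]                                   # cross-entropy with target index 0
```

A well-trained encoder makes the true future frame score higher than the distractors, driving this loss toward zero.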
Related papers
- Improving Cross-Lingual Phonetic Representation of Low-Resource Languages Through Language Similarity Analysis [7.751856268560216]
This paper examines how linguistic similarity affects cross-lingual phonetic representation in speech processing for low-resource languages.
Using phonologically similar languages consistently achieves a relative improvement of 55.6% over monolingual training.
arXiv Detail & Related papers (2025-01-12T13:29:24Z)
- Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling [50.62091603179394]
Whisper, one of the most advanced ASR models, handles 99 languages effectively.
However, Whisper struggles with unseen languages, those not included in its pre-training.
We propose methods that exploit these relationships to enhance ASR performance on unseen languages.
arXiv Detail & Related papers (2024-12-21T04:05:43Z)
- DASB -- Discrete Audio and Speech Benchmark [12.02056212008393]
We release the Discrete Audio and Speech Benchmark (DASB), a leaderboard for benchmarking discrete audio tokens across a range of tasks.
Our results show that, on average, semantic tokens outperform compression tokens across most discriminative and generative tasks.
However, the performance gap between semantic tokens and standard continuous representations remains substantial.
arXiv Detail & Related papers (2024-06-20T13:23:27Z)
- Exploring the Benefits of Tokenization of Discrete Acoustic Units [4.591279524925446]
Tokenization algorithms merge the units of a base vocabulary into larger, variable-rate units.
We demonstrate that tokenization yields significant improvements in terms of performance, as well as training and inference speed.
arXiv Detail & Related papers (2024-06-08T18:34:28Z)
- Mitigating the Linguistic Gap with Phonemic Representations for Robust Cross-lingual Transfer [26.014079273740485]
Approaches to improving multilingual language understanding often struggle with significant performance gaps between high-resource and low-resource languages.
We present experiments on three representative cross-lingual tasks on 12 languages in total.
Phonemic representations exhibit higher similarities between languages compared to orthographic representations.
arXiv Detail & Related papers (2024-02-22T04:41:52Z)
- The Effect of Spoken Language on Speech Enhancement using Self-Supervised Speech Representation Loss Functions [21.237026538221404]
This work looks at the relationship between the language of the audio used to train self-supervised representation and that used to train the SE system.
Enhancement models trained with a loss function incorporating a self-supervised representation whose training language exactly matches that of the noisy SE training data outperform models where the languages do not match.
It is found that the training language of the self-supervised representation appears to have a minor effect on enhancement performance.
arXiv Detail & Related papers (2023-07-27T09:20:38Z)
- Label Aware Speech Representation Learning For Language Identification [49.197215416945596]
We propose a novel framework of combining self-supervised representation learning with the language label information for the pre-training task.
This framework, termed as Label Aware Speech Representation (LASR) learning, uses a triplet based objective function to incorporate language labels along with the self-supervised loss function.
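A triplet objective of the kind LASR describes pulls an anchor embedding toward an embedding of the same language and pushes it away from one of a different language by at least a margin. The following is a minimal sketch of such an objective, assuming squared Euclidean distances and a hypothetical margin value; it is not the paper's exact formulation.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Triplet objective (illustrative sketch).

    anchor/positive share a language label; negative has a different
    label. The loss is zero once the positive is closer than the
    negative by at least `margin` (squared Euclidean distance).
    """
    d_pos = np.sum((anchor - positive) ** 2)  # distance to same-language embedding
    d_neg = np.sum((anchor - negative) ** 2)  # distance to other-language embedding
    return max(0.0, d_pos - d_neg + margin)   # hinge: penalize violations only
```

In LASR this term would be added to the self-supervised loss rather than used alone.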
arXiv Detail & Related papers (2023-06-07T12:14:16Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Cross-lingual Transfer for Speech Processing using Acoustic Language Similarity [81.51206991542242]
Cross-lingual transfer offers a compelling way to help bridge the digital divide between high-resource and low-resource languages.
Current cross-lingual algorithms have shown success in text-based tasks and speech-related tasks over some low-resource languages.
We propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages.
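Selecting a transfer pair by acoustic similarity can be sketched as ranking candidate source languages by the cosine similarity of their mean acoustic embeddings to the target language's embedding. This is a simplified illustration of the general idea, not the paper's method; the function name and the use of a single mean embedding per language are assumptions.

```python
import numpy as np

def nearest_source_language(target_emb, source_embs):
    """Pick the source language whose mean acoustic embedding is most
    similar (cosine similarity) to the target language's embedding."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    sims = {lang: cos(target_emb, emb) for lang, emb in source_embs.items()}
    return max(sims, key=sims.get)  # language with the highest similarity
```

In practice one would rank all candidates rather than return only the top match, and scale this over hundreds of languages.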
arXiv Detail & Related papers (2021-11-02T01:55:17Z)
- Leveraging neural representations for facilitating access to untranscribed speech from endangered languages [10.61744395262441]
We use data selected from 7 Australian Aboriginal languages and a regional variety of Dutch.
We find that representations from the middle layers of the wav2vec 2.0 Transformer offer large gains in task performance.
While features extracted using the pre-trained English model yielded improved detection on all the evaluation languages, better detection performance was associated with the evaluation language's phonological similarity to English.
arXiv Detail & Related papers (2021-03-26T16:44:08Z)
- Self-Supervised Representations Improve End-to-End Speech Translation [57.641761472372814]
We show that self-supervised pre-trained features can consistently improve the translation performance.
Cross-lingual transfer allows extending to a variety of languages with little or no tuning.
arXiv Detail & Related papers (2020-06-22T10:28:38Z)
- Meta-Transfer Learning for Code-Switched Speech Recognition [72.84247387728999]
We propose a new learning method, meta-transfer learning, to transfer learn on a code-switched speech recognition system in a low-resource setting.
Our model learns to recognize individual languages and transfers this knowledge to better recognize mixed-language speech by conditioning the optimization on code-switching data.
arXiv Detail & Related papers (2020-04-29T14:27:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.