Joint Prediction and Denoising for Large-scale Multilingual
Self-supervised Learning
- URL: http://arxiv.org/abs/2309.15317v2
- Date: Thu, 28 Sep 2023 02:28:16 GMT
- Title: Joint Prediction and Denoising for Large-scale Multilingual
Self-supervised Learning
- Authors: William Chen, Jiatong Shi, Brian Yan, Dan Berrebbi, Wangyou Zhang,
Yifan Peng, Xuankai Chang, Soumi Maiti, Shinji Watanabe
- Abstract summary: We show that more powerful techniques can lead to more efficient pre-training, opening SSL to more research groups.
We propose WavLabLM, which extends WavLM's joint prediction and denoising to 40k hours of data across 136 languages.
We show that further efficiency can be achieved with a vanilla HuBERT Base model, which can maintain 94% of XLS-R's performance with only 3% of the data.
- Score: 69.77973092264338
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multilingual self-supervised learning (SSL) has often lagged behind
state-of-the-art (SOTA) methods due to the expenses and complexity required to
handle many languages. This further harms the reproducibility of SSL, which is
already limited to a few research groups due to its resource usage. We show that
more powerful techniques can actually lead to more efficient pre-training,
opening SSL to more research groups. We propose WavLabLM, which extends WavLM's
joint prediction and denoising to 40k hours of data across 136 languages. To
build WavLabLM, we devise a novel multi-stage pre-training method, designed to
address the language imbalance of multilingual data. WavLabLM achieves
comparable performance to XLS-R on ML-SUPERB with less than 10% of the training
data, making SSL realizable with academic compute. We show that further
efficiency can be achieved with a vanilla HuBERT Base model, which can maintain
94% of XLS-R's performance with only 3% of the data, 4 GPUs, and limited
trials. We open-source all code and models in ESPnet.
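The core objective named in the abstract, WavLM-style joint prediction and denoising, trains the model on noisy or overlapped speech while asking it to predict discrete pseudo-labels of the clean target at masked frames. The snippet below is a minimal PyTorch sketch of that idea, not the authors' released ESPnet code; the encoder interface, the number of clusters, and the SNR value are illustrative assumptions.

```python
# Minimal sketch (an assumption, not the released WavLabLM/ESPnet code) of a
# WavLM-style joint prediction and denoising objective: the input is a noisy
# mixture, but the targets are pseudo-labels of the clean utterance.
import torch
import torch.nn as nn
import torch.nn.functional as F


def mix_utterances(clean: torch.Tensor, interference: torch.Tensor,
                   snr_db: float = 5.0) -> torch.Tensor:
    """Overlay an interfering waveform onto the clean target at a given SNR (dB)."""
    clean_power = clean.pow(2).mean(dim=-1, keepdim=True)
    interf_power = interference.pow(2).mean(dim=-1, keepdim=True) + 1e-8
    scale = torch.sqrt(clean_power / (interf_power * 10 ** (snr_db / 10)))
    return clean + scale * interference


class JointPredictionDenoising(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int, num_clusters: int = 500):
        super().__init__()
        self.encoder = encoder                 # e.g. a CNN + Transformer speech encoder
        self.head = nn.Linear(feat_dim, num_clusters)

    def forward(self, noisy_wave, clean_targets, mask):
        """
        noisy_wave:    (B, T_samples) mixture produced by mix_utterances
        clean_targets: (B, T_frames) pseudo-labels (e.g. k-means IDs of the clean speech)
        mask:          (B, T_frames) bool, True where frames were masked before encoding
        """
        feats = self.encoder(noisy_wave)       # assumed to return (B, T_frames, feat_dim)
        logits = self.head(feats)              # (B, T_frames, num_clusters)
        # Loss only on masked frames, against labels of the clean target, so the
        # model must jointly denoise the mixture and predict the masked content.
        return F.cross_entropy(logits[mask], clean_targets[mask])
```

Computing the loss only at masked positions, against pseudo-labels derived from the clean signal, is what couples prediction with denoising: the encoder has to suppress the interference to recover the target's labels.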
Related papers
- Towards Robust Speech Representation Learning for Thousands of Languages [77.2890285555615]
Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data.
We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages.
arXiv Detail & Related papers (2024-06-30T21:40:26Z)
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, effectively being crosslingual?
This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages [0.0]
Large Language Models (LLMs) have shown incredible proficiency at natural language processing tasks.
LLMs often struggle to perform well on low-resource languages because there is so little training data available.
In this work, we explore training LLaMA-2 to speak Amharic, a language spoken by over 50 million people worldwide.
arXiv Detail & Related papers (2024-03-11T01:04:36Z)
- Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages [60.162717568496355]
Large language models (LLMs) have been pre-trained on multilingual corpora.
However, their performance in most languages still lags behind that in a few resource-rich languages.
arXiv Detail & Related papers (2024-02-19T15:07:32Z)
- UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-tuning Dataset [69.33424532827608]
Open-source large language models (LLMs) have gained significant strength across diverse fields.
In this work, we construct an open-source multilingual supervised fine-tuning dataset.
The resulting UltraLink dataset comprises approximately 1 million samples across five languages.
arXiv Detail & Related papers (2024-02-07T05:05:53Z)
- PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training (a sketch of such a schedule appears after this list).
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
- How do languages influence each other? Studying cross-lingual data sharing during LM fine-tuning [14.02101305717738]
Multilingual large language models (MLLMs) are jointly trained on data from many different languages.
It remains unclear to what extent, and under which conditions, languages rely on each other's data.
We find that MLLMs rely on data from multiple languages from the early stages of fine-tuning and that this reliance gradually increases as fine-tuning progresses.
arXiv Detail & Related papers (2023-05-22T17:47:41Z)
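The PolyLM entry above describes a curriculum that raises the share of non-English data from 30% in the first stage to 60% in the final stage. Below is a minimal sketch of one way such a stage-wise sampling schedule could look; the function names and the linear interpolation across stages are illustrative assumptions, not taken from the PolyLM paper.

```python
# Minimal sketch (an assumption, not PolyLM's released code) of a stage-wise
# curriculum that raises the non-English sampling share from 30% to 60%.
import random


def non_english_share(stage: int, num_stages: int,
                      start: float = 0.30, end: float = 0.60) -> float:
    """Interpolate the non-English data proportion from the first to the final stage."""
    if num_stages <= 1:
        return end
    return start + (end - start) * (stage / (num_stages - 1))


def sample_batch_languages(stage, num_stages, batch_size, non_english_langs):
    """Pick a language per example: English, or a uniformly drawn non-English language."""
    p = non_english_share(stage, num_stages)
    return [random.choice(non_english_langs) if random.random() < p else "en"
            for _ in range(batch_size)]


# Example: in a 3-stage schedule the non-English share grows 0.30 -> 0.45 -> 0.60.
langs = sample_batch_languages(stage=2, num_stages=3, batch_size=8,
                               non_english_langs=["zh", "es", "ar", "ru"])
```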