mHuBERT-147: A Compact Multilingual HuBERT Model
- URL: http://arxiv.org/abs/2406.06371v5
- Date: Thu, 21 Nov 2024 10:45:39 GMT
- Title: mHuBERT-147: A Compact Multilingual HuBERT Model
- Authors: Marcely Zanon Boito, Vivek Iyer, Nikolaos Lagos, Laurent Besacier, Ioan Calapodescu
- Abstract summary: mHuBERT-147 is the first general-purpose multilingual HuBERT speech representation model trained on 90K hours of clean, open-license data.
To scale up the multi-iteration HuBERT approach, we use faiss-based clustering, achieving 5.2x faster label assignment than the original method.
Our findings indicate that mHuBERT-147 is a promising model for multilingual speech tasks, offering an unprecedented balance between high performance and parameter efficiency.
- Score: 23.207762084023933
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present mHuBERT-147, the first general-purpose massively multilingual HuBERT speech representation model trained on 90K hours of clean, open-license data. To scale up the multi-iteration HuBERT approach, we use faiss-based clustering, achieving 5.2x faster label assignment than the original method. We also apply a new multilingual batching up-sampling strategy, leveraging both language and dataset diversity. After 3 training iterations, our compact 95M parameter mHuBERT-147 outperforms larger models trained on substantially more data. We rank second and first on the ML-SUPERB 10min and 1h leaderboards, with SOTA scores for 3 tasks. Across ASR/LID tasks, our model consistently surpasses XLS-R (300M params; 436K hours) and demonstrates strong competitiveness against the much larger MMS (1B params; 491K hours). Our findings indicate that mHuBERT-147 is a promising model for multilingual speech tasks, offering an unprecedented balance between high performance and parameter efficiency.
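The label-assignment step mentioned in the abstract can be illustrated with a short, self-contained sketch: frame-level features are clustered with faiss k-means, and each frame is assigned the index of its nearest centroid. This is a minimal illustration, not the authors' pipeline; the reported 5.2x speedup relies on faiss indexing choices not shown here, and the feature dimension, frame count, and cluster count below are placeholders.

```python
# Minimal sketch of faiss-based k-means label assignment for HuBERT-style
# pseudo-labels. Placeholder sizes; not the mHuBERT-147 training code.
import numpy as np
import faiss

def assign_pseudo_labels(features: np.ndarray, n_clusters: int = 500, niter: int = 20) -> np.ndarray:
    """Cluster frame-level features and return one cluster id per frame."""
    feats = np.ascontiguousarray(features, dtype=np.float32)
    d = feats.shape[1]

    # Train k-means with faiss (CPU here; faiss also provides GPU support).
    kmeans = faiss.Kmeans(d, n_clusters, niter=niter, verbose=False)
    kmeans.train(feats)

    # Label assignment: nearest centroid for each frame via the trained index.
    _, labels = kmeans.index.search(feats, 1)
    return labels.ravel()

# Example with dummy MFCC-like features: 10,000 frames of dimension 39.
frames = np.random.rand(10_000, 39).astype(np.float32)
labels = assign_pseudo_labels(frames, n_clusters=100)
```

In the multi-iteration HuBERT recipe, cluster ids like these serve as prediction targets for the next training iteration.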
Related papers
- mmBERT: A Modern Multilingual Encoder with Annealed Language Learning [57.58071656545661]
mmBERT is an encoder-only language model pretrained on 3T tokens of multilingual text. We add over 1700 low-resource languages to the data mix only during the decay phase. We show that mmBERT significantly outperforms the previous generation of models on classification and retrieval tasks.
arXiv Detail & Related papers (2025-09-08T17:08:42Z) - Multilingual JobBERT for Cross-Lingual Job Title Matching [5.284778677072807]
JobBERT-V3 is a contrastive learning-based model for cross-lingual job title matching. Our approach extends support to English, German, Spanish, and Chinese by leveraging synthetic translations. JobBERT-V3 achieves consistent performance across both monolingual and cross-lingual settings.
arXiv Detail & Related papers (2025-07-29T09:06:09Z) - AfriHuBERT: A self-supervised speech representation model for African languages [44.722780475475915]
AfriHuBERT is an extension of mHuBERT-147, a compact self-supervised learning (SSL) model pretrained on 147 languages. While mHuBERT-147 covered 16 African languages, we expand this to 1,226 through continued pretraining on 10K+ hours of speech data from diverse sources. We evaluate AfriHuBERT on two key speech tasks, Spoken Language Identification (SLID) and Automatic Speech Recognition (ASR). Our results show a +3.6% F1 score improvement for SLID and a 2.1% average Word Error Rate (WER) reduction for ASR over mHuBERT-147.
arXiv Detail & Related papers (2024-09-30T11:28:33Z) - Data-Efficient French Language Modeling with CamemBERTa [0.0]
We introduce CamemBERTa, a French DeBERTa model that builds upon the DeBERTaV3 architecture and training objective.
We evaluate our model's performance on a variety of French downstream tasks and datasets.
arXiv Detail & Related papers (2023-06-02T12:45:34Z) - Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages [76.95115818308918]
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages.
This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages.
We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks.
arXiv Detail & Related papers (2023-03-02T07:47:18Z) - Beyond English-Centric Bitexts for Better Multilingual Language
Representation Learning [99.42850643947439]
We show that going beyond English-centric bitexts, coupled with a novel sampling strategy, substantially boosts performance across model sizes.
Our XY-LENT XL variant outperforms XLM-R XXL and exhibits competitive performance with mT5 XXL while being 5x and 6x smaller, respectively.
arXiv Detail & Related papers (2022-10-26T17:16:52Z) - DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with
Gradient-Disentangled Embedding Sharing [117.41016786835452]
This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model.
We show that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance.
We propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics.
arXiv Detail & Related papers (2021-11-18T06:48:00Z) - Learning Compact Metrics for MT [21.408684470261342]
We investigate the trade-off between multilinguality and model capacity with RemBERT, a state-of-the-art multilingual language model.
We show that model size is indeed a bottleneck for cross-lingual transfer, then demonstrate how distillation can help address this bottleneck.
Our method yields up to 10.5% improvement over vanilla fine-tuning and reaches 92.6% of RemBERT's performance using only a third of its parameters.
arXiv Detail & Related papers (2021-10-12T20:39:35Z) - Larger-Scale Transformers for Multilingual Masked Language Modeling [16.592883204398518]
Two new models dubbed XLM-R XL and XLM-R XXL outperform XLM-R by 1.8% and 2.4% average accuracy on XNLI.
Our model also outperforms the RoBERTa-Large model on several English tasks of the GLUE benchmark by 0.3% on average while handling 99 more languages.
arXiv Detail & Related papers (2021-05-02T23:15:02Z) - Evaluating Contextualized Language Models for Hungarian [0.0]
We compare huBERT, a Hungarian model, against four multilingual models, including the multilingual BERT model.
We find that huBERT works better than the other models, often by a large margin, particularly near the global optimum.
arXiv Detail & Related papers (2021-02-22T09:29:01Z) - Multilingual Speech Translation with Efficient Finetuning of Pretrained
Models [82.22294901727933]
A minimalistic LNA (LayerNorm and Attention) finetuning can achieve zero-shot cross-lingual and cross-modality transfer ability (a minimal sketch of this freezing scheme appears after this list).
Our approach demonstrates strong zero-shot performance in a many-to-many multilingual model.
arXiv Detail & Related papers (2020-10-24T08:15:08Z) - DeBERTa: Decoding-enhanced BERT with Disentangled Attention [119.77305080520718]
We propose a new model architecture DeBERTa that improves the BERT and RoBERTa models using two novel techniques.
We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks.
arXiv Detail & Related papers (2020-06-05T19:54:34Z) - InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining [76.32065400614162]
We propose a novel model, InterBERT (BERT for Interaction), the first model in our M6 series of multimodal pretraining methods.
The model has a strong capability for modeling interaction between the information flows of different modalities.
We also propose a large-scale dataset for multi-modal pretraining in Chinese and develop the Chinese InterBERT, the first Chinese multi-modal pretrained model.
arXiv Detail & Related papers (2020-03-30T03:13:22Z) - RobBERT: a Dutch RoBERTa-based Language Model [9.797319790710711]
We use RoBERTa to train a Dutch language model called RobBERT.
We measure its performance on various tasks as well as the importance of the fine-tuning dataset size.
RobBERT improves state-of-the-art results for various tasks, and in particular significantly outperforms other models when dealing with smaller datasets.
arXiv Detail & Related papers (2020-01-17T13:25:44Z)