MauBERT: Universal Phonetic Inductive Biases for Few-Shot Acoustic Units Discovery
- URL: http://arxiv.org/abs/2512.19612v1
- Date: Mon, 22 Dec 2025 17:47:49 GMT
- Title: MauBERT: Universal Phonetic Inductive Biases for Few-Shot Acoustic Units Discovery
- Authors: Angelo Ortiz Tandazo, Manel Khentout, Youssef Benchekroun, Thomas Hueber, Emmanuel Dupoux,
- Abstract summary: MauBERT is a multilingual extension of HuBERT that leverages articulatory features for cross-lingual phonetic representation learning.<n>Our models learn from multilingual data to predict articulatory features or phones, resulting in language-independent representations.<n>We show MauBERT models produce more context-invariant representations than state-of-the-art multilingual self-supervised learning models.
- Score: 12.442501000586995
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces MauBERT, a multilingual extension of HuBERT that leverages articulatory features for robust cross-lingual phonetic representation learning. We continue HuBERT pre-training with supervision based on a phonetic-to-articulatory feature mapping in 55 languages. Our models learn from multilingual data to predict articulatory features or phones, resulting in language-independent representations that capture multilingual phonetic properties. Through comprehensive ABX discriminability testing, we show MauBERT models produce more context-invariant representations than state-of-the-art multilingual self-supervised learning models. Additionally, the models effectively adapt to unseen languages and casual speech with minimal self-supervised fine-tuning (10 hours of speech). This establishes an effective approach for instilling linguistic inductive biases in self-supervised speech models.
Related papers
- GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness [43.67571101152883]
We introduce GOAT-SLM, a novel spoken language model with paralinguistic and speaker characteristic awareness.<n> GOAT-SLM adopts a dual-modality head architecture that decouples linguistic modeling from acoustic realization.<n>We show that GOAT-SLM well-balanced performance across both semantic and non-semantic tasks, and outperforms existing open-source models in handling emotion, dialectal variation, and age-sensitive interactions.
arXiv Detail & Related papers (2025-07-24T06:10:29Z) - EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions [152.41217651729738]
We propose the EMOVA (EMotionally Omni-present Voice Assistant) to enable Large Language Models with end-to-end speech abilities.<n>With a semantic-acoustic disentangled speech tokenizer, we surprisingly notice that omni-modal alignment can further enhance vision-language and speech abilities.<n>For the first time, EMOVA achieves state-of-the-art performance on both the vision-language and speech benchmarks.
arXiv Detail & Related papers (2024-09-26T16:44:02Z) - Language-Universal Phonetic Representation in Multilingual Speech
Pretraining for Low-Resource Speech Recognition [28.21805271848413]
We leverage an International Phonetic Alphabet (IPA) multilingual model to create frame-level pseudo labels for unlabeled speech.
We use these pseudo labels to guide hidden-unit BERT (HuBERT) based speech pretraining in a phonetically-informed manner.
Our approach outperforms most of the state of the arts, with much less pretraining data in terms of hours and language diversity.
arXiv Detail & Related papers (2023-05-19T10:15:11Z) - M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for
Multilingual Speech to Image Retrieval [56.49878599920353]
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.
For non-English image-speech retrieval, we outperform the current state-of-the-art performance by a wide margin both when training separate models for each language, and with a single model which processes speech in all three languages.
arXiv Detail & Related papers (2022-11-02T14:54:45Z) - Distilling a Pretrained Language Model to a Multilingual ASR Model [3.4012007729454816]
We distill the rich knowledge embedded inside a well-trained teacher text model to the student speech model.
We show the superiority of our method on 20 low-resource languages of the CommonVoice dataset with less than 100 hours of speech data.
arXiv Detail & Related papers (2022-06-25T12:36:11Z) - Towards Language Modelling in the Speech Domain Using Sub-word
Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z) - Exploring Teacher-Student Learning Approach for Multi-lingual
Speech-to-Intent Classification [73.5497360800395]
We develop an end-to-end system that supports multiple languages.
We exploit knowledge from a pre-trained multi-lingual natural language processing model.
arXiv Detail & Related papers (2021-09-28T04:43:11Z) - Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z) - That Sounds Familiar: an Analysis of Phonetic Representations Transfer
Across Languages [72.9927937955371]
We use the resources existing in other languages to train a multilingual automatic speech recognition model.
We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting.
Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages.
arXiv Detail & Related papers (2020-05-16T22:28:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.