SD-HuBERT: Sentence-Level Self-Distillation Induces Syllabic
Organization in HuBERT
- URL: http://arxiv.org/abs/2310.10803v2
- Date: Tue, 16 Jan 2024 05:54:49 GMT
- Title: SD-HuBERT: Sentence-Level Self-Distillation Induces Syllabic
Organization in HuBERT
- Authors: Cheol Jun Cho, Abdelrahman Mohamed, Shang-Wen Li, Alan W Black and
Gopala K. Anumanchipalli
- Abstract summary: We show that a syllabic organization emerges in learning sentence-level representation of speech.
We propose a new benchmark task, Spoken Speech ABX, for evaluating sentence-level representation of speech.
- Score: 49.06057768982775
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data-driven unit discovery in self-supervised learning (SSL) of speech has
embarked on a new era of spoken language processing. Yet, the discovered units
often remain in phonetic space, and units beyond phonemes are largely
underexplored. Here, we demonstrate that a syllabic organization emerges in
learning sentence-level representation of speech. In particular, we adopt a
"self-distillation" objective to fine-tune the pretrained HuBERT with an
aggregator token that summarizes the entire sentence. Without any supervision,
the resulting model draws definite boundaries in speech, and the
representations across frames exhibit salient syllabic structures. We
demonstrate that this emergent structure largely corresponds to the ground
truth syllables. Furthermore, we propose a new benchmark task, Spoken Speech
ABX, for evaluating sentence-level representation of speech. Our model
outperforms previous models in both unsupervised syllable discovery and
sentence-level representation learning. Together, we demonstrate that the
self-distillation of HuBERT gives rise to syllabic organization without relying
on external labels or modalities, and potentially provides novel data-driven
units for spoken language modeling.
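A minimal PyTorch sketch of the fine-tuning setup described in the abstract: a
student encoder with a prepended aggregator token is trained to match the
sentence-level summary of an EMA teacher. The tiny Transformer stands in for
pretrained HuBERT, and the cosine loss, masking, and EMA schedule are
simplifying assumptions rather than the paper's exact recipe.

```python
# Sketch of sentence-level self-distillation with an aggregator token.
# Assumptions: a toy Transformer replaces pretrained HuBERT; the paper's
# input corruption/masking of the student is omitted for brevity.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceEncoder(nn.Module):
    def __init__(self, dim=256, layers=4):
        super().__init__()
        # Learnable aggregator token prepended to the frame sequence.
        self.agg = nn.Parameter(torch.randn(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, feats):  # feats: (batch, frames, dim) speech features
        agg = self.agg.expand(feats.size(0), -1, -1)
        out = self.encoder(torch.cat([agg, feats], dim=1))
        return out[:, 0]  # sentence summary read off the aggregator slot

student = SentenceEncoder()
teacher = copy.deepcopy(student)  # teacher is updated by EMA, not by gradients
for p in teacher.parameters():
    p.requires_grad_(False)

def ema_update(teacher, student, decay=0.999):
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.lerp_(ps, 1.0 - decay)

opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
feats = torch.randn(8, 200, 256)  # stand-in for HuBERT frame features
# Student matches the teacher's sentence-level summary.
loss = 1.0 - F.cosine_similarity(student(feats), teacher(feats)).mean()
opt.zero_grad()
loss.backward()
opt.step()
ema_update(teacher, student)
```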
Related papers
- Sylber: Syllabic Embedding Representation of Speech from Raw Audio [25.703703711031178]
We propose a new model, Sylber, that produces speech representations with clean and robust syllabic structure.
Specifically, we propose a self-supervised model that regresses features on syllabic segments distilled from a teacher model, an exponential moving average of the model being trained.
This results in a highly structured representation of speech features, offering three key benefits: 1) a fast, linear-time syllable segmentation algorithm, 2) efficient syllabic tokenization with an average of 4.27 tokens per second, and 3) syllabic units better suited for lexical and syntactic understanding.
arXiv Detail & Related papers (2024-10-09T17:59:04Z)
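The linear-time segmentation claimed for Sylber above can be illustrated with a
toy boundary detector that cuts wherever adjacent frames are least similar;
this is a hypothetical stand-in, not Sylber's actual algorithm.

```python
# Toy O(T) segmenter: cut where adjacent frame features are least similar.
# Hypothetical illustration only; Sylber's actual algorithm differs.
import torch
import torch.nn.functional as F

def segment(feats, threshold=0.5):
    # feats: (frames, dim), assumed near-constant within a syllable
    sim = F.cosine_similarity(feats[:-1], feats[1:], dim=-1)  # (frames - 1,)
    cuts = (sim < threshold).nonzero().flatten() + 1  # boundary frame indices
    bounds = [0, *cuts.tolist(), feats.size(0)]
    return list(zip(bounds[:-1], bounds[1:]))  # (start, end) frame spans

# Five synthetic "syllables" of 20 frames each, plus noise.
feats = torch.randn(5, 64).repeat_interleave(20, dim=0) + 0.1 * torch.randn(100, 64)
print(segment(feats))  # expected: roughly [(0, 20), (20, 40), ...]
```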
- Self-Supervised Syllable Discovery Based on Speaker-Disentangled HuBERT [10.18337180909434]
Self-supervised speech representation learning has become essential for extracting meaningful features from untranscribed audio.
We propose a speech-only self-supervised fine-tuning approach that separates syllabic units from speaker information.
arXiv Detail & Related papers (2024-09-16T09:07:08Z)
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
Most languages lack sufficient paired speech and text data to effectively train automatic speech recognition systems.
We propose removing the reliance on a phoneme lexicon to develop unsupervised ASR systems.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
- Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process.
Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
arXiv Detail & Related papers (2024-02-08T16:55:21Z)
- Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies.
Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations.
Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length.
arXiv Detail & Related papers (2023-09-27T17:21:13Z)
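A minimal illustration of the de-duplication step mentioned in the entry above:
collapsing runs of repeated frame-level unit IDs shortens the sequence; subword
modeling (e.g., BPE over the collapsed IDs) would compress it further.

```python
# Collapse consecutive repeats of discrete unit IDs (run-length de-duplication).
from itertools import groupby

units = [4, 4, 4, 9, 9, 17, 17, 17, 17, 4]  # made-up frame-level unit IDs
dedup = [k for k, _ in groupby(units)]
print(dedup)  # [4, 9, 17, 4] -- sequence length drops from 10 to 4
```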
- Probing self-supervised speech models for phonetic and phonemic information: a case study in aspiration [17.94683764469626]
We evaluate the extent to which these models' learned representations align with basic representational distinctions made by humans.
We find that robust representations of both phonetic and phonemic distinctions emerge in early layers of these models' architectures.
Our findings show that speech-trained HuBERT derives a low-noise and low-dimensional subspace corresponding to abstract phonological distinctions.
arXiv Detail & Related papers (2023-06-09T20:07:22Z)
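Probing, as in the entry above, is commonly realized as a linear classifier fit
on frozen features; a sketch with random stand-in arrays (real usage would
extract per-segment activations from a HuBERT layer).

```python
# Linear probe sketch: test whether a binary phonetic contrast is linearly
# decodable from frozen features. Random arrays stand in for real activations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 768))  # stand-in for per-segment layer activations
y = rng.integers(0, 2, 500)          # stand-in aspirated/unaspirated labels
probe = LogisticRegression(max_iter=1000).fit(X[:400], y[:400])
print("probe accuracy:", probe.score(X[400:], y[400:]))  # ~0.5 on random data
```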
- Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model [21.286529902957724]
We show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective.
We show that our model not only outperforms a state-of-the-art syllabic segmentation method on the language it was trained on (English), but also generalizes in a zero-shot fashion to Estonian.
arXiv Detail & Related papers (2023-05-19T05:19:04Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- SLM: Learning a Discourse Language Representation with Sentence Unshuffling [53.42814722621715]
We introduce Sentence-level Language Modeling, a new pre-training objective for learning a discourse language representation.
We show that this pre-training objective improves the performance of the original BERT by large margins.
arXiv Detail & Related papers (2020-10-30T13:33:41Z)