Syllable Discovery and Cross-Lingual Generalization in a Visually
Grounded, Self-Supervised Speech Model
- URL: http://arxiv.org/abs/2305.11435v2
- Date: Sun, 23 Jul 2023 05:32:05 GMT
- Title: Syllable Discovery and Cross-Lingual Generalization in a Visually
Grounded, Self-Supervised Speech Model
- Authors: Puyuan Peng, Shang-Wen Li, Okko Räsänen, Abdelrahman Mohamed, David Harwath
- Abstract summary: We show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective.
We show that our model not only outperforms a state-of-the-art syllabic segmentation method on the language it was trained on (English), but also generalizes in a zero-shot fashion to Estonian.
- Score: 21.286529902957724
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we show that representations capturing syllabic units emerge
when training a self-supervised speech model with a visually-grounded training
objective. We demonstrate that a nearly identical model architecture (HuBERT)
trained with a masked language modeling loss does not exhibit this same
ability, suggesting that the visual grounding objective is responsible for the
emergence of this phenomenon. We propose the use of a minimum cut algorithm to
automatically predict syllable boundaries in speech, followed by a 2-stage
clustering method to group identical syllables together. We show that our model
not only outperforms a state-of-the-art syllabic segmentation method on the
language it was trained on (English), but also generalizes in a zero-shot
fashion to Estonian. Finally, we show that the same model is capable of
zero-shot generalization for a word segmentation task on 4 other languages from
the Zerospeech Challenge, in some cases beating the previous state-of-the-art.
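As a rough illustration of the segmentation-and-clustering pipeline described above, the sketch below implements a minimum-cut style dynamic program over a frame self-similarity matrix, followed by a two-stage (k-means, then agglomerative) grouping. This is a minimal sketch, not the paper's implementation: `feats` stands in for frame-level features from the visually grounded model, the segment count `n_segs` is assumed known, and both cluster counts are placeholders meant for corpus-scale segment collections.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

def segment_min_cut(feats, n_segs):
    """Choose n_segs contiguous segments maximizing within-segment similarity,
    which is equivalent to minimizing the similarity cut across boundaries."""
    T = feats.shape[0]
    S = feats @ feats.T                        # symmetric frame-frame similarity
    P = np.zeros((T + 1, T + 1))
    P[1:, 1:] = S.cumsum(0).cumsum(1)          # 2-D prefix sums of S

    def block(i, j):                           # sum of S[i:j, i:j]
        return P[j, j] - 2.0 * P[i, j] + P[i, i]

    best = np.full((n_segs + 1, T + 1), -np.inf)
    back = np.zeros((n_segs + 1, T + 1), dtype=int)
    best[0, 0] = 0.0
    for k in range(1, n_segs + 1):             # DP over segment count
        for t in range(k, T + 1):
            for s in range(k - 1, t):
                cand = best[k - 1, s] + block(s, t)
                if cand > best[k, t]:
                    best[k, t], back[k, t] = cand, s
    bounds, t = [T], T                         # backtrace the boundaries
    for k in range(n_segs, 0, -1):
        t = back[k, t]
        bounds.append(t)
    return bounds[::-1]                        # frame indices, e.g. [0, 12, ..., T]

def pool_segments(feats, bounds):
    """Mean-pool frame features within each predicted segment."""
    return np.stack([feats[a:b].mean(0) for a, b in zip(bounds, bounds[1:])])

def cluster_segments(seg_feats, n_fine=100, n_coarse=30):
    """Two-stage grouping over pooled segment features (ideally from a whole
    corpus): k-means first, then agglomerative merging of the centroids."""
    km = KMeans(n_clusters=n_fine, n_init=10).fit(seg_feats)
    coarse = AgglomerativeClustering(n_clusters=n_coarse).fit(km.cluster_centers_)
    return coarse.labels_[km.labels_]          # coarse unit id per segment
```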
Related papers
- Sylber: Syllabic Embedding Representation of Speech from Raw Audio [25.703703711031178]
We propose a new model, Sylber, that produces speech representations with clean and robust syllabic structure.
Specifically, we propose a self-supervised model that regresses features on syllabic segments distilled from a teacher model, itself an exponential moving average of the model being trained.
This results in a highly structured representation of speech features, offering three key benefits: 1) a fast, linear-time syllable segmentation algorithm, 2) efficient syllabic tokenization with an average of 4.27 tokens per second, and 3) syllabic units better suited for lexical and syntactic understanding.
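The exponential-moving-average teacher mentioned above follows a common self-distillation pattern; here is a minimal PyTorch sketch, with the decay value and model interface as assumptions rather than Sylber's actual settings:

```python
import copy
import torch

class EMATeacher:
    """Generic EMA teacher: a frozen copy of the student whose weights track
    an exponential moving average of the student's weights."""

    def __init__(self, student: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        self.teacher = copy.deepcopy(student).eval()
        for p in self.teacher.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, student: torch.nn.Module):
        # teacher <- decay * teacher + (1 - decay) * student, parameter-wise
        for pt, ps in zip(self.teacher.parameters(), student.parameters()):
            pt.mul_(self.decay).add_(ps, alpha=1.0 - self.decay)

    @torch.no_grad()
    def targets(self, wav: torch.Tensor) -> torch.Tensor:
        # teacher outputs serve as regression targets for the student
        return self.teacher(wav)
```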
arXiv Detail & Related papers (2024-10-09T17:59:04Z)
- Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process.
Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
arXiv Detail & Related papers (2024-02-08T16:55:21Z)
- SD-HuBERT: Sentence-Level Self-Distillation Induces Syllabic Organization in HuBERT [49.06057768982775]
We show that syllabic organization emerges when learning sentence-level representations of speech.
We propose a new benchmark task, Spoken Speech ABX, for evaluating sentence-level representation of speech.
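ABX-style evaluations like this reduce to a simple distance comparison: given A and X from the same category and B from a different one, a trial is correct when X lies closer to A than to B. A minimal sketch follows; the cosine distance and precomputed utterance embeddings are assumptions, not the benchmark's exact protocol:

```python
import numpy as np

def cosine_dist(u, v):
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def abx_error(triples):
    """triples: list of (a, b, x) embeddings where x matches a's category.
    A trial fails when x is not strictly closer to a than to b."""
    fails = sum(cosine_dist(a, x) >= cosine_dist(b, x) for a, b, x in triples)
    return fails / len(triples)
```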
arXiv Detail & Related papers (2023-10-16T20:05:36Z)
- Self-Supervised Models of Speech Infer Universal Articulatory Kinematics [44.27187669492598]
We show "inference of articulatory kinematics" as fundamental property of SSL models.
We also show that this abstraction is largely overlapping across the language of the data used to train the model.
We show that with simple affine transformations, Acoustic-to-Articulatory inversion (AAI) is transferrable across speakers, even across genders, languages, and dialects.
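The "simple affine transformations" idea amounts to fitting a linear map with a bias term by least squares between paired feature spaces. A minimal sketch, with the paired matrices `X_src` and `Y_tgt` assumed given for illustration:

```python
import numpy as np

def fit_affine(X_src, Y_tgt):
    """Least-squares affine map: returns W, b with Y_tgt ~ X_src @ W + b."""
    X1 = np.hstack([X_src, np.ones((X_src.shape[0], 1))])  # append bias column
    sol, *_ = np.linalg.lstsq(X1, Y_tgt, rcond=None)
    return sol[:-1], sol[-1]                               # W, b

def apply_affine(X, W, b):
    """Map features from the source space into the target space."""
    return X @ W + b
```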
arXiv Detail & Related papers (2023-10-16T19:50:01Z)
- Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
$\text{EVL}_\text{Gen}$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z)
- Analyzing Acoustic Word Embeddings from Pre-trained Self-supervised Speech Models [30.30385903059709]
HuBERT representations with mean-pooling rival the state of the art on English AWEs.
Despite being trained only on English, HuBERT representations evaluated on Xitsonga, Mandarin, and French consistently outperform the multilingual model XLSR-53.
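Building an acoustic word embedding (AWE) by mean-pooling is straightforward to sketch; in the snippet below, the 50 Hz frame rate matches HuBERT's 20 ms frame stride, and cosine similarity is used for same/different word discrimination:

```python
import numpy as np

def word_embedding(frame_feats, start_s, end_s, frame_rate_hz=50):
    """Mean-pool the frames covering [start_s, end_s) into one fixed-size vector."""
    lo = int(start_s * frame_rate_hz)
    hi = max(lo + 1, int(end_s * frame_rate_hz))
    return frame_feats[lo:hi].mean(axis=0)

def awe_similarity(e1, e2):
    """Cosine similarity between two acoustic word embeddings."""
    return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))
```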
arXiv Detail & Related papers (2022-10-28T10:26:46Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- A Brief Overview of Unsupervised Neural Speech Representation Learning [12.850357461259197]
We review the development of unsupervised representation learning for speech over the last decade.
We identify two primary model categories: self-supervised methods and probabilistic latent variable models.
arXiv Detail & Related papers (2022-03-01T11:15:35Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
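Discrete units of this kind are typically obtained by quantizing self-supervised features with k-means. The sketch below shows that generic recipe; the codebook size and the repeat-collapsing step are assumptions, not this paper's exact configuration:

```python
from sklearn.cluster import KMeans

def learn_units(frame_feats_corpus, n_units=100):
    """Fit a unit codebook over frame features pooled from unlabeled speech."""
    return KMeans(n_clusters=n_units, n_init=10).fit(frame_feats_corpus)

def speech_to_units(codebook, frame_feats):
    """Quantize one utterance to a unit sequence, collapsing repeated units."""
    ids = codebook.predict(frame_feats)
    return [int(u) for i, u in enumerate(ids) if i == 0 or u != ids[i - 1]]
```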
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
- SLM: Learning a Discourse Language Representation with Sentence Unshuffling [53.42814722621715]
We introduce Sentence-level Language Modeling, a new pre-training objective for learning a discourse language representation.
We show that this sentence-unshuffling objective improves the performance of the original BERT by large margins.
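Constructing a sentence-unshuffling training example is simple to sketch: shuffle a document's sentences and keep the permutation as the recovery target. A minimal illustration, not the paper's actual data pipeline:

```python
import random

def make_unshuffle_example(sentences, seed=0):
    """Shuffle sentences; return them with the permutation to recover.
    order[j] is the original position of the j-th shuffled sentence."""
    rng = random.Random(seed)
    order = list(range(len(sentences)))
    rng.shuffle(order)
    return [sentences[i] for i in order], order

# usage: shuffled, target = make_unshuffle_example(doc_sentences)
```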
arXiv Detail & Related papers (2020-10-30T13:33:41Z)