Self-Supervised Models of Speech Infer Universal Articulatory Kinematics
- URL: http://arxiv.org/abs/2310.10788v2
- Date: Tue, 16 Jan 2024 08:09:15 GMT
- Title: Self-Supervised Models of Speech Infer Universal Articulatory Kinematics
- Authors: Cheol Jun Cho, Abdelrahman Mohamed, Alan W Black and Gopala K.
Anumanchipalli
- Abstract summary: We show that "inference of articulatory kinematics" is a fundamental property of SSL models.
We also show that this abstraction largely overlaps across the languages of the data used to train the model.
We show that with simple affine transformations, Acoustic-to-Articulatory Inversion (AAI) is transferable across speakers, even across genders, languages, and dialects.
- Score: 44.27187669492598
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-Supervised Learning (SSL) based models of speech have shown remarkable
performance on a range of downstream tasks. These state-of-the-art models have
remained black boxes, but many recent studies have begun "probing" models like
HuBERT to correlate their internal representations with different aspects of
speech. In this paper, we show "inference of articulatory kinematics" to be a
fundamental property of SSL models, i.e., the ability of these models to
transform acoustics into the causal articulatory dynamics underlying the speech
signal. We also show that this abstraction largely overlaps across the
languages of the data used to train the model, with a preference for languages
with similar phonological systems. Furthermore, we show that with simple affine
transformations, Acoustic-to-Articulatory Inversion (AAI) is transferable
across speakers, even across genders, languages, and dialects, demonstrating
the generalizability of this property. Together, these results shed new light
on the internals of SSL models that are critical to their superior performance,
and open up new avenues toward language-agnostic, universal models for speech
engineering that are interpretable and grounded in speech science.
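The cross-speaker transfer result is concrete enough to sketch in code. Below is a minimal illustration on synthetic data (the array names, shapes, and the use of Ridge regression are assumptions for exposition, not the paper's released code): an affine probe maps frozen SSL features to articulator trajectories, and only a small second affine map is refit for a new speaker.

```python
# Sketch of affine AAI probing and cross-speaker transfer (synthetic data).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Stand-ins for frozen HuBERT features (T frames x D dims) and 12-channel
# EMA articulator traces for two speakers.
T, D, A = 2000, 768, 12
W_true = rng.normal(size=(D, A)) * 0.1        # unknown acoustics-to-articulator map
feats_src = rng.normal(size=(T, D))
ema_src = feats_src @ W_true + rng.normal(scale=0.05, size=(T, A))

# 1) Fit an affine AAI probe on the source speaker (linear map + intercept).
probe = Ridge(alpha=1.0).fit(feats_src, ema_src)

# 2) Simulate a target speaker whose articulatory space is an affine
#    distortion of the source speaker's (anatomical differences).
M = np.eye(A) + 0.1 * rng.normal(size=(A, A))
b = rng.normal(scale=0.1, size=A)
feats_tgt = rng.normal(size=(T, D))
ema_tgt = (feats_tgt @ W_true) @ M + b + rng.normal(scale=0.05, size=(T, A))

# 3) Keep the probe frozen; fit only a small (A x A + bias) affine map from
#    its predictions into the target speaker's articulatory space.
pred_tgt = probe.predict(feats_tgt)
transfer = Ridge(alpha=1.0).fit(pred_tgt, ema_tgt)

corr = np.corrcoef(transfer.predict(pred_tgt).ravel(), ema_tgt.ravel())[0, 1]
print(f"cross-speaker correlation after affine transfer: {corr:.3f}")
```

The design point mirrored here is that only the low-dimensional articulatory-space map is refit per speaker; the feature-to-articulator probe itself stays frozen.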
Related papers
- Developing Acoustic Models for Automatic Speech Recognition in Swedish [6.5458610824731664]
This paper is concerned with automatic continuous speech recognition using trainable systems.
The aim of this work is to build acoustic models for spoken Swedish.
arXiv Detail & Related papers (2024-04-25T12:03:14Z)
- Leveraging the Interplay Between Syntactic and Acoustic Cues for Optimizing Korean TTS Pause Formation [6.225927189801006]
We propose a novel framework that comprehensively models both the syntactic and acoustic cues associated with pausing patterns.
Remarkably, our framework can consistently generate natural speech even for considerably longer and more intricate out-of-domain (OOD) sentences.
arXiv Detail & Related papers (2024-04-03T09:17:38Z)
- SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [56.913182262166316]
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z)
- Probing self-supervised speech models for phonetic and phonemic information: a case study in aspiration [17.94683764469626]
We evaluate the extent to which these models' learned representations align with basic representational distinctions made by humans.
We find that robust representations of both phonetic and phonemic distinctions emerge in early layers of these models' architectures.
Our findings show that speech-trained HuBERT derives a low-noise and low-dimensional subspace corresponding to abstract phonological distinctions.
arXiv Detail & Related papers (2023-06-09T20:07:22Z)
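The layer-wise probing methodology used in the case study above is easy to sketch. The toy example below (synthetic features and labels; the layer count and dimensions are placeholders for a frozen model such as HuBERT) trains a linear classifier per layer to locate where a contrast like aspiration becomes linearly separable:

```python
# Layer-wise linear probing sketch on synthetic activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_layers, n_frames, dim = 12, 1000, 768
labels = rng.integers(0, 2, size=n_frames)   # e.g., aspirated vs. unaspirated stop

for layer in range(n_layers):
    # Stand-in for frame-level activations from one transformer layer;
    # the class signal is made to grow with depth for illustration.
    feats = rng.normal(size=(n_frames, dim)) + labels[:, None] * 0.05 * layer
    acc = cross_val_score(LogisticRegression(max_iter=1000),
                          feats, labels, cv=5).mean()
    print(f"layer {layer:2d}: probe accuracy {acc:.3f}")
```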
- MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels for masked language modelling (MLM) style acoustic pre-training.
Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
arXiv Detail & Related papers (2023-05-31T18:27:43Z)
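MERT's teacher-guided, MLM-style acoustic pre-training reduces to a compact objective. The sketch below (random features and pseudo-labels; module sizes are invented, and MERT's actual teachers and losses are richer) masks frames and trains a student encoder to predict a teacher's discrete codes at the masked positions:

```python
# Masked-prediction pre-training with teacher pseudo-labels (toy example).
import torch
import torch.nn as nn

B, T, D, K = 4, 100, 256, 500        # batch, frames, feature dim, codebook size

student = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(D, K)               # predicts the teacher's discrete codes

feats = torch.randn(B, T, D)         # acoustic input features
with torch.no_grad():
    pseudo = torch.randint(0, K, (B, T))  # stand-in for teacher-assigned labels

# Mask a random subset of frames; the loss is computed only there.
mask = torch.rand(B, T) < 0.3
masked = feats.clone()
masked[mask] = 0.0

logits = head(student(masked))
loss = nn.functional.cross_entropy(logits[mask], pseudo[mask])
loss.backward()
print(f"masked-prediction loss: {loss.item():.3f}")
```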
- Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations [30.293081541301746]
Self-supervised learning (SSL) speech models such as wav2vec and HuBERT have demonstrated state-of-the-art performance on automatic speech recognition.
We argue that the problem is caused by the lack of disentangled representations and an utterance-level learning objective.
Our models outperform the current best model, WavLM, on all utterance-level non-semantic tasks on the SUPERB benchmark with only 20% of labeled data.
arXiv Detail & Related papers (2023-05-14T08:26:24Z)
- The Ability of Self-Supervised Speech Models for Audio Representations [53.19715501273934]
Self-supervised learning (SSL) speech models have achieved unprecedented success in speech representation learning.
We conduct extensive experiments on abundant speech and non-speech audio datasets to evaluate the representation ability of state-of-the-art SSL speech models.
Results show that SSL speech models can extract meaningful features from a wide range of non-speech audio, though they may fail on certain types of datasets.
arXiv Detail & Related papers (2022-09-26T15:21:06Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
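Studies in this vein typically fit a regularized linear "encoding model" from a network's activations to measured brain responses and score it on held-out data. A minimal sketch with simulated responses (the voxel count, dimensions, and data are invented for illustration):

```python
# Ridge encoding model from model features to simulated cortical responses.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
T, D, V = 3000, 768, 50                  # time points, feature dim, voxels

feats = rng.normal(size=(T, D))          # stand-in for SSL-model activations
W = rng.normal(size=(D, V)) * 0.05
bold = feats @ W + rng.normal(size=(T, V))   # simulated voxel responses

Xtr, Xte, ytr, yte = train_test_split(feats, bold, test_size=0.2, random_state=0)
enc = RidgeCV(alphas=np.logspace(0, 4, 9)).fit(Xtr, ytr)

# Score each voxel by correlating predictions with held-out responses.
pred = enc.predict(Xte)
r = [np.corrcoef(pred[:, v], yte[:, v])[0, 1] for v in range(V)]
print(f"median held-out voxel correlation: {np.median(r):.3f}")
```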
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM built on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
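A generative LM over phoneme-like units of the kind proposed above comes down to a small sequence model. Here is a minimal sketch (the unit inventory and layer sizes are hypothetical, and the paper's auxiliary text-LM and articulatory objectives are omitted):

```python
# Phoneme-level LSTM language model (toy sizes).
import torch
import torch.nn as nn

VOCAB, EMB, HID = 64, 128, 256           # unit inventory, embedding, hidden size

class UnitLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.lstm = nn.LSTM(EMB, HID, num_layers=2, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.out(h)               # next-unit logits at every step

model = UnitLM()
seq = torch.randint(0, VOCAB, (8, 50))   # batch of unit-ID sequences
logits = model(seq[:, :-1])              # predict each following unit
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
loss.backward()
print(f"next-unit cross-entropy: {loss.item():.3f}")
```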
- The Grammar-Learning Trajectories of Neural Language Models [42.32479280480742]
We show that neural language models acquire linguistic phenomena in a similar order, despite having different end performances over the data.
Results suggest that NLMs exhibit consistent "developmental" stages.
arXiv Detail & Related papers (2021-09-13T16:17:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.