Probing self-supervised speech models for phonetic and phonemic
information: a case study in aspiration
- URL: http://arxiv.org/abs/2306.06232v1
- Date: Fri, 9 Jun 2023 20:07:22 GMT
- Title: Probing self-supervised speech models for phonetic and phonemic
information: a case study in aspiration
- Authors: Kinan Martin, Jon Gauthier, Canaan Breiss, Roger Levy
- Abstract summary: We evaluate the extent to which these models' learned representations align with basic representational distinctions made by humans.
We find that robust representations of both phonetic and phonemic distinctions emerge in early layers of these models' architectures.
Our findings show that speech-trained HuBERT derives a low-noise and low-dimensional subspace corresponding to abstract phonological distinctions.
- Score: 17.94683764469626
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Textless self-supervised speech models have grown in capabilities in recent
years, but the nature of the linguistic information they encode has not yet
been thoroughly examined. We evaluate the extent to which these models' learned
representations align with basic representational distinctions made by humans,
focusing on a set of phonetic (low-level) and phonemic (more abstract)
contrasts instantiated in word-initial stops. We find that robust
representations of both phonetic and phonemic distinctions emerge in early
layers of these models' architectures, and are preserved in the principal
components of deeper layer representations. Our analyses suggest two sources
for this success: some can only be explained by the optimization of the models
on speech data, while some can be attributed to these models' high-dimensional
architectures. Our findings show that speech-trained HuBERT derives a low-noise
and low-dimensional subspace corresponding to abstract phonological
distinctions.
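To make the probing methodology concrete, the following is a minimal sketch of a layer-wise linear-probe analysis of the kind the abstract describes. All data here are random stand-ins, and the probe and dimensionality choices are illustrative assumptions rather than the authors' exact pipeline.

```python
# Minimal sketch of a linear-probe analysis in the spirit of the abstract.
# The data are random stand-ins; the authors' stimuli, feature extraction,
# layer choice, and probe details may differ.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Stand-ins for representations taken from one transformer layer of a model
# such as HuBERT: one row per word-initial stop token, with y marking the
# contrast of interest (e.g., 1 = aspirated, 0 = unaspirated).
X = rng.normal(size=(200, 768))   # hypothetical 768-dim layer features
y = rng.integers(0, 2, size=200)  # binary phonetic/phonemic label

# A linear probe: if a simple classifier separates the classes from the
# representations alone, the distinction is linearly decodable in that layer.
probe = LogisticRegression(max_iter=1000)
print("probe accuracy:", cross_val_score(probe, X, y, cv=5).mean())

# The abstract also reports that the distinctions survive in the principal
# components of deeper layers; probing a low-dimensional PCA projection
# tests exactly that.
pca_probe = make_pipeline(PCA(n_components=16),
                          LogisticRegression(max_iter=1000))
print("accuracy on top-16 PCs:", cross_val_score(pca_probe, X, y, cv=5).mean())
```

Repeating such a probe on each layer's representations yields the layer-wise emergence profile the abstract refers to.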
Related papers
- Sylber: Syllabic Embedding Representation of Speech from Raw Audio [25.703703711031178]
We propose a new model, Sylber, that produces speech representations with clean and robust syllabic structure.
Specifically, we propose a self-supervised model that regresses features on syllabic segments distilled from a teacher model, which is an exponential moving average of the model being trained.
This results in a highly structured representation of speech features, offering three key benefits: 1) a fast, linear-time syllable segmentation algorithm, 2) efficient syllabic tokenization with an average of 4.27 tokens per second, and 3) syllabic units better suited for lexical and syntactic understanding.
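As a concrete illustration of the teacher described above, here is a hedged sketch of a standard exponential-moving-average (EMA) teacher update in PyTorch; the module, dimensions, and decay value are illustrative assumptions, not Sylber's actual code.

```python
# Hedged sketch of an EMA teacher: the teacher's parameters track a decayed
# running average of the student's and are never updated by gradients.
import copy
import torch
import torch.nn as nn

student = nn.Linear(768, 256)      # stand-in for the full speech encoder
teacher = copy.deepcopy(student)   # teacher starts as a copy of the student
for p in teacher.parameters():
    p.requires_grad_(False)        # no gradient updates for the teacher

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    # teacher <- decay * teacher + (1 - decay) * student, parameter-wise
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1 - decay)

# In training, the student regresses onto targets produced by the teacher,
# and ema_update(...) is called after each optimizer step.
ema_update(teacher, student)
```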
arXiv Detail & Related papers (2024-10-09T17:59:04Z)
- Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process.
Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
arXiv Detail & Related papers (2024-02-08T16:55:21Z)
- SD-HuBERT: Sentence-Level Self-Distillation Induces Syllabic Organization in HuBERT [49.06057768982775]
We show that a syllabic organization emerges when learning sentence-level representations of speech.
We propose a new benchmark task, Spoken Speech ABX, for evaluating sentence-level representation of speech.
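For intuition, the following is a minimal sketch of an ABX-style discrimination test on synthetic data with Euclidean distance; the proposed Spoken Speech ABX benchmark's actual protocol and distance measure may differ.

```python
# Sketch of an ABX discrimination test: X belongs to the same category as A,
# so a well-organized representation space should place X nearer to A than
# to B. Data and distance here are illustrative.
import numpy as np

def abx_score(a_reps, b_reps, x_reps):
    """Fraction of (A, B, X) triples where X is closer to A than to B."""
    correct = 0
    for a, b, x in zip(a_reps, b_reps, x_reps):
        d_ax = np.linalg.norm(a - x)   # cosine distance is also common
        d_bx = np.linalg.norm(b - x)
        correct += d_ax < d_bx
    return correct / len(x_reps)

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(100, 64))
x = a + rng.normal(0.0, 0.3, size=(100, 64))   # same category as A
b = rng.normal(1.0, 1.0, size=(100, 64))       # different category
print(f"ABX accuracy: {abx_score(a, b, x):.2f}")  # near 1.0 by construction
```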
arXiv Detail & Related papers (2023-10-16T20:05:36Z)
- Self-Supervised Models of Speech Infer Universal Articulatory Kinematics [44.27187669492598]
We show "inference of articulatory kinematics" as fundamental property of SSL models.
We also show that this abstraction is largely overlapping across the language of the data used to train the model.
We show that with simple affine transformations, Acoustic-to-Articulatory inversion (AAI) is transferrable across speakers, even across genders, languages, and dialects.
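To illustrate the affine-transfer idea, here is a small sketch that fits an affine map between two representation spaces by least squares on synthetic data; the paper's AAI models and data are of course much richer.

```python
# Sketch of a "simple affine transformation": learn a linear map plus bias
# from one speaker's representations to articulatory targets via least
# squares. All data here are simulated for illustration.
import numpy as np

rng = np.random.default_rng(0)
src = rng.normal(size=(500, 64))        # source-speaker representations
W_true = 0.1 * rng.normal(size=(64, 12))
# Simulated articulatory traces (12 channels) with a small additive offset.
tgt = src @ W_true + 0.5 + rng.normal(0, 0.01, size=(500, 12))

# Fit tgt ~= [src, 1] @ M, i.e. an affine map, with one lstsq call.
X_aug = np.hstack([src, np.ones((len(src), 1))])
M, *_ = np.linalg.lstsq(X_aug, tgt, rcond=None)

pred = X_aug @ M
print("mean absolute error:", np.abs(pred - tgt).mean())
```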
arXiv Detail & Related papers (2023-10-16T19:50:01Z)
- Wave to Syntax: Probing spoken language models for syntax [16.643072915927313]
We focus on the encoding of syntax in several self-supervised and visually grounded models of spoken language.
We show that syntax is captured most prominently in the middle layers of the networks, and more explicitly within models with more parameters.
arXiv Detail & Related papers (2023-05-30T11:43:18Z)
- Integrating Form and Meaning: A Multi-Task Learning Model for Acoustic Word Embeddings [19.195728241989702]
We propose a multi-task learning model that incorporates top-down lexical knowledge into the training procedure of acoustic word embeddings.
We experiment with three languages and demonstrate that incorporating lexical knowledge improves the embedding space discriminability.
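A hedged sketch of what such a multi-task objective can look like in PyTorch follows: an embedding-space (triplet) term plus a word-identity classification term supplying the top-down lexical knowledge. The architecture, losses, and weighting are illustrative assumptions, not the paper's exact model.

```python
# Illustrative multi-task acoustic word embedding (AWE) model: a shared
# encoder feeds both an embedding-space loss and a lexical classifier.
import torch
import torch.nn as nn

class MultiTaskAWE(nn.Module):
    def __init__(self, feat_dim=39, emb_dim=128, vocab_size=1000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, emb_dim, batch_first=True)
        self.word_head = nn.Linear(emb_dim, vocab_size)  # lexical task

    def forward(self, frames):
        _, h = self.encoder(frames)   # h: (1, batch, emb_dim)
        emb = h.squeeze(0)            # acoustic word embedding
        return emb, self.word_head(emb)

model = MultiTaskAWE()
frames = torch.randn(8, 50, 39)           # 8 words, 50 frames of features
words = torch.randint(0, 1000, (8,))      # word identities (lexical labels)

emb_a, logits_a = model(frames)                               # anchors
emb_p, _ = model(frames + 0.01 * torch.randn_like(frames))    # same words
emb_n, _ = model(torch.randn(8, 50, 39))                      # other words

# Total loss mixes the embedding-space term with the lexical term.
triplet = nn.TripletMarginLoss(margin=0.5)(emb_a, emb_p, emb_n)
lexical = nn.functional.cross_entropy(logits_a, words)
loss = 0.5 * triplet + 0.5 * lexical      # weighting is illustrative
```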
arXiv Detail & Related papers (2022-09-14T13:33:04Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in the human cortex.
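The summary does not spell out how model features are mapped to cortical responses; a common encoding-model recipe consistent with it is regularized regression per recording channel, sketched below on simulated data as an assumed, not confirmed, stand-in for the paper's pipeline.

```python
# Hedged sketch of an encoding model: ridge-regress each simulated cortical
# channel's response onto features from one layer of a speech model.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 768))     # layer features per time point
responses = 0.05 * (features @ rng.normal(size=(768, 50))) \
            + rng.normal(size=(1000, 50))   # 50 simulated channels

X_tr, X_te, y_tr, y_te = train_test_split(features, responses, random_state=0)
enc = RidgeCV(alphas=np.logspace(-2, 4, 7)).fit(X_tr, y_tr)

# Per-channel prediction correlation is the usual "how well does this layer
# explain this response" score.
pred = enc.predict(X_te)
corr = [np.corrcoef(pred[:, i], y_te[:, i])[0, 1] for i in range(y_te.shape[1])]
print(f"median encoding correlation: {np.median(corr):.3f}")
```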
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
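A minimal sketch of the intermediate-layer supervision idea: compute the same target-prediction loss at a middle layer as at the top. Layer count, tap index, and target vocabulary below are illustrative assumptions, not ILS-SSL's exact configuration.

```python
# Sketch of adding an extra SSL loss on an intermediate transformer layer,
# pushing content information into earlier layers.
import torch
import torch.nn as nn

layers = nn.ModuleList([
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
    for _ in range(12)
])
head_final = nn.Linear(768, 504)  # predicts SSL targets (e.g., cluster ids)
head_mid = nn.Linear(768, 504)    # extra head on an intermediate layer

x = torch.randn(4, 100, 768)               # (batch, frames, dim)
targets = torch.randint(0, 504, (4, 100))  # pseudo-labels per frame

hidden, mid_out = x, None
for i, layer in enumerate(layers):
    hidden = layer(hidden)
    if i == 5:              # tap an intermediate layer (index is illustrative)
        mid_out = hidden

ce = nn.CrossEntropyLoss()
loss_final = ce(head_final(hidden).transpose(1, 2), targets)
loss_mid = ce(head_mid(mid_out).transpose(1, 2), targets)
loss = loss_final + loss_mid   # the auxiliary term is the ILS idea
```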
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study presents an assessment of existing language models for distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)