What Do Self-Supervised Speech Models Know About Words?
- URL: http://arxiv.org/abs/2307.00162v3
- Date: Wed, 31 Jan 2024 05:00:25 GMT
- Title: What Do Self-Supervised Speech Models Know About Words?
- Authors: Ankita Pasad, Chung-Ming Chien, Shane Settle, Karen Livescu
- Abstract summary: Self-supervised speech models (S3Ms) have been introduced over the last few years, improving performance and data efficiency on various speech tasks.
Recent work has begun analyzing how S3Ms encode certain properties, such as phonetic and speaker information.
We use lightweight analysis methods to study segment-level linguistic properties encoded in S3Ms.
- Score: 23.163029143563893
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many self-supervised speech models (S3Ms) have been introduced over the last
few years, improving performance and data efficiency on various speech tasks.
However, these empirical successes alone do not give a complete picture of what
is learned during pre-training. Recent work has begun analyzing how S3Ms encode
certain properties, such as phonetic and speaker information, but we still lack
a proper understanding of knowledge encoded at the word level and beyond. In
this work, we use lightweight analysis methods to study segment-level
linguistic properties -- word identity, boundaries, pronunciation, syntactic
features, and semantic features -- encoded in S3Ms. We present a comparative
study of layer-wise representations from ten S3Ms and find that (i) the
frame-level representations within each word segment are not all equally
informative, and (ii) the pre-training objective and model size heavily
influence the accessibility and distribution of linguistic information across
layers. We also find that on several tasks -- word discrimination, word
segmentation, and semantic sentence similarity -- S3Ms trained with visual
grounding outperform their speech-only counterparts. Finally, our task-based
analyses demonstrate improved performance on word segmentation and acoustic
word discrimination while using simpler methods than prior work.
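The kind of lightweight, segment-level analysis described above lends itself to a compact illustration. The sketch below mean-pools frame-level representations within a word segment and scores an acoustic word discrimination pair by cosine similarity. The model choice (torchaudio's HUBERT_BASE bundle), file paths, word boundaries, and layer index are all illustrative assumptions, not the paper's exact setup.
```python
import torch
import torchaudio

# Load a pre-trained S3M (HuBERT Base here; torchaudio's wav2vec2/HuBERT
# bundles all expose the same extract_features interface).
bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

FRAME_RATE = 50  # HuBERT emits one frame per 20 ms of 16 kHz audio

def word_embedding(waveform: torch.Tensor, start_s: float, end_s: float,
                   layer: int) -> torch.Tensor:
    """Mean-pool the frames of one S3M layer that fall inside a word segment."""
    with torch.inference_mode():
        # extract_features returns a list with one (1, T, D) tensor per layer
        features, _ = model.extract_features(waveform)
    f0 = int(start_s * FRAME_RATE)
    f1 = max(f0 + 1, int(end_s * FRAME_RATE))
    return features[layer][0, f0:f1].mean(dim=0)

# Acoustic word discrimination: segments of the same word type should be
# closer in representation space than segments of different words.
# Paths and boundaries are placeholders; audio is assumed 16 kHz mono.
wav_a, sr = torchaudio.load("utt_a.wav")
wav_b, _ = torchaudio.load("utt_b.wav")
emb_a = word_embedding(wav_a, start_s=0.41, end_s=0.78, layer=6)
emb_b = word_embedding(wav_b, start_s=1.02, end_s=1.39, layer=6)
similarity = torch.cosine_similarity(emb_a, emb_b, dim=0)
print(f"cosine similarity: {similarity.item():.3f}")
```
Sweeping the layer index and aggregating over many same-word and different-word pairs yields a layer-wise discrimination curve in the spirit of the paper's comparative analysis, though the actual evaluation details differ.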
Related papers
- Self-Supervised Speech Representations are More Phonetic than Semantic [52.02626675137819]
Self-supervised speech models (S3Ms) have become an effective backbone for speech applications.
We seek a more fine-grained analysis of the word-level linguistic properties encoded in S3Ms.
Our study reveals that S3M representations consistently and significantly exhibit more phonetic than semantic similarity.
arXiv Detail & Related papers (2024-06-12T20:04:44Z)
- Few-Shot Spoken Language Understanding via Joint Speech-Text Models [18.193191170754744]
Recent work on speech representation models jointly pre-trained with text has demonstrated the potential to improve speech representations.
We leverage such shared representations to address the persistent challenge of limited data availability in spoken language understanding tasks.
By employing a pre-trained speech-text model, we find that models fine-tuned on text can be effectively transferred to speech test data.
arXiv Detail & Related papers (2023-10-09T17:59:21Z)
- Leverage Points in Modality Shifts: Comparing Language-only and Multimodal Word Representations [0.8594140167290097]
Multimodal embeddings aim to enrich the semantic information in neural representations of language compared to text-only models.
Our paper compares word embeddings from three vision-and-language models and three text-only models, with static and contextual representations.
This is the first large-scale study of the effect of visual grounding on language representations, covering 46 semantic parameters.
arXiv Detail & Related papers (2023-06-04T12:53:12Z)
- SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks [88.4408774253634]
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community.
There are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers.
Recent work has begun to introduce such benchmarks for several tasks.
arXiv Detail & Related papers (2022-12-20T18:39:59Z) - Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods (a contrastive objective is sketched below).
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
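To make the review's "contrastive" category concrete, here is a minimal InfoNCE-style objective of the kind used in wav2vec 2.0-like models: the context vector at a masked position must identify its true target among sampled distractors. The tensor shapes, temperature, and use of cosine similarity are illustrative assumptions rather than the review's formulation.
```python
import torch
import torch.nn.functional as F

def info_nce_loss(context: torch.Tensor, target: torch.Tensor,
                  negatives: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Contrastive loss: pick the true target out of K distractors.

    context:   (B, D) context-network outputs at masked positions
    target:    (B, D) true latent/quantized targets for those positions
    negatives: (B, K, D) distractors sampled from other positions
    """
    # Similarity of each context vector with its true target...
    pos = F.cosine_similarity(context, target, dim=-1)                  # (B,)
    # ...and with each of its K negatives.
    neg = F.cosine_similarity(context.unsqueeze(1), negatives, dim=-1)  # (B, K)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1) / temperature    # (B, 1+K)
    # The correct "class" is always index 0 (the true target).
    labels = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

loss = info_nce_loss(torch.randn(8, 256), torch.randn(8, 256),
                     torch.randn(8, 100, 256))
```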
- Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers (sketched below).
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
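A minimal sketch of the intermediate-layer supervision idea in ILS-SSL above, assuming a generic encoder stack: the same SSL loss is applied at chosen intermediate layers in addition to the final layer. The layer indices, the toy loss, and the wrapper class are hypothetical.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntermediateLayerSupervision(nn.Module):
    """Wrap an encoder stack and sum an SSL loss over selected intermediate
    layers (ILS-SSL style) in addition to the usual loss on the top layer."""

    def __init__(self, encoder_layers: nn.ModuleList, supervised_layers=(3, 7)):
        super().__init__()
        self.layers = encoder_layers
        self.supervised = set(supervised_layers)

    def forward(self, x, targets, ssl_loss):
        total = x.new_zeros(())
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i in self.supervised:          # additional SSL loss on intermediate layers
                total = total + ssl_loss(x, targets)
        return total + ssl_loss(x, targets)   # standard SSL loss on the final layer

# Toy usage: 12 linear "layers" and MSE as a stand-in for the SSL loss.
layers = nn.ModuleList(nn.Linear(16, 16) for _ in range(12))
model = IntermediateLayerSupervision(layers)
loss = model(torch.randn(2, 16), torch.randn(2, 16), F.mse_loss)
loss.backward()
```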
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze an input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- On Vocabulary Reliance in Scene Text Recognition [79.21737876442253]
Methods perform well on images with words within the vocabulary but generalize poorly to images with words outside it.
We call this phenomenon "vocabulary reliance".
We propose a simple yet effective mutual learning strategy that allows models of the two families to learn collaboratively (a generic version is sketched below).
arXiv Detail & Related papers (2020-05-08T11:16:58Z)
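The mutual learning strategy above is described only at a high level; the sketch below implements it in the generic deep-mutual-learning sense, where each of two models adds a KL term pulling its predictions toward the other's. The detached targets and the absence of a temperature are assumptions; the paper's actual strategy for scene text recognizers may differ in detail.
```python
import torch
import torch.nn.functional as F

def mutual_learning_losses(logits_a, logits_b, labels):
    """Each model is trained on its own task loss plus a KL term that pulls
    its predictions toward the other model's (detached) predictions."""
    ce_a = F.cross_entropy(logits_a, labels)
    ce_b = F.cross_entropy(logits_b, labels)
    kl_a = F.kl_div(F.log_softmax(logits_a, dim=-1),
                    F.softmax(logits_b.detach(), dim=-1), reduction="batchmean")
    kl_b = F.kl_div(F.log_softmax(logits_b, dim=-1),
                    F.softmax(logits_a.detach(), dim=-1), reduction="batchmean")
    return ce_a + kl_a, ce_b + kl_b

loss_a, loss_b = mutual_learning_losses(
    torch.randn(4, 10, requires_grad=True),
    torch.randn(4, 10, requires_grad=True),
    torch.randint(0, 10, (4,)))
```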
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.