Assessing the Impact of Anisotropy in Neural Representations of Speech: A Case Study on Keyword Spotting
- URL: http://arxiv.org/abs/2506.11096v1
- Date: Fri, 06 Jun 2025 08:52:56 GMT
- Title: Assessing the Impact of Anisotropy in Neural Representations of Speech: A Case Study on Keyword Spotting
- Authors: Guillaume Wisniewski, Séverine Guillaume, Clara Rosina Fernández
- Abstract summary: This work evaluates anisotropy in keyword spotting for computational documentary linguistics. We show that despite anisotropy, wav2vec2 similarity measures effectively identify words without transcription.
- Score: 4.342241136871849
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained speech representations like wav2vec2 and HuBERT exhibit strong anisotropy, leading to high similarity between random embeddings. While widely observed, the impact of this property on downstream tasks remains unclear. This work evaluates anisotropy in keyword spotting for computational documentary linguistics. Using Dynamic Time Warping, we show that despite anisotropy, wav2vec2 similarity measures effectively identify words without transcription. Our results highlight the robustness of these representations, which capture phonetic structures and generalize across speakers, and underscore the importance of pretraining in learning rich and invariant speech representations.
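The Dynamic Time Warping matching mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact setup: the random frame vectors stand in for wav2vec2 frame embeddings, and cosine distance with length normalization is an assumed (common) choice of local cost.

```python
import numpy as np

def cosine_dist(a, b):
    """Cosine distance between two frame vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def dtw_cost(query, candidate):
    """Length-normalized DTW alignment cost between two sequences of
    frame embeddings, each of shape (frames, dim)."""
    n, m = len(query), len(candidate)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_dist(query[i - 1], candidate[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)

# Toy check: a time-stretched copy of the query aligns far better than
# an unrelated sequence -- the basis for spotting keyword occurrences.
rng = np.random.default_rng(0)
query = rng.normal(size=(20, 8))
stretched = np.repeat(query, 2, axis=0)   # same content, slower "speech"
unrelated = rng.normal(size=(40, 8))
assert dtw_cost(query, stretched) < dtw_cost(query, unrelated)
```

A lower normalized cost indicates a likely occurrence of the keyword; in practice the query would be slid over candidate utterances and the cost thresholded.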
Related papers
- Audio-Visual Neural Syntax Acquisition [91.14892278795892]
We study phrase structure induction from visually-grounded speech.
We present the Audio-Visual Neural Syntax Learner (AV-NSL) that learns phrase structure by listening to audio and looking at images, without ever being exposed to text.
arXiv Detail & Related papers (2023-10-11T16:54:57Z)
- An Information-Theoretic Analysis of Self-supervised Discrete Representations of Speech [17.07957283733822]
We develop an information-theoretic framework whereby we represent each phonetic category as a distribution over discrete units.
Our study demonstrates that the entropy of phonetic distributions reflects the variability of the underlying speech sounds.
While our study confirms the lack of direct, one-to-one correspondence, we find an intriguing, indirect relationship between phonetic categories and discrete units.
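The entropy measure described above can be illustrated with a minimal sketch; the distributions here are invented for illustration, not taken from the paper:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (bits) of a distribution over discrete units."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# A phonetic category realized by few discrete units concentrates its
# mass (low entropy); a highly variable one spreads it (high entropy).
stable_phone   = [0.9, 0.05, 0.05]
variable_phone = [0.25, 0.25, 0.25, 0.25]
assert entropy(stable_phone) < entropy(variable_phone)
```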
arXiv Detail & Related papers (2023-06-04T16:52:11Z)
- Learning utterance-level representations through token-level acoustic latents prediction for Expressive Speech Synthesis [3.691712391306624]
We show that the fine-grained latent space also captures coarse-grained information, which is more evident as the dimension of latent space increases in order to capture diverse prosodic representations.
We alleviate this issue by first capturing rich speech attributes in a token-level latent space and then separately training a prior network that, given the input text, learns utterance-level representations to predict the phoneme-level posterior latents extracted in the previous step.
arXiv Detail & Related papers (2022-11-01T15:17:25Z)
- Towards Disentangled Speech Representations [65.7834494783044]
We construct a representation learning task based on joint modeling of ASR and TTS.
We seek to learn a representation of audio that disentangles the part of the speech signal relevant to transcription from the part that is not.
We show that enforcing these properties during training improves WER by 24.5% relative on average for our joint modeling task.
arXiv Detail & Related papers (2022-08-28T10:03:55Z)
- Deep Neural Convolutive Matrix Factorization for Articulatory Representation Decomposition [48.56414496900755]
This work uses a neural implementation of convolutive sparse matrix factorization to decompose the articulatory data into interpretable gestures and gestural scores.
Phoneme recognition experiments were additionally performed to show that gestural scores indeed code phonological information successfully.
arXiv Detail & Related papers (2022-04-01T14:25:19Z)
- Probing Speech Emotion Recognition Transformers for Linguistic Knowledge [7.81884995637243]
We investigate the extent to which linguistic information is exploited during speech emotion recognition fine-tuning.
We synthesise prosodically neutral speech utterances while varying the sentiment of the text.
Valence predictions of the transformer model are very reactive to positive and negative sentiment content, as well as negations, but not to intensifiers or reducers.
arXiv Detail & Related papers (2022-04-01T12:47:45Z)
- A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes.
We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z)
- Unsupervised Multimodal Word Discovery based on Double Articulation Analysis with Co-occurrence cues [7.332652485849632]
Human infants acquire their verbal lexicon with minimal prior knowledge of language.
This study proposes a novel fully unsupervised learning method for discovering speech units.
The proposed method can acquire words and phonemes from speech signals using unsupervised learning.
arXiv Detail & Related papers (2022-01-18T07:31:59Z)
- Do Acoustic Word Embeddings Capture Phonological Similarity? An Empirical Study [12.210797811981173]
In this paper, we ask: does the distance in the acoustic embedding space correlate with phonological dissimilarity?
We train AWE models in controlled settings for two languages (German and Czech) and evaluate the embeddings on two tasks: word discrimination and phonological similarity.
Our experiments show that (1) the distance in the embedding space in the best cases only moderately correlates with phonological distance, and (2) improving the performance on the word discrimination task does not necessarily yield models that better reflect word phonological similarity.
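The kind of correlation analysis this paper performs can be illustrated with a toy version. The "embeddings" below are bag-of-phoneme count vectors, a deliberately crude, hypothetical stand-in for trained acoustic word embeddings; the phoneme inventory and words are invented for illustration.

```python
import numpy as np
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    m, n = len(a), len(b)
    D = np.zeros((m + 1, n + 1), dtype=int)
    D[:, 0] = np.arange(m + 1)
    D[0, :] = np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            D[i, j] = min(D[i - 1, j] + 1, D[i, j - 1] + 1,
                          D[i - 1, j - 1] + cost)
    return int(D[m, n])

def spearman(x, y):
    """Spearman rank correlation via Pearson correlation of ranks."""
    rx = np.argsort(np.argsort(x, kind="stable"), kind="stable")
    ry = np.argsort(np.argsort(y, kind="stable"), kind="stable")
    return np.corrcoef(rx, ry)[0, 1]

# Words as phoneme sequences (invented toy inventory).
words = {
    "cat":  ("k", "ae", "t"),
    "bat":  ("b", "ae", "t"),
    "bath": ("b", "ae", "th"),
    "dog":  ("d", "ao", "g"),
}
phones = sorted({p for seq in words.values() for p in seq})

def bag(seq):
    """Toy 'embedding': phoneme count vector."""
    v = np.zeros(len(phones))
    for p in seq:
        v[phones.index(p)] += 1
    return v

emb_dist, phon_dist = [], []
for (wa, sa), (wb, sb) in combinations(words.items(), 2):
    emb_dist.append(np.linalg.norm(bag(sa) - bag(sb)))
    phon_dist.append(edit_distance(sa, sb))
rho = spearman(np.array(emb_dist), np.array(phon_dist))
```

For real AWE models the paper finds this correlation to be at best moderate, even when word discrimination performance is high.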
arXiv Detail & Related papers (2021-06-16T10:47:56Z)
- Accounting for Agreement Phenomena in Sentence Comprehension with Transformer Language Models: Effects of Similarity-based Interference on Surprisal and Attention [4.103438743479001]
We advance an explanation of similarity-based interference effects in subject-verb and reflexive pronoun agreement processing.
We show that surprisal of the verb or reflexive pronoun predicts facilitatory interference effects in ungrammatical sentences.
arXiv Detail & Related papers (2021-04-26T20:46:54Z)
- Decomposing lexical and compositional syntax and semantics with deep language models [82.81964713263483]
The activations of language transformers like GPT2 have been shown to linearly map onto brain activity during speech comprehension.
Here, we propose a taxonomy to factorize the high-dimensional activations of language models into four classes: lexical, compositional, syntactic, and semantic representations.
The results highlight two findings. First, compositional representations recruit a more widespread cortical network than lexical ones, and encompass the bilateral temporal, parietal and prefrontal cortices.
arXiv Detail & Related papers (2021-03-02T10:24:05Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.