Emphasis Sensitivity in Speech Representations
- URL: http://arxiv.org/abs/2508.11566v1
- Date: Fri, 15 Aug 2025 16:18:47 GMT
- Title: Emphasis Sensitivity in Speech Representations
- Authors: Shaun Cassini, Thomas Hain, Anton Ragni
- Abstract summary: This paper proposes a residual-based framework, defining emphasis as the difference between paired neutral and emphasized word representations. Analysis on self-supervised speech models shows that these residuals correlate strongly with duration changes and perform poorly at word identity prediction. In ASR fine-tuned models, residuals occupy a subspace up to 50% more compact than in pre-trained models.
- Score: 19.211263411383623
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work investigates whether modern speech models are sensitive to prosodic emphasis - whether they encode emphasized and neutral words in systematically different ways. Prior work typically relies on isolated acoustic correlates (e.g., pitch, duration) or label prediction, both of which miss the relational structure of emphasis. This paper proposes a residual-based framework, defining emphasis as the difference between paired neutral and emphasized word representations. Analysis on self-supervised speech models shows that these residuals correlate strongly with duration changes and perform poorly at word identity prediction, indicating a structured, relational encoding of prosodic emphasis. In ASR fine-tuned models, residuals occupy a subspace up to 50% more compact than in pre-trained models, further suggesting that emphasis is encoded as a consistent, low-dimensional transformation that becomes more structured with task-specific learning.
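Read as a method, the abstract describes two measurable quantities: per-word residual vectors (emphasized minus neutral representation) and the compactness of the subspace those residuals span. Below is a minimal sketch of how both might be computed; the mean-pooling over frames, the 90% explained-variance criterion, the feature dimension, and all function names are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA

def word_representation(frame_feats: np.ndarray) -> np.ndarray:
    """Mean-pool frame-level encoder features over one word occurrence.
    frame_feats: (n_frames, dim) array from a speech encoder."""
    return frame_feats.mean(axis=0)

def emphasis_residuals(neutral, emphasized) -> np.ndarray:
    """One residual per word pair: emphasized minus neutral representation.
    Both arguments are lists of (n_frames, dim) arrays, paired by index."""
    return np.stack([
        word_representation(e) - word_representation(n)
        for n, e in zip(neutral, emphasized)
    ])

def subspace_compactness(residuals: np.ndarray, var_threshold: float = 0.90) -> int:
    """Number of principal components needed to reach `var_threshold` of the
    residual variance; fewer components indicates a more compact subspace."""
    pca = PCA().fit(residuals)
    cum_var = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cum_var, var_threshold) + 1)

# Toy usage with random stand-in features (dim 768, common for SSL speech models):
rng = np.random.default_rng(0)
neutral = [rng.normal(size=(20, 768)) for _ in range(100)]
emphasized = [f + rng.normal(scale=0.1, size=f.shape) for f in neutral]
res = emphasis_residuals(neutral, emphasized)
print("components for 90% variance:", subspace_compactness(res))
```

Under these assumptions, comparing the component count between a pre-trained and an ASR fine-tuned encoder is one way the "up to 50% more compact" claim could be operationalized.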
Related papers
- Anatomy of the Modality Gap: Dissecting the Internal States of End-to-End Speech LLMs [15.914430317382077]
We analyze how speech and text representations evolve layer-by-layer.
We find that speech representations exhibit a broad cross-layer alignment band, attributable to the redundant nature of speech.
arXiv Detail & Related papers (2026-03-02T06:21:43Z) - Priors in Time: Missing Inductive Biases for Language Model Interpretability [58.07412640266836]
We show that Sparse Autoencoders impose priors that assume independence of concepts across time, implying stationarity.
We introduce a new interpretability objective -- Temporal Feature Analysis -- which possesses a temporal inductive bias to decompose representations at a given time into two parts.
Our results underscore the need for inductive biases that match the data in designing robust interpretability tools.
arXiv Detail & Related papers (2025-11-03T18:43:48Z) - Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability [31.30541946703775]
Translating internal representations and computations of models into concepts that humans can understand is a key goal of interpretability.
Recent dictionary learning methods such as Sparse Autoencoders provide a promising route to discover human-interpretable features.
But they exhibit a bias towards shallow, token-specific, or noisy features, such as "the phrase 'The' at the start of sentences".
arXiv Detail & Related papers (2025-10-30T17:59:30Z) - New Insights into Optimal Alignment of Acoustic and Linguistic Representations for Knowledge Transfer in ASR [30.00166986946003]
We take a new perspective, regarding alignment and matching as a detection problem.
The goal is to identify meaningful correspondences with high precision and recall, ensuring full coverage of linguistic tokens.
We propose an unbalanced optimal transport-based alignment model that explicitly handles distributional mismatch and structural asymmetries.
arXiv Detail & Related papers (2025-09-06T05:58:52Z) - On the Geometry of Semantics in Next-token Prediction [27.33243506775655]
Modern language models capture linguistic meaning despite being trained solely through next-token prediction.
We investigate how this conceptually simple training objective leads models to extract and encode latent semantic and grammatical concepts.
Our work bridges distributional semantics, neural collapse geometry, and neural network training dynamics, providing insights into how NTP's implicit biases shape the emergence of meaning representations in language models.
arXiv Detail & Related papers (2025-05-13T08:46:04Z) - Residual Speech Embeddings for Tone Classification: Removing Linguistic Content to Enhance Paralinguistic Analysis [2.0499240875882]
We introduce a method for disentangling paralinguistic features from linguistic content by regressing speech embeddings onto their corresponding text embeddings (a minimal sketch of this residual-regression idea appears after this list).
We evaluate this approach across multiple self-supervised speech embeddings, demonstrating that residual embeddings significantly improve tone classification performance.
These findings highlight the potential of residual embeddings for applications in sentiment analysis, speaker characterization, and paralinguistic speech processing.
arXiv Detail & Related papers (2025-02-26T18:32:15Z) - Word-specific tonal realizations in Mandarin [0.9249657468385781]
This study shows that tonal realization is also partially determined by words' meanings.
We first show, on the basis of a corpus of Taiwan Mandarin spontaneous conversations, that word type is a stronger predictor of tonal realization than all the previously established word-form related predictors combined.
We then proceed to show, using computational modeling with context-specific word embeddings, that token-specific pitch contours predict word type with 50% accuracy on held-out data.
arXiv Detail & Related papers (2024-05-11T13:00:35Z) - Spoken Word2Vec: Learning Skipgram Embeddings from Speech [0.8901073744693314]
We show how shallow skipgram-like algorithms fail to encode distributional semantics when the input units are acoustically correlated.
We illustrate the potential of an alternative deep end-to-end variant of the model and examine the effects on the resulting embeddings.
arXiv Detail & Related papers (2023-11-15T19:25:29Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding [57.42429912884543]
We propose Diff-LM-Speech, Tetra-Diff-Speech and Tri-Diff-Speech to solve high dimensionality and waveform distortion problems.
We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability.
Experimental results show that our proposed methods outperform baseline methods.
arXiv Detail & Related papers (2023-07-28T11:20:23Z) - Prototypical Representation Learning for Relation Extraction [56.501332067073065]
This paper aims to learn predictive, interpretable, and robust relation representations from distantly-labeled data.
We learn prototypes for each relation from contextual information to best explore the intrinsic semantics of relations.
Results on several relation learning tasks show that our model significantly outperforms the previous state-of-the-art relational models.
arXiv Detail & Related papers (2021-03-22T08:11:43Z) - Unsupervised Distillation of Syntactic Information from Contextualized Word Representations [62.230491683411536]
We tackle the task of unsupervised disentanglement between semantics and structure in neural language representations.
To this end, we automatically generate groups of sentences which are structurally similar but semantically different.
We demonstrate that our transformation clusters vectors in space by structural properties, rather than by lexical semantics.
arXiv Detail & Related papers (2020-10-11T15:13:18Z) - High-order Semantic Role Labeling [86.29371274587146]
This paper introduces a high-order graph structure for the neural semantic role labeling model.
It enables the model to explicitly consider not only the isolated predicate-argument pairs but also the interaction between the predicate-argument pairs.
Experimental results on 7 languages of the CoNLL-2009 benchmark show that the high-order structural learning techniques are beneficial to the strong performing SRL models.
arXiv Detail & Related papers (2020-10-09T15:33:54Z) - Temporal Embeddings and Transformer Models for Narrative Text Understanding [72.88083067388155]
We present two approaches to narrative text understanding for character relationship modelling.
The temporal evolution of these relations is described by dynamic word embeddings, that are designed to learn semantic changes over time.
A supervised learning approach based on the state-of-the-art transformer model BERT is used instead to detect static relations between characters.
arXiv Detail & Related papers (2020-03-19T14:23:12Z)
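The residual-speech-embeddings entry above describes removing linguistic content by regressing speech embeddings onto text embeddings and keeping the residuals. A minimal sketch of that idea follows, assuming an ordinary least-squares fit between paired utterance-level embeddings; the embedding dimensions, the bias column, and the use of plain linear regression are assumptions for illustration, not details from that paper.

```python
import numpy as np

def residual_embeddings(speech_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Remove the linearly predictable (linguistic) part of speech embeddings.
    speech_emb: (n_utterances, d_speech); text_emb: (n_utterances, d_text).
    Fits speech ~ text @ W by least squares and returns the residuals, which
    retain variation the text cannot explain (e.g. tone or other prosody)."""
    # Append a bias column so the regression also removes the mean offset.
    X = np.hstack([text_emb, np.ones((text_emb.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X, speech_emb, rcond=None)
    return speech_emb - X @ W

# Toy usage with random stand-in embeddings:
rng = np.random.default_rng(0)
text = rng.normal(size=(200, 384))
speech = 0.05 * (text @ rng.normal(size=(384, 768))) + rng.normal(size=(200, 768))
res = residual_embeddings(speech, text)
print(res.shape)  # (200, 768): same shape as the input speech embeddings
```

The residuals can then be fed to any downstream classifier (e.g. for tone or emphasis) in place of the raw speech embeddings.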