How Does a Deep Neural Network Look at Lexical Stress?
- URL: http://arxiv.org/abs/2508.07229v2
- Date: Mon, 10 Nov 2025 10:16:06 GMT
- Title: How Does a Deep Neural Network Look at Lexical Stress?
- Authors: Itai Allouche, Itay Asael, Rotem Rousso, Vered Dassa, Ann Bradlow, Seung-Eun Kim, Matthew Goldrick, Joseph Keshet
- Abstract summary: A dataset of English disyllabic words was automatically constructed from read and spontaneous speech. CNN architectures were trained to predict stress position from a spectrographic representation of disyllabic words lacking minimal stress pairs. A feature-specific relevance analysis is proposed, and its results suggest that our best-performing classifier is strongly influenced by the stressed vowel's first and second formants.
- Score: 7.14461117742142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite their success in speech processing, neural networks often operate as black boxes, prompting the question: what informs their decisions, and how can we interpret them? This work examines this issue in the context of lexical stress. A dataset of English disyllabic words was automatically constructed from read and spontaneous speech. Several Convolutional Neural Network (CNN) architectures were trained to predict stress position from a spectrographic representation of disyllabic words lacking minimal stress pairs (e.g., initial stress WAllet, final stress exTEND), achieving up to 92% accuracy on held-out test data. Layerwise Relevance Propagation (LRP), a technique for CNN interpretability analysis, revealed that predictions for held-out minimal pairs (PROtest vs. proTEST) were most strongly influenced by information in stressed versus unstressed syllables, particularly the spectral properties of stressed vowels. However, the classifiers also attended to information throughout the word. A feature-specific relevance analysis is proposed, and its results suggest that our best-performing classifier is strongly influenced by the stressed vowel's first and second formants, with some evidence that its pitch and third formant also contribute. These results reveal deep learning's ability to acquire distributed cues to stress from naturally occurring data, extending traditional phonetic work based around highly controlled stimuli.
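No code ships with this listing; the following is a minimal PyTorch sketch of the kind of setup the abstract describes: a small CNN mapping a spectrogram of a disyllabic word to an initial- vs. final-stress decision, with a gradient-times-input relevance map standing in for Layerwise Relevance Propagation (for ReLU networks with zero biases, LRP-0 reduces to gradient-times-input). All layer sizes, shapes, and names are illustrative assumptions, not the authors' architecture.

```python
# Minimal sketch (not the paper's code): a CNN that maps a spectrogram of
# a disyllabic word to an initial- vs. final-stress decision, plus a
# gradient x input relevance map as a simple stand-in for LRP.
import torch
import torch.nn as nn

class StressCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1, bias=False), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1, bias=False), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(32 * 4 * 4, 2)  # initial vs. final stress

    def forward(self, spec):          # spec: (batch, 1, n_mels, n_frames)
        return self.classifier(self.features(spec).flatten(1))

def relevance_map(model, spec):
    """Gradient x input relevance for the predicted class."""
    spec = spec.clone().requires_grad_(True)
    logits = model(spec)
    logits[0, logits.argmax(dim=1).item()].backward()
    return (spec.grad * spec).detach().squeeze()  # (n_mels, n_frames)

model = StressCNN()
dummy = torch.randn(1, 1, 80, 120)    # one fake 80-mel, 120-frame word
print(relevance_map(model, dummy).shape)
```

High-relevance time-frequency regions from such a map can then be intersected with formant and pitch tracks, which is roughly the shape of the feature-specific relevance analysis the abstract proposes.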
Related papers
- StressTest: Can YOUR Speech LM Handle the Stress? [20.802090523583196]
Sentence stress refers to emphasis placed on specific words within a spoken utterance to highlight or contrast an idea, or to introduce new information. Recent advances in speech-aware language models (SLMs) have enabled direct processing of audio. Despite the crucial role of sentence stress in shaping meaning and speaker intent, it remains largely overlooked in the evaluation and development of such models.
arXiv Detail & Related papers (2025-05-28T18:32:56Z)
- WHISTRESS: Enriching Transcriptions with Sentence Stress Detection [20.802090523583196]
Sentence stress is crucial for conveying speaker intent in spoken language. We introduce WHISTRESS, an alignment-free approach for enhancing transcription systems with sentence stress detection. We train WHISTRESS on TINYSTRESS-15K and evaluate it against several competitive baselines.
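The summary does not spell out the architecture; purely as an illustration (not WHISTRESS itself), a lightweight per-token stress head over a frozen transcription model's hidden states might look like the following, where every dimension is an assumption.

```python
# Hypothetical sketch only: a per-token stress head over hidden states
# from some frozen transcription model. `hidden` would come from that
# model; it is faked here with random numbers.
import torch
import torch.nn as nn

stress_head = nn.Sequential(
    nn.Linear(512, 128), nn.ReLU(),
    nn.Linear(128, 1),                 # one stress logit per token
)

hidden = torch.randn(1, 7, 512)        # (batch, tokens, dim) stand-in
probs = stress_head(hidden).sigmoid()  # P(token carries sentence stress)
print(probs.squeeze(-1))
```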
arXiv Detail & Related papers (2025-05-25T11:45:08Z)
- Surprise! Uniform Information Density Isn't the Whole Story: Predicting Surprisal Contours in Long-form Discourse [54.08750245737734]
We propose that speakers modulate information rate based on location within a hierarchically-structured model of discourse.
We find that hierarchical predictors are significant predictors of a discourse's information contour and that deeply nested hierarchical predictors are more predictive than shallow ones.
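For context (this is background, not the paper's code), the per-word surprisal whose contour is being modeled can be read off any causal language model; a sketch using the Hugging Face GPT-2 checkpoint:

```python
# Per-token surprisal under a causal LM, the raw quantity whose discourse
# contour the paper models with hierarchical predictors (the regression
# itself is not shown here).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The committee postponed the vote again.", return_tensors="pt").input_ids
with torch.no_grad():
    logp = lm(ids).logits.log_softmax(-1)
# Surprisal of token t is -log P(token_t | tokens_<t), in nats here.
surprisal = -logp[0, :-1].gather(1, ids[0, 1:, None]).squeeze(1)
for token, s in zip(tok.convert_ids_to_tokens(ids[0, 1:].tolist()), surprisal):
    print(f"{token:>12}  {s.item():5.2f}")
```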
arXiv Detail & Related papers (2024-10-21T14:42:37Z)
- Deep Neural Convolutive Matrix Factorization for Articulatory Representation Decomposition [48.56414496900755]
This work uses a neural implementation of convolutive sparse matrix factorization to decompose the articulatory data into interpretable gestures and gestural scores.
Phoneme recognition experiments further showed that gestural scores successfully encode phonological information.
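The paper realizes the factorization as a neural network; as a classical reference point only, a toy convolutive NMF with multiplicative updates for the Euclidean cost can be written as follows, with all sizes illustrative:

```python
# Toy convolutive NMF (classical, not the paper's neural implementation):
# V (features x time) ~ sum_t W[t] @ shift(H, t), where each W[t] slice
# holds K basis vectors at lag t and H is the activation "gestural score".
import numpy as np

def shift(X, t):
    """Shift columns by t frames (right if t > 0, left if t < 0)."""
    out = np.zeros_like(X)
    if t == 0:
        out[:] = X
    elif t > 0:
        out[:, t:] = X[:, :-t]
    else:
        out[:, :t] = X[:, -t:]
    return out

def cnmf(V, K=3, T=5, iters=200, eps=1e-9):
    F, N = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((T, F, K))                 # one F x K basis per lag
    H = rng.random((K, N))                    # gestural score
    for _ in range(iters):
        V_hat = sum(W[t] @ shift(H, t) for t in range(T))
        for t in range(T):                    # multiplicative W updates
            Ht = shift(H, t)
            W[t] *= (V @ Ht.T) / (V_hat @ Ht.T + eps)
        V_hat = sum(W[t] @ shift(H, t) for t in range(T))
        num = sum(W[t].T @ shift(V, -t) for t in range(T))
        den = sum(W[t].T @ shift(V_hat, -t) for t in range(T))
        H *= num / (den + eps)                # multiplicative H update
    return W, H

V = np.random.default_rng(1).random((20, 100))  # stand-in articulatory data
W, H = cnmf(V)
print(W.shape, H.shape)                          # (5, 20, 3) (3, 100)
```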
arXiv Detail & Related papers (2022-04-01T14:25:19Z)
- Hybrid Handcrafted and Learnable Audio Representation for Analysis of Speech Under Cognitive and Physical Load [17.394964035035866]
We introduce a set of five datasets for task load detection in speech.
The voice recordings were collected while either cognitive or physical stress was induced in a cohort of volunteers.
We used the datasets to design and evaluate a novel self-supervised audio representation.
arXiv Detail & Related papers (2022-03-30T19:43:21Z)
- Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent spaces during training, we are able to obtain a latent linguistic embedding.
Our experiments show that the voice cloning system built with vector quantization suffers only a small degradation in perceptual evaluations.
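As a concrete anchor (a generic VQ-VAE-style bottleneck, not necessarily the paper's exact formulation), quantizing latent frames against a learned codebook with a straight-through gradient looks like this:

```python
# Generic vector-quantization bottleneck: each latent frame is snapped to
# its nearest codebook entry; the straight-through trick lets gradients
# pass to the encoder. (The full VQ-VAE loss also updates the codebook;
# only the commitment term is shown.)
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQBottleneck(nn.Module):
    def __init__(self, codebook_size=128, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):                       # z: (batch, frames, dim)
        d = (z.unsqueeze(2) - self.codebook.weight).pow(2).sum(-1)
        idx = d.argmin(-1)                      # nearest code per frame
        q = self.codebook(idx)
        commit = F.mse_loss(z, q.detach())      # pull encoder toward codes
        q = z + (q - z).detach()                # straight-through estimator
        return q, idx, commit

vq = VQBottleneck()
q, idx, commit = vq(torch.randn(2, 50, 64))
print(q.shape, idx.shape, commit.item())        # (2, 50, 64) (2, 50) ...
```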
arXiv Detail & Related papers (2021-06-25T07:51:35Z)
- A Tale of Two Lexica: Testing Computational Hypotheses with Deep Convolutional Neural Networks [0.0]
We investigate the existence of two parallel wordform stores: the dorsal and ventral processing streams.
We created two deep convolutional neural networks (CNNs) to test the hypothesis.
Our results are consistent with the hypothesis that the divergent processing demands of the ventral and dorsal processing streams impose computational pressures for the development of multiple lexica.
arXiv Detail & Related papers (2021-04-13T15:03:14Z)
- Enhanced Aspect-Based Sentiment Analysis Models with Progressive Self-supervised Attention Learning [103.0064298630794]
In aspect-based sentiment analysis (ABSA), many neural models are equipped with an attention mechanism to quantify the contribution of each context word to sentiment prediction.
We propose a progressive self-supervised attention learning approach for attentional ABSA models.
We integrate the proposed approach into three state-of-the-art neural ABSA models.
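The progressive self-supervision (iteratively mining attention supervision from the model's own most-attended words) is the paper's contribution; the base mechanism it refines is ordinary aspect-conditioned attention, sketched below with assumed dimensions:

```python
# Generic attention-based ABSA scorer (a simplified stand-in, not the
# paper's progressive scheme): context words are weighted by their
# relevance to the aspect embedding before sentiment prediction.
import torch
import torch.nn as nn

class AspectAttention(nn.Module):
    def __init__(self, dim=128, n_classes=3):
        super().__init__()
        self.proj = nn.Linear(dim, dim)       # maps words into aspect space
        self.out = nn.Linear(dim, n_classes)  # negative / neutral / positive

    def forward(self, words, aspect):
        # words: (batch, seq, dim); aspect: (batch, dim)
        scores = (self.proj(words) * aspect.unsqueeze(1)).sum(-1, keepdim=True)
        attn = scores.softmax(dim=1)                  # (batch, seq, 1)
        context = (attn * words).sum(dim=1)           # weighted word summary
        return self.out(context), attn.squeeze(-1)

model = AspectAttention()
logits, attn = model(torch.randn(2, 10, 128), torch.randn(2, 128))
print(logits.shape, attn.shape)                       # (2, 3) (2, 10)
```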
arXiv Detail & Related papers (2021-03-05T02:50:05Z)
- Be More with Less: Hypergraph Attention Networks for Inductive Text Classification [56.98218530073927]
Graph neural networks (GNNs) have received increasing attention in the research community and have demonstrated promising results on the canonical task of text classification.
Despite this success, their performance can be largely jeopardized in practice because they are unable to capture high-order interactions between words.
We propose a principled model -- hypergraph attention networks (HyperGAT) which can obtain more expressive power with less computational consumption for text representation learning.
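In that spirit, an illustrative skeleton of one hypergraph-attention layer (not the released HyperGAT code) alternates node-to-hyperedge and hyperedge-to-node attentive aggregation over an incidence mask:

```python
# Skeleton of a hypergraph-attention layer: hyperedges attend over their
# member nodes, then nodes attend over their incident hyperedges. `inc`
# is a dense boolean node x edge incidence matrix (illustrative sizes).
import torch
import torch.nn as nn

class HyperAttnLayer(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.w1, self.w2 = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.a1, self.a2 = nn.Linear(dim, 1), nn.Linear(dim, 1)

    @staticmethod
    def attn_pool(x, scores, mask, dim):
        scores = scores.masked_fill(~mask.unsqueeze(-1), float("-inf"))
        return (scores.softmax(dim) * x).sum(dim)

    def forward(self, nodes, inc):   # nodes: (n, d); inc: (n, e) bool
        h = torch.tanh(self.w1(nodes))
        xn = h.unsqueeze(1).expand(-1, inc.size(1), -1)       # (n, e, d)
        edges = self.attn_pool(xn, self.a1(xn), inc, dim=0)   # (e, d)
        g = torch.tanh(self.w2(edges))
        xe = g.unsqueeze(0).expand(inc.size(0), -1, -1)       # (n, e, d)
        return self.attn_pool(xe, self.a2(xe), inc, dim=1)    # (n, d)

nodes = torch.randn(4, 64)                 # e.g. word nodes in a document
inc = torch.tensor([[1, 0, 1],             # 3 hyperedges (e.g. sentences)
                    [1, 1, 0],
                    [0, 1, 1],
                    [1, 0, 1]], dtype=torch.bool)
print(HyperAttnLayer()(nodes, inc).shape)  # (4, 64)
```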
arXiv Detail & Related papers (2020-11-01T00:21:59Z)
- Measuring Memorization Effect in Word-Level Neural Networks Probing [0.9156064716689833]
We propose a simple general method for measuring the memorization effect, based on a symmetric selection of test words seen versus unseen in training.
Our method can be used to explicitly quantify the amount of memorization happening in a probing setup, so that an adequate setup can be chosen and the results of the probing can be interpreted with a reliability estimate.
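A minimal sketch of that protocol, with `train_probe` and `accuracy` left as hypothetical user-supplied callables:

```python
# Sketch of the symmetric seen/unseen protocol described above: word
# types are split in half; one half trains the probe and also supplies
# held-out test tokens ("seen"), the other half appears only at test
# time ("unseen"). `train_probe` and `accuracy` are hypothetical.
import random

def memorization_gap(examples, train_probe, accuracy, seed=0):
    # examples: list of (word_type, features, label) tokens
    types = sorted({w for w, _, _ in examples})
    random.Random(seed).shuffle(types)
    seen = set(types[: len(types) // 2])
    seen_tokens = [e for e in examples if e[0] in seen]
    unseen_test = [e for e in examples if e[0] not in seen]
    random.Random(seed + 1).shuffle(seen_tokens)
    cut = len(seen_tokens) // 2
    probe = train_probe(seen_tokens[:cut])  # probe sees only these types
    seen_test = seen_tokens[cut:]           # same types, unseen tokens
    # A positive gap means the probe does better on word types it saw in
    # training, i.e., part of its "knowledge" is type memorization.
    return accuracy(probe, seen_test) - accuracy(probe, unseen_test)
```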
arXiv Detail & Related papers (2020-06-29T14:35:42Z)
- Syntactic Structure Distillation Pretraining For Bidirectional Encoders [49.483357228441434]
We introduce a knowledge distillation strategy for injecting syntactic biases into BERT pretraining.
We distill the approximate marginal distribution over words in context from the syntactic LM.
Our findings demonstrate the benefits of syntactic biases, even in representation learners that exploit large amounts of data.
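Mechanically, the distillation term is a standard KL objective between student and teacher distributions over words (a generic sketch; in the paper the teacher is a syntactic language model and the student a BERT-style encoder):

```python
# Generic word-level distillation loss: train the student toward the
# teacher's distribution over words in context via a KL term, used
# alongside the usual pretraining objective.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=1.0):
    # both: (batch, positions, vocab); T is a softening temperature
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * T * T

student = torch.randn(2, 5, 1000, requires_grad=True)
teacher = torch.randn(2, 5, 1000)
loss = distillation_loss(student, teacher)
loss.backward()
print(loss.item())
```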
arXiv Detail & Related papers (2020-05-27T16:44:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.