StressTest: Can YOUR Speech LM Handle the Stress?
- URL: http://arxiv.org/abs/2505.22765v2
- Date: Sun, 05 Oct 2025 12:21:35 GMT
- Title: StressTest: Can YOUR Speech LM Handle the Stress?
- Authors: Iddo Yosha, Gallil Maimon, Yossi Adi
- Abstract summary: Sentence stress refers to emphasis on words within a spoken utterance to highlight or contrast an idea. We introduce StressTest, a benchmark designed to evaluate models' ability to distinguish between meanings of speech based on the stress pattern. We propose a novel data generation pipeline and create Stress-17k, a training set that simulates the change of meaning implied by stress variation.
- Score: 30.973919141559644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sentence stress refers to emphasis on words within a spoken utterance to highlight or contrast an idea. It is often used to imply an underlying intention not explicitly stated. Recent speech-aware language models (SLMs) have enabled direct audio processing, allowing models to access the full richness of speech to perform audio reasoning tasks such as spoken question answering. Despite the crucial role of sentence stress in shaping meaning and intent, it remains largely overlooked in the evaluation and development of SLMs. We address this gap by introducing StressTest, a benchmark designed to evaluate models' ability to distinguish between meanings of speech based on the stress pattern. We evaluate leading SLMs and find that, despite their overall capabilities, they perform poorly on such tasks. Hence, we propose a novel data generation pipeline and create Stress-17k, a training set that simulates the change of meaning implied by stress variation. Results suggest that our finetuned model, StresSLM, generalizes well to real recordings and notably outperforms existing SLMs on sentence stress reasoning and detection. Models, code, data, samples - pages.cs.huji.ac.il/adiyoss-lab/stresstest.
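The abstract frames StressTest as a forced-choice evaluation: the same sentence, stressed differently, implies different meanings, and the model must pick the interpretation matching the stress. The paper does not publish its item schema, so the following is a minimal sketch of what such an evaluation item and its accuracy metric could look like; the example sentence, interpretations, and field names are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class StressItem:
    """One hypothetical StressTest-style example: the same sentence,
    where the placement of stress selects the intended interpretation."""
    text: str
    stressed_word: str
    interpretations: list[str]
    gold: int  # index of the interpretation implied by the stress

def accuracy(items: list[StressItem], predictions: list[int]) -> float:
    """Fraction of items where the model picked the gold interpretation."""
    correct = sum(1 for item, pred in zip(items, predictions) if pred == item.gold)
    return correct / len(items)

# Two readings of one sentence, disambiguated only by which word is stressed.
items = [
    StressItem(
        text="I never said she stole my money",
        stressed_word="she",
        interpretations=["Someone else stole it", "She stole something else"],
        gold=0,
    ),
    StressItem(
        text="I never said she stole my money",
        stressed_word="money",
        interpretations=["She stole something else", "Someone else stole it"],
        gold=0,
    ),
]

# A real harness would feed audio to a speech-aware LM; predictions are stubbed here.
predictions = [0, 1]
print(accuracy(items, predictions))  # 0.5
```

This chance-level stub illustrates the paper's finding that strong SLMs can still fail stress-based disambiguation even when the text is identical across items.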
Related papers
- How Does a Deep Neural Network Look at Lexical Stress? [7.14461117742142]
A dataset of English disyllabic words was automatically constructed from read and spontaneous speech. CNN architectures were trained to predict stress position from a spectrographic representation of disyllabic words lacking minimal stress pairs. A feature-specific relevance analysis is proposed, and its results suggest that our best-performing classifier is strongly influenced by the stressed vowel's first and second formants.
arXiv Detail & Related papers (2025-08-10T08:13:40Z)
- SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models [60.72029578488467]
SpeechR is a unified benchmark for evaluating reasoning over speech in large audio-language models. It evaluates models along three key dimensions: factual retrieval, procedural inference, and normative judgment. Evaluations on eleven state-of-the-art LALMs reveal that high transcription accuracy does not translate into strong reasoning capabilities.
arXiv Detail & Related papers (2025-08-04T03:28:04Z)
- STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models [131.90117151306993]
Spoken Language Models (SLMs) are designed to take speech inputs and produce spoken responses. Current SLMs lack the ability to perform an internal, unspoken thinking process before responding. We propose Stitch, a novel generation method that alternates between the generation of unspoken reasoning chunks and spoken response chunks.
arXiv Detail & Related papers (2025-07-21T08:30:03Z)
- Word stress in self-supervised speech models: A cross-linguistic comparison [6.552278017383513]
We study word stress representations learned by self-supervised speech models (S3M). We investigate the S3M representations of word stress for five different languages.
arXiv Detail & Related papers (2025-07-07T08:10:26Z)
- Counterfactual reasoning: an analysis of in-context emergence [49.58529868457226]
Large-scale neural language models (LMs) exhibit remarkable performance in in-context learning. This work studies in-context counterfactual reasoning in language models, that is, predicting the consequences of changes under hypothetical scenarios.
arXiv Detail & Related papers (2025-06-05T16:02:07Z)
- WHISTRESS: Enriching Transcriptions with Sentence Stress Detection [20.802090523583196]
Sentence stress is crucial for conveying speaker intent in spoken language. We introduce WHISTRESS, an alignment-free approach for enhancing transcription systems with sentence stress detection. We train WHISTRESS on TINYSTRESS-15K and evaluate it against several competitive baselines.
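Sentence stress detection of the kind WHISTRESS performs amounts to tagging each transcribed word as stressed or not, which is naturally scored with precision/recall-style metrics. The sketch below is a hypothetical word-level F1 scorer; the example sentence and the gold/predicted labels are invented, not taken from the paper's data.

```python
def stress_f1(gold: list[int], pred: list[int]) -> float:
    """Word-level F1 for binary stress labels (1 = stressed).
    gold and pred are equal-length 0/1 sequences over an utterance's words."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# "I NEVER said she stole my money": only "never" carries sentence stress.
gold = [0, 1, 0, 0, 0, 0, 0]
pred = [0, 1, 0, 0, 1, 0, 0]  # the model also (wrongly) flags "stole"
print(round(stress_f1(gold, pred), 3))  # 0.667
```

Because stressed words are rare within an utterance, F1 over the positive class is a more informative score here than plain accuracy.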
arXiv Detail & Related papers (2025-05-25T11:45:08Z)
- SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks [94.10497337235083]
We are the first to explore the potential of prompting speech LMs in the domain of speech processing.
We reformulate speech processing tasks into speech-to-unit generation tasks.
We show that the prompting method can achieve competitive performance compared to the strong fine-tuning method.
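Recasting a speech task as speech-to-unit generation means the prompted LM emits discrete units, and a small mapping (a verbalizer) converts the generated unit sequence into a task label. The sketch below illustrates only that decoding idea; the unit IDs, the bigram-matching rule, and the labels are all invented for illustration and are not SpeechPrompt's actual scheme.

```python
# Toy verbalizer: specific unit bigrams stand in for task labels.
UNIT_TO_LABEL = {
    (12, 87): "positive",
    (44, 3): "negative",
}

def decode_label(generated_units: list[int], verbalizer=UNIT_TO_LABEL):
    """Map the first matching unit bigram in the generated sequence to a label."""
    for i in range(len(generated_units) - 1):
        bigram = (generated_units[i], generated_units[i + 1])
        if bigram in verbalizer:
            return verbalizer[bigram]
    return None  # no label-bearing bigram found

print(decode_label([5, 12, 87, 9]))  # positive
```

The appeal of this framing is that the speech LM itself stays frozen: only the prompt (and a tiny mapping like this) is task-specific.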
arXiv Detail & Related papers (2024-08-23T13:00:10Z)
- Detecting Syllable-Level Pronunciation Stress with A Self-Attention Model [0.0]
Knowing the stress level for each syllable of spoken English is important for English speakers and learners.
This paper presents a self-attention model to identify the stress level for each syllable of spoken English.
arXiv Detail & Related papers (2023-11-01T05:05:49Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding [57.42429912884543]
We propose Diff-LM-Speech, Tetra-Diff-Speech and Tri-Diff-Speech to solve high dimensionality and waveform distortion problems.
We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability.
Experimental results show that our proposed methods outperform baseline methods.
arXiv Detail & Related papers (2023-07-28T11:20:23Z)
- Adapting an ASR Foundation Model for Spoken Language Assessment [40.402050390096456]
A crucial part of an accurate and reliable spoken language assessment system is the underlying ASR model.
Recently, large-scale pre-trained ASR foundation models such as Whisper have been made available.
These models have a tendency to skip disfluencies and hesitations in the output, whereas spoken language assessment requires a precise transcription of what a candidate said.
arXiv Detail & Related papers (2023-07-13T16:01:58Z)
- Speaker Embeddings as Individuality Proxy for Voice Stress Detection [14.332772222772668]
Since the mental states of the speaker modulate speech, stress introduced by cognitive or physical loads could be detected in the voice.
The existing voice stress detection benchmark has shown that the audio embeddings extracted from the Hybrid BYOL-S self-supervised model perform well.
This paper presents the design and development of a voice stress detection system trained on more than 100 speakers from 9 language groups and five different types of stress.
arXiv Detail & Related papers (2023-06-09T14:11:07Z)
- Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition [66.94463981654216]
We propose prompt tuning methods of Deep Neural Networks (DNNs) for speaker-adaptive Visual Speech Recognition (VSR).
We finetune prompts on adaptation data of target speakers instead of modifying the pre-trained model parameters.
The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases.
arXiv Detail & Related papers (2023-02-16T06:01:31Z)
- An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks [112.1942546460814]
We report the first exploration of the prompt tuning paradigm for speech processing tasks based on the Generative Spoken Language Model (GSLM).
Experiment results show that the prompt tuning technique achieves competitive performance in speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models.
arXiv Detail & Related papers (2022-03-31T03:26:55Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.