StressTest: Can YOUR Speech LM Handle the Stress?
- URL: http://arxiv.org/abs/2505.22765v1
- Date: Wed, 28 May 2025 18:32:56 GMT
- Title: StressTest: Can YOUR Speech LM Handle the Stress?
- Authors: Iddo Yosha, Gallil Maimon, Yossi Adi
- Abstract summary: Sentence stress refers to emphasis placed on specific words within a spoken utterance to highlight or contrast an idea, or to introduce new information. Recent advances in speech-aware language models (SLMs) have enabled direct processing of audio. Despite the crucial role of sentence stress in shaping meaning and speaker intent, it remains largely overlooked in the evaluation and development of such models.
- Score: 20.802090523583196
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sentence stress refers to emphasis placed on specific words within a spoken utterance to highlight or contrast an idea, or to introduce new information. It is often used to imply an underlying intention that is not explicitly stated. Recent advances in speech-aware language models (SLMs) have enabled direct processing of audio, allowing models to bypass transcription, access the full richness of the speech signal, and perform audio reasoning tasks such as spoken question answering. Despite the crucial role of sentence stress in shaping meaning and speaker intent, it remains largely overlooked in the evaluation and development of such models. In this work, we address this gap by introducing StressTest, a benchmark specifically designed to evaluate a model's ability to distinguish between interpretations of spoken sentences based on the stress pattern. We assess the performance of several leading SLMs and find that, despite their overall capabilities, they perform poorly on such tasks. To overcome this limitation, we propose a novel synthetic data generation pipeline and create Stress17k, a training set that simulates changes in meaning implied by stress variation. We then empirically show that optimizing models with this synthetic dataset aligns well with real-world recordings and enables effective finetuning of SLMs. Results suggest that our finetuned model, StresSLM, significantly outperforms existing models on both sentence stress reasoning and detection tasks. Code, models, data, and audio samples - pages.cs.huji.ac.il/adiyoss-lab/stresstest.
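The abstract describes benchmark items in which the same sentence, stressed differently, maps to different intended meanings, and models are scored on picking the implied interpretation. The exact StressTest item schema is not given in the abstract; the following is a minimal hypothetical sketch of such an evaluation (the `StressItem` fields, example sentences, and labels are illustrative, not taken from the dataset):

```python
from dataclasses import dataclass, field

@dataclass
class StressItem:
    """One hypothetical benchmark item: the same sentence with stress on a
    particular word, plus candidate interpretations of the utterance."""
    text: str
    stressed_word: str
    interpretations: list = field(default_factory=list)
    label: int = 0  # index of the interpretation implied by the stress

def evaluate(model, items):
    """Accuracy of a model that returns an interpretation index per item."""
    correct = sum(model(it) == it.label for it in items)
    return correct / len(items)

# A toy chance baseline that always picks the first interpretation.
baseline = lambda item: 0

items = [
    StressItem("I never said she took it", "never",
               ["flat denial of ever saying it",
                "someone else may have said it"], 0),
    StressItem("I never said she took it", "she",
               ["I denied saying anything at all",
                "someone else may have taken it"], 1),
]
print(evaluate(baseline, items))  # 0.5
```

The point of such a design is that transcription alone cannot solve the task: both items share the same text, so a model must use the acoustic stress cue to separate the labels.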
Related papers
- Counterfactual reasoning: an analysis of in-context emergence [49.58529868457226]
Large-scale neural language models (LMs) exhibit remarkable performance in in-context learning.
This work studies in-context counterfactual reasoning in language models, that is, predicting the consequences of changes under hypothetical scenarios.
arXiv Detail & Related papers (2025-06-05T16:02:07Z) - WHISTRESS: Enriching Transcriptions with Sentence Stress Detection [20.802090523583196]
Sentence stress is crucial for conveying speaker intent in spoken language.
We introduce WHISTRESS, an alignment-free approach for enhancing transcription systems with sentence stress detection.
We train WHISTRESS on TINYSTRESS-15K and evaluate it against several competitive baselines.
arXiv Detail & Related papers (2025-05-25T11:45:08Z) - SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks [94.10497337235083]
We are the first to explore the potential of prompting speech LMs in the domain of speech processing.
We reformulate speech processing tasks into speech-to-unit generation tasks.
We show that the prompting method can achieve competitive performance compared to the strong fine-tuning method.
arXiv Detail & Related papers (2024-08-23T13:00:10Z) - Detecting Syllable-Level Pronunciation Stress with A Self-Attention Model [0.0]
Knowing the stress level for each syllable of spoken English is important for English speakers and learners.
This paper presents a self-attention model to identify the stress level for each syllable of spoken English.
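The paper's exact architecture is not described in this summary; as a generic illustration of the mechanism involved, here is a minimal scaled dot-product self-attention pass over per-syllable feature vectors (the feature choices and values are hypothetical, and no learned projections or classifier head are included):

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def self_attention(X):
    """Scaled dot-product self-attention over a sequence of feature vectors.
    Q = K = V = X here (no learned projections) to keep the sketch minimal."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in X]
        w = softmax(scores)
        out.append([sum(wj * X[j][i] for j, wj in enumerate(w))
                    for i in range(d)])
    return out

# Toy 3-syllable utterance with 2 acoustic features per syllable
# (e.g. normalized energy and duration; values are made up).
feats = [[0.9, 0.8], [0.1, 0.2], [0.2, 0.1]]
ctx = self_attention(feats)
# A per-syllable classifier (not shown) would map ctx to stress levels.
```

Each output row is a convex combination of the input rows, so every syllable's representation is contextualized by the whole utterance, which is what makes attention a natural fit for relative judgments like stress.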
arXiv Detail & Related papers (2023-11-01T05:05:49Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding [57.42429912884543]
We propose Diff-LM-Speech, Tetra-Diff-Speech and Tri-Diff-Speech to solve high dimensionality and waveform distortion problems.
We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability.
Experimental results show that our proposed methods outperform baseline methods.
arXiv Detail & Related papers (2023-07-28T11:20:23Z) - Adapting an ASR Foundation Model for Spoken Language Assessment [40.402050390096456]
A crucial part of an accurate and reliable spoken language assessment system is the underlying ASR model.
Recently, large-scale pre-trained ASR foundation models such as Whisper have been made available.
These models tend to skip disfluencies and hesitations in their output, whereas spoken language assessment requires a precise transcription of what a candidate said.
arXiv Detail & Related papers (2023-07-13T16:01:58Z) - Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition [66.94463981654216]
We propose prompt tuning methods of Deep Neural Networks (DNNs) for speaker-adaptive Visual Speech Recognition (VSR).
We finetune prompts on adaptation data of target speakers instead of modifying the pre-trained model parameters.
The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases.
arXiv Detail & Related papers (2023-02-16T06:01:31Z) - An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks [112.1942546460814]
We report the first exploration of the prompt tuning paradigm for speech processing tasks based on Generative Spoken Language Model (GSLM).
Experiment results show that the prompt tuning technique achieves competitive performance in speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models.
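The parameter-efficiency claim comes from the core mechanic of prompt tuning: learned prompt vectors are prepended to the frozen model's input sequence, and only those vectors are optimized. The GSLM specifics are not in this summary; the sketch below uses hypothetical sizes (10 prompt vectors, 768-dimensional embeddings, 20 input units) purely to illustrate the parameter count comparison:

```python
def prepend_prompts(prompt_embeddings, unit_embeddings):
    """Prompt tuning sketch: trainable prompt vectors are prepended to the
    frozen model's input embeddings; only the prompts receive gradients."""
    return prompt_embeddings + unit_embeddings

def trainable_param_count(num_prompts, dim):
    # Only the (num_prompts x dim) prompt matrix is optimized;
    # the backbone model's parameters stay frozen.
    return num_prompts * dim

# Hypothetical sizes for illustration.
prompts = [[0.0] * 768 for _ in range(10)]   # 10 learned prompt vectors
units = [[0.0] * 768 for _ in range(20)]     # 20 embedded discrete speech units
seq = prepend_prompts(prompts, units)
print(len(seq), trainable_param_count(10, 768))  # 30 7680
```

Compared with fine-tuning a specialized downstream model, where every backbone parameter is updated, the trainable budget here is only the prompt matrix, which is the source of the "fewer trainable parameters" result noted above.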
arXiv Detail & Related papers (2022-03-31T03:26:55Z) - Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.