Detecting Syllable-Level Pronunciation Stress with a Self-Attention Model
- URL: http://arxiv.org/abs/2311.00301v1
- Date: Wed, 1 Nov 2023 05:05:49 GMT
- Title: Detecting Syllable-Level Pronunciation Stress with a Self-Attention Model
- Authors: Wang Weiying and Nakajima Akinori
- Abstract summary: Knowing the stress level for each syllable of spoken English is important for English speakers and learners.
This paper presents a self-attention model to identify the stress level for each syllable of spoken English.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One precondition of effective oral communication is that words should be
pronounced clearly, especially for non-native speakers. Word stress is the key
to clear and correct English, and misplacement of syllable stress may lead to
misunderstandings. Thus, knowing the stress level is important for English
speakers and learners. This paper presents a self-attention model to identify
the stress level for each syllable of spoken English. Various prosodic and
categorical features, including the pitch level, intensity, duration and type
of the syllable and its nuclei (the vowel of the syllable), are explored. These
features are input to the self-attention model, and syllable-level stresses are
predicted. The simplest model yields accuracies of over 88% and 93% on
different datasets, while more advanced models provide higher accuracy. Our
study suggests that self-attention models are promising for stress-level
detection. These models could be applied in various scenarios, such as online
meetings and English learning.
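The pipeline the abstract describes — per-syllable prosodic features fed through self-attention to produce a stress label for each syllable — can be sketched in a few lines. The following is a minimal, illustrative sketch, not the paper's implementation: the feature set, dimensions, and randomly initialized weights (standing in for trained parameters) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical per-syllable feature vectors: pitch level, intensity,
# duration, plus categorical indicators for syllable/nucleus type
# (illustrative stand-ins for the features the paper explores).
num_syllables, feat_dim, d_model, num_classes = 6, 8, 16, 3

X = rng.normal(size=(num_syllables, feat_dim))  # one utterance

# Randomly initialized projections stand in for trained parameters.
W_q = rng.normal(size=(feat_dim, d_model))
W_k = rng.normal(size=(feat_dim, d_model))
W_v = rng.normal(size=(feat_dim, d_model))
W_out = rng.normal(size=(d_model, num_classes))

# Scaled dot-product self-attention: each syllable attends to every
# syllable in the utterance, so stress is predicted in context.
Q, K, V = X @ W_q, X @ W_k, X @ W_v
attn = softmax(Q @ K.T / np.sqrt(d_model))  # (syllables, syllables)
context = attn @ V

# Per-syllable stress logits, e.g. 0 = unstressed, 1 = secondary, 2 = primary.
logits = context @ W_out
pred = logits.argmax(axis=1)  # one stress label per syllable
```

In a trained system the projections would be learned end-to-end from labeled syllable stresses, and the attention layer would typically be stacked and combined with feed-forward sublayers; the sketch only shows why self-attention fits the task, since each syllable's stress level depends on its neighbors.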
Related papers
- Word stress in self-supervised speech models: A cross-linguistic comparison [6.552278017383513]
We study word stress representations learned by self-supervised speech models (S3M). We investigate the S3M representations of word stress for five different languages.
arXiv Detail & Related papers (2025-07-07T08:10:26Z) - StressTest: Can YOUR Speech LM Handle the Stress? [20.802090523583196]
Sentence stress refers to emphasis placed on specific words within a spoken utterance to highlight or contrast an idea, or to introduce new information. Recent advances in speech-aware language models (SLMs) have enabled direct processing of audio. Despite the crucial role of sentence stress in shaping meaning and speaker intent, it remains largely overlooked in the evaluation and development of such models.
arXiv Detail & Related papers (2025-05-28T18:32:56Z) - Sylber: Syllabic Embedding Representation of Speech from Raw Audio [25.703703711031178]
We propose a new model, Sylber, that produces speech representations with clean and robust syllabic structure.
Specifically, we propose a self-supervised model that regresses features on syllabic segments distilled from a teacher model which is an exponential moving average of the model in training.
This results in a highly structured representation of speech features, offering three key benefits: 1) a fast, linear-time syllable segmentation algorithm, 2) efficient syllabic tokenization with an average of 4.27 tokens per second, and 3) syllabic units better suited for lexical and syntactic understanding.
arXiv Detail & Related papers (2024-10-09T17:59:04Z) - Exploring Automated Keyword Mnemonics Generation with Large Language Models via Overgenerate-and-Rank [4.383205675898942]
Keyword mnemonics are a technique for memorizing vocabulary through memorable associations with a target word via a verbal cue.
We propose a novel overgenerate-and-rank method via prompting large language models to generate verbal cues.
Results show that LLM-generated mnemonics are comparable to human-generated ones in terms of imageability, coherence, and perceived usefulness.
arXiv Detail & Related papers (2024-09-21T00:00:18Z) - Probing self-attention in self-supervised speech models for cross-linguistic differences [0.0]
We study the self-attention mechanisms of one small self-supervised speech transformer model (TERA).
We find that even with a small model, the learned attention heads are diverse, ranging from almost entirely diagonal to almost entirely global, regardless of the training language.
We highlight some notable differences in attention patterns between Turkish and English and demonstrate that the models do learn important phonological information during pretraining.
arXiv Detail & Related papers (2024-09-04T22:47:33Z) - Speaker Embeddings as Individuality Proxy for Voice Stress Detection [14.332772222772668]
Since the mental states of the speaker modulate speech, stress introduced by cognitive or physical loads could be detected in the voice.
The existing voice stress detection benchmark has shown that the audio embeddings extracted from the Hybrid BYOL-S self-supervised model perform well.
This paper presents the design and development of a voice stress detection system, trained on more than 100 speakers from nine language groups and five different types of stress.
arXiv Detail & Related papers (2023-06-09T14:11:07Z) - Supervised Acoustic Embeddings And Their Transferability Across Languages [2.28438857884398]
In speech recognition, it is essential to model the phonetic content of the input signal while discarding irrelevant factors such as speaker variations and noise.
Self-supervised pre-training has been proposed as a way to improve both supervised and unsupervised speech recognition.
arXiv Detail & Related papers (2023-01-03T09:37:24Z) - M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval [56.49878599920353]
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.
For non-English image-speech retrieval, we outperform the current state-of-the-art performance by a wide margin both when training separate models for each language, and with a single model which processes speech in all three languages.
arXiv Detail & Related papers (2022-11-02T14:54:45Z) - Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z) - Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z) - Are Some Words Worth More than Others? [3.5598388686985354]
We propose two new intrinsic evaluation measures within the framework of a simple word prediction task.
We evaluate several commonly-used large English language models using our proposed metrics.
arXiv Detail & Related papers (2020-10-12T23:12:11Z) - Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations.
We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z) - Towards Zero-shot Learning for Automatic Phonemic Transcription [82.9910512414173]
A more challenging problem is to build phonemic transcribers for languages with zero training data.
Our model is able to recognize unseen phonemes in the target language without any training data.
It achieves a 7.7% better phoneme error rate on average than a standard multilingual model.
arXiv Detail & Related papers (2020-02-26T20:38:42Z) - On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models that fit into the word order of the source language might fail to handle target languages.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.