Empowering Dysarthric Speech: Leveraging Advanced LLMs for Accurate Speech Correction and Multimodal Emotion Analysis
- URL: http://arxiv.org/abs/2410.12867v1
- Date: Sun, 13 Oct 2024 20:54:44 GMT
- Title: Empowering Dysarthric Speech: Leveraging Advanced LLMs for Accurate Speech Correction and Multimodal Emotion Analysis
- Authors: Kaushal Attaluri, Anirudh CHVS, Sireesha Chittepu,
- Abstract summary: This paper introduces a novel approach to recognizing and translating dysarthric speech.
We leverage advanced large language models for accurate speech correction and multimodal emotion analysis.
Our framework identifies emotions such as happiness, sadness, neutrality, surprise, anger, and fear, while reconstructing intended sentences from distorted speech with high accuracy.
- Score: 0.0
- License:
- Abstract: Dysarthria is a motor speech disorder caused by neurological damage that affects the muscles used for speech production, leading to slurred, slow, or difficult-to-understand speech. It affects millions of individuals worldwide, including those with conditions such as stroke, traumatic brain injury, cerebral palsy, Parkinsons disease, and multiple sclerosis. Dysarthria presents a major communication barrier, impacting quality of life and social interaction. This paper introduces a novel approach to recognizing and translating dysarthric speech, empowering individuals with this condition to communicate more effectively. We leverage advanced large language models for accurate speech correction and multimodal emotion analysis. Dysarthric speech is first converted to text using OpenAI Whisper model, followed by sentence prediction using fine-tuned open-source models and benchmark models like GPT-4.o, LLaMA 3.1 70B and Mistral 8x7B on Groq AI accelerators. The dataset used combines the TORGO dataset with Google speech data, manually labeled for emotional context. Our framework identifies emotions such as happiness, sadness, neutrality, surprise, anger, and fear, while reconstructing intended sentences from distorted speech with high accuracy. This approach demonstrates significant advancements in the recognition and interpretation of dysarthric speech.
Related papers
- Geometry of orofacial neuromuscular signals: speech articulation decoding using surface electromyography [0.0]
Millions of individuals lose the ability to speak intelligibly due to neuromuscular disease, stroke, trauma, and head/neck cancer surgery.
Noninvasive surface electromyography (sEMG) has shown promise to restore speech output in these individuals.
The goal is to collect sEMG signals from multiple articulatory sites as people silently produce speech and then decode the signals to enable fluent and natural communication.
arXiv Detail & Related papers (2024-11-04T20:31:22Z) - Accurate synthesis of Dysarthric Speech for ASR data augmentation [5.223856537504927]
Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility.
This paper presents a new dysarthric speech synthesis method for the purpose of ASR training data augmentation.
arXiv Detail & Related papers (2023-08-16T15:42:24Z) - EmoSpeech: Guiding FastSpeech2 Towards Emotional Text to Speech [0.0]
State-of-the-art speech models try to get as close as possible to the human voice.
Modelling emotions is an essential part of Text-To-Speech (TTS) research.
EmoSpeech surpasses existing models regarding both MOS score and emotion recognition accuracy in generated speech.
arXiv Detail & Related papers (2023-06-28T19:34:16Z) - ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech
Synthesis with Diffusion and Style-based Models [83.07390037152963]
ZET-Speech is a zero-shot adaptive emotion-controllable TTS model.
It allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label.
Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers.
arXiv Detail & Related papers (2023-05-23T08:52:00Z) - Cross-lingual Self-Supervised Speech Representations for Improved
Dysarthric Speech Recognition [15.136348385992047]
This study explores the usefulness of using Wav2Vec self-supervised speech representations as features for training an ASR system for dysarthric speech.
We train an acoustic model with features extracted from Wav2Vec, Hubert, and the cross-lingual XLSR model.
Results suggest that speech representations pretrained on large unlabelled data can improve word error rate (WER) performance.
arXiv Detail & Related papers (2022-04-04T17:36:01Z) - Speaker Identity Preservation in Dysarthric Speech Reconstruction by
Adversarial Speaker Adaptation [59.41186714127256]
Dysarthric speech reconstruction (DSR) aims to improve the quality of dysarthric speech.
Speaker encoder (SE) optimized for speaker verification has been explored to control the speaker identity.
We propose a novel multi-task learning strategy, i.e., adversarial speaker adaptation (ASA)
arXiv Detail & Related papers (2022-02-18T08:59:36Z) - Investigation of Data Augmentation Techniques for Disordered Speech
Recognition [69.50670302435174]
This paper investigates a set of data augmentation techniques for disordered speech recognition.
Both normal and disordered speech were exploited in the augmentation process.
The final speaker adapted system constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation produced up to 2.92% absolute word error rate (WER)
arXiv Detail & Related papers (2022-01-14T17:09:22Z) - Textless Speech Emotion Conversion using Decomposed and Discrete
Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
arXiv Detail & Related papers (2021-11-14T18:16:42Z) - EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional
Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset including 9,724 samples with audio files and its emotion human-labeled annotation.
Unlike those models which need additional reference audio as input, our model could predict emotion labels just from the input text and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset by an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z) - Silent Speech and Emotion Recognition from Vocal Tract Shape Dynamics in
Real-Time MRI [9.614694312155798]
We propose a novel deep neural network-based learning framework that understands acoustic information in the variable-length sequence of vocal tract shaping during speech production.
The proposed framework comprises of convolutions, recurrent network, and connectionist temporal classification loss, trained entirely end-to-end.
To the best of our knowledge, this is the first study that demonstrates the recognition of entire spoken sentence based on an individual's arttory motions captured by rtMRI video.
arXiv Detail & Related papers (2021-06-16T11:20:02Z) - "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and the human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.