Gesture-Aware Zero-Shot Speech Recognition for Patients with Language Disorders
- URL: http://arxiv.org/abs/2502.13983v1
- Date: Tue, 18 Feb 2025 14:15:55 GMT
- Title: Gesture-Aware Zero-Shot Speech Recognition for Patients with Language Disorders
- Authors: Seungbae Kim, Daeun Lee, Brielle Stark, Jinyoung Han
- Abstract summary: We propose a gesture-aware Automatic Speech Recognition (ASR) system with zero-shot learning for individuals with speech impairments.
Experiment results and analyses show that including gesture information significantly enhances semantic understanding.
- Score: 10.664605070306417
- Abstract: Individuals with language disorders often face significant communication challenges due to their limited language processing and comprehension abilities, which also affect their interactions with voice-assisted systems that mostly rely on Automatic Speech Recognition (ASR). Despite advancements in ASR that address disfluencies, there has been little attention on integrating non-verbal communication methods, such as gestures, which individuals with language disorders substantially rely on to supplement their communication. Recognizing the need to interpret the latent meanings of visual information not captured by speech alone, we propose a gesture-aware ASR system utilizing a multimodal large language model with zero-shot learning for individuals with speech impairments. Our experiment results and analyses show that including gesture information significantly enhances semantic understanding. This study can help develop effective communication technologies, specifically designed to meet the unique needs of individuals with language impairments.
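As a rough illustration only (the paper's prompt design and model choice are not specified here, so `call_multimodal_llm`, the prompt template, and the example strings below are assumptions), a zero-shot gesture-aware pipeline could pass the ASR transcript together with a gesture description to a multimodal LLM and ask for the intended meaning:

```python
# Minimal sketch of gesture-aware zero-shot ASR post-processing.
# `call_multimodal_llm` is a placeholder for any multimodal LLM API;
# the prompt wording is illustrative, not the paper's actual template.

def build_prompt(asr_transcript: str, gesture_description: str) -> str:
    """Combine a (possibly disfluent) transcript with gesture context."""
    return (
        "A speaker with a language disorder said (ASR transcript): "
        f'"{asr_transcript}"\n'
        f"While speaking, they gestured: {gesture_description}\n"
        "In one sentence, state the most likely intended meaning."
    )

def interpret(asr_transcript: str, gesture_description: str, call_multimodal_llm):
    """Zero-shot: no task-specific fine-tuning; the LLM fuses both cues."""
    return call_multimodal_llm(build_prompt(asr_transcript, gesture_description))

# Example with a stubbed model call:
stub = lambda prompt: "The speaker is asking for a glass of water."
print(interpret("wa... water uh cup", "raises a cupped hand toward the mouth", stub))
```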
Related papers
- SIFToM: Robust Spoken Instruction Following through Theory of Mind [51.326266354164716]
We present a cognitively inspired model, Speech Instruction Following through Theory of Mind (SIFToM), to enable robots to pragmatically follow human instructions under diverse speech conditions.
Results show that the SIFToM model outperforms state-of-the-art speech and language models, approaching human-level accuracy on challenging speech instruction following tasks.
arXiv Detail & Related papers (2024-09-17T02:36:10Z) - Scaling up Multimodal Pre-training for Sign Language Understanding [96.17753464544604]
Sign language serves as the primary means of communication for the deaf-mute community.
To facilitate communication between the deaf-mute and hearing people, a series of sign language understanding (SLU) tasks have been studied.
These tasks investigate sign language topics from diverse perspectives and raise challenges in learning effective representation of sign language videos.
arXiv Detail & Related papers (2024-08-16T06:04:25Z) - Nonverbal Interaction Detection [83.40522919429337]
This work addresses a new challenge of understanding human nonverbal interaction in social contexts.
First, we contribute a novel large-scale dataset, called NVI, which is meticulously annotated to include bounding boxes for humans and corresponding social groups.
Second, we establish a new task, NVI-DET, for nonverbal interaction detection, formalized as identifying triplets of the form <individual, group, interaction> from images (a minimal representation is sketched below).
Third, we propose a nonverbal interaction detection hypergraph (NVI-DEHR), a new approach that explicitly models high-order nonverbal interactions using hypergraphs.
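To make the triplet formulation concrete, here is a minimal (assumed, not the paper's actual schema) representation of one <individual, group, interaction> detection:

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

@dataclass
class NVITriplet:
    """One <individual, group, interaction> prediction for an image."""
    individual: Box        # bounding box of the person
    group: List[Box]       # boxes of the social group they belong to
    interaction: str       # e.g. "pointing" (label set assumed, not from the paper)
    score: float           # detector confidence

detections = [
    NVITriplet((10, 20, 80, 200),
               [(10, 20, 80, 200), (90, 25, 160, 210)],
               "pointing", 0.87),
]
```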
arXiv Detail & Related papers (2024-07-11T02:14:06Z) - The timing bottleneck: Why timing and overlap are mission-critical for conversational user interfaces, speech recognition and dialogue systems [0.11470070927586018]
We evaluate 5 major commercial ASR systems for their conversational and multilingual support.
We find that word error rates for natural conversational data in 6 languages remain abysmal, and that overlap remains a key challenge.
Our findings help to evaluate the current state of conversational ASR, contribute towards multidimensional error analysis and evaluation, and identify phenomena that need most attention on the way to build robust interactive speech technologies.
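Word error rate, the metric behind these findings, is the word-level edit distance divided by the reference length; a minimal reference implementation (standard dynamic programming, independent of the systems evaluated in the paper):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("turn the lights on", "turn lights off"))  # 0.5
```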
arXiv Detail & Related papers (2023-07-28T11:38:05Z) - A Comparative Analysis of Techniques and Algorithms for Recognising Sign Language [0.9311364633437358]
Sign language is frequently used as the primary form of communication by people with hearing loss.
It is necessary to create human-computer interface systems that can offer hearing-impaired people a social platform.
Most commercial sign language translation systems are sensor-based, pricey, and challenging to use.
arXiv Detail & Related papers (2023-05-05T10:52:18Z) - Word level Bangla Sign Language Dataset for Continuous BSL Recognition [0.0]
We develop an attention-based Bi-GRU model that captures the temporal dynamics of pose information for individuals communicating through sign language.
The accuracy of the model is reported to be 85.64%.
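The paper's exact architecture is not reproduced here; as a sketch of the general idea, an attention-pooled bidirectional GRU over per-frame pose keypoints might look as follows (input dimension, hidden size, and vocabulary size are assumptions):

```python
import torch
import torch.nn as nn

class AttnBiGRU(nn.Module):
    """Bi-GRU over pose sequences with additive attention pooling (sketch)."""
    def __init__(self, pose_dim=66, hidden=128, num_signs=100):  # sizes assumed
        super().__init__()
        self.gru = nn.GRU(pose_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)           # scores each time step
        self.classifier = nn.Linear(2 * hidden, num_signs)

    def forward(self, poses):                          # poses: (B, T, pose_dim)
        h, _ = self.gru(poses)                         # (B, T, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)   # attention over time
        context = (weights * h).sum(dim=1)             # (B, 2*hidden)
        return self.classifier(context)                # (B, num_signs) logits

logits = AttnBiGRU()(torch.randn(2, 30, 66))           # 2 clips, 30 frames each
```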
arXiv Detail & Related papers (2023-02-22T18:55:54Z) - Bridging the Gap: Using Deep Acoustic Representations to Learn Grounded Language from Percepts and Raw Speech [26.076534338576234]
Learning to understand grounded language, which connects natural language to percepts, is a critical research area.
In this work we demonstrate the feasibility of performing grounded language acquisition on paired visual percepts and raw speech inputs.
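One standard way to learn from such paired data (a common technique, not necessarily the paper's exact objective) is a symmetric contrastive loss that pulls matched speech and percept embeddings together within a batch:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(speech_emb, visual_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (speech, percept) embeddings."""
    s = F.normalize(speech_emb, dim=1)     # (B, D) speech embeddings
    v = F.normalize(visual_emb, dim=1)     # (B, D) visual embeddings
    logits = s @ v.t() / temperature       # (B, B): matched pairs on the diagonal
    targets = torch.arange(s.size(0))
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```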
arXiv Detail & Related papers (2021-12-27T16:12:30Z) - Hierarchical Summarization for Longform Spoken Dialog [1.995792341399967]
Despite the pervasiveness of spoken dialog, automated speech understanding and quality information extraction remain markedly poor.
Compared to understanding text, auditory communication poses many additional challenges such as speaker disfluencies, informal prose styles, and lack of structure.
We propose a two-stage ASR and text summarization pipeline, together with a set of semantic segmentation and merging algorithms, to resolve these speech modeling challenges.
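The segmentation and merging algorithms themselves are the paper's contribution and are not reproduced here; a rough sketch of the two-stage idea, where a new segment starts whenever adjacent sentence embeddings diverge (`transcribe`, `embed`, and `summarize` are placeholders for an ASR model, a sentence encoder, and a summarizer):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def segment(sentences, embed, threshold=0.5):
    """Start a new segment when adjacent sentences drift apart semantically."""
    segments, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            segments.append(current)
            current = []
        current.append(cur)
    segments.append(current)
    return segments

def summarize_dialog(audio, transcribe, embed, summarize):
    """Stage 1: ASR. Stage 2: segment semantically, then summarize each part."""
    sentences = transcribe(audio)          # list of sentence strings
    return [summarize(" ".join(seg)) for seg in segment(sentences, embed)]
```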
arXiv Detail & Related papers (2021-08-21T23:31:31Z) - Few-shot Language Coordination by Modeling Theory of Mind [95.54446989205117]
We study the task of few-shot language coordination.
We require the lead agent to coordinate with a population of agents with different linguistic abilities.
This requires the ability to model the partner's beliefs, a vital component of human communication.
arXiv Detail & Related papers (2021-07-12T19:26:11Z) - Speaker De-identification System using Autoencoders and Adversarial Training [58.720142291102135]
We propose a speaker de-identification system based on adversarial training and autoencoders.
Experimental results show that combining adversarial learning and autoencoders increases the equal error rate of a speaker verification system.
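A minimal sketch of that combination (layer sizes and the detached-adversary training scheme are assumptions; the paper's exact setup may differ): an autoencoder reconstructs speech features while the encoder is simultaneously trained to fool a speaker classifier attached to its bottleneck:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerDeID(nn.Module):
    """Autoencoder whose bottleneck is trained to hide speaker identity (sketch)."""
    def __init__(self, feat_dim=80, latent=64, num_speakers=100):  # sizes assumed
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, latent), nn.ReLU())
        self.decoder = nn.Linear(latent, feat_dim)
        self.speaker_head = nn.Linear(latent, num_speakers)   # the adversary

    def losses(self, x, speaker_ids, lam=0.1):
        z = self.encoder(x)
        recon = F.mse_loss(self.decoder(z), x)
        # Train the adversary on a detached code; train the encoder to fool it.
        adv = F.cross_entropy(self.speaker_head(z.detach()), speaker_ids)
        fool = -F.cross_entropy(self.speaker_head(z), speaker_ids)
        # Use two optimizers: encoder/decoder minimize the first returned loss,
        # speaker_head minimizes the second.
        return recon + lam * fool, adv
```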
arXiv Detail & Related papers (2020-11-09T19:22:05Z) - You Impress Me: Dialogue Generation via Mutual Persona Perception [62.89449096369027]
Research in cognitive science suggests that understanding is an essential signal for a high-quality chit-chat conversation.
Motivated by this, we propose P2 Bot, a transmitter-receiver based framework with the aim of explicitly modeling understanding.
arXiv Detail & Related papers (2020-04-11T12:51:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.