Your voice is your voice: Supporting Self-expression through Speech Generation and LLMs in Augmented and Alternative Communication
- URL: http://arxiv.org/abs/2503.17479v1
- Date: Fri, 21 Mar 2025 18:50:05 GMT
- Title: Your voice is your voice: Supporting Self-expression through Speech Generation and LLMs in Augmented and Alternative Communication
- Authors: Yiwen Xu, Monideep Chakraborti, Tianyi Zhang, Katelyn Eng, Aanchan Mohan, Mirjana Prpa
- Abstract summary: Speak Ease is an augmentative and alternative communication system to support users' expressivity. The system integrates multimodal input, including text, voice, and contextual cues, with large language models.
- Score: 9.812902134556971
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this paper, we present Speak Ease: an augmentative and alternative communication (AAC) system to support users' expressivity by integrating multimodal input, including text, voice, and contextual cues (conversational partner and emotional tone), with large language models (LLMs). Speak Ease combines automatic speech recognition (ASR), context-aware LLM-based outputs, and personalized text-to-speech technologies to enable more personalized, natural-sounding, and expressive communication. Through an exploratory feasibility study and focus group evaluation with speech and language pathologists (SLPs), we assessed Speak Ease's potential to enable expressivity in AAC. The findings highlight the priorities and needs of AAC users and the system's ability to enhance user expressivity by supporting more personalized and contextually relevant communication. This work provides insights into the use of multimodal inputs and LLM-driven features to improve AAC systems and support expressivity.
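The abstract describes a three-stage pipeline: ASR transcribes the user's input, an LLM adapts it to contextual cues (conversational partner and emotional tone), and personalized TTS speaks the result. The sketch below illustrates that flow under stated assumptions; the function names, `Context` fields, and prompt template are hypothetical placeholders, not the authors' implementation.

```python
# A minimal sketch of an ASR -> context-aware LLM -> personalized TTS
# pipeline as described in the abstract. All component callables are
# hypothetical placeholders, not the Speak Ease codebase.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Context:
    partner: str          # conversational partner, e.g. "family member"
    emotional_tone: str   # desired tone, e.g. "warm"

def speak_ease_turn(
    audio: bytes,
    context: Context,
    transcribe: Callable[[bytes], str],   # ASR backend (assumed)
    generate: Callable[[str], str],       # LLM backend (assumed)
    synthesize: Callable[[str], bytes],   # personalized TTS backend (assumed)
) -> bytes:
    """One communication turn: voice input to expressive speech output."""
    draft = transcribe(audio)
    prompt = (
        f"Rewrite this message for a {context.partner} "
        f"in a {context.emotional_tone} tone, preserving the user's intent:\n{draft}"
    )
    utterance = generate(prompt)
    return synthesize(utterance)
```

Keeping the three components as injected callables mirrors the modular framing in the abstract: each stage (ASR, LLM, TTS) can be swapped or personalized independently.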
Related papers
- VoiceBench: Benchmarking LLM-Based Voice Assistants [58.84144494938931]
We introduce VoiceBench, the first benchmark designed to evaluate voice assistants built on large language models (LLMs).
VoiceBench includes both real and synthetic spoken instructions that incorporate three key real-world variations.
Extensive experiments reveal the limitations of current LLM-based voice assistant models and offer valuable insights for future research and development in this field.
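A hedged sketch of what an evaluation loop over such a benchmark could look like; the example fields (`audio`, `reference`) and the `score` callable are assumptions, not VoiceBench's actual API.

```python
# A sketch of a benchmark-style evaluation loop: spoken instructions go in,
# responses come out, and each response is scored against a reference.
from typing import Callable, Iterable

def evaluate_assistant(
    assistant: Callable[[bytes], str],    # audio in, text response out
    examples: Iterable[dict],             # assumed: {"audio": bytes, "reference": str}
    score: Callable[[str, str], float],   # e.g. an LLM-as-judge metric (assumed)
) -> float:
    scores = [score(assistant(ex["audio"]), ex["reference"]) for ex in examples]
    return sum(scores) / max(len(scores), 1)
```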
arXiv Detail & Related papers (2024-10-22T17:15:20Z)
- DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech [14.323313455208183]
We propose a novel approach to disentangle speaker and accent representations using multi-level variational autoencoders (ML-VAE) and vector quantization (VQ).
Our proposed method addresses the challenge of effectively separating speaker and accent characteristics, enabling more fine-grained control over the synthesized speech.
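A minimal PyTorch sketch of the disentanglement idea named in the summary: separate encoders for speaker and accent, with the accent latent vector-quantized against a learned codebook. Layer shapes are illustrative assumptions, not the paper's ML-VAE configuration.

```python
# Two encoders split speaker and accent information; the accent latent is
# snapped to its nearest codebook entry (vector quantization).
import torch
import torch.nn as nn

class DisentangledEncoder(nn.Module):
    def __init__(self, feat_dim=80, latent_dim=64, codebook_size=128):
        super().__init__()
        self.speaker_enc = nn.Linear(feat_dim, latent_dim)
        self.accent_enc = nn.Linear(feat_dim, latent_dim)
        self.codebook = nn.Embedding(codebook_size, latent_dim)  # VQ codes

    def forward(self, mel):  # mel: (batch, feat_dim) pooled utterance features
        speaker = self.speaker_enc(mel)
        accent = self.accent_enc(mel)
        # Quantize: replace each accent latent with its nearest code vector.
        dists = torch.cdist(accent, self.codebook.weight)
        accent_q = self.codebook(dists.argmin(dim=-1))
        return speaker, accent_q
```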
arXiv Detail & Related papers (2024-10-17T08:51:46Z)
- Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech [29.847183061204436]
Large language models (LLMs) should ideally take users' emotions or speaking styles into account when providing their responses.
In this work, we use an end-to-end system with a speech encoder.
We find that this training framework allows the encoder to generate tokens that capture both semantic and paralinguistic information in speech.
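A sketch of the general setup the summary describes: a trainable encoder maps speech features to a short sequence of embeddings that a frozen LLM consumes as a prefix. Module shapes are assumptions; the actual encoder and LLM are not specified in this summary.

```python
# A trainable speech encoder produces a fixed-length sequence of
# LLM-dimension embeddings; the LLM itself stays frozen during training.
import torch.nn as nn

class SpeechToTokens(nn.Module):
    def __init__(self, feat_dim=80, llm_dim=4096, n_tokens=16):
        super().__init__()
        self.proj = nn.Linear(feat_dim, llm_dim)
        self.pool = nn.AdaptiveAvgPool1d(n_tokens)  # fixed-length token sequence

    def forward(self, feats):  # feats: (batch, time, feat_dim)
        x = self.proj(feats)                 # (batch, time, llm_dim)
        x = self.pool(x.transpose(1, 2))     # (batch, llm_dim, n_tokens)
        return x.transpose(1, 2)             # prefix embeddings for a frozen LLM

# Only the encoder is updated; the LLM's weights are kept frozen, e.g.:
# for p in llm.parameters(): p.requires_grad_(False)
```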
arXiv Detail & Related papers (2024-10-02T01:32:47Z)
- DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs).
We present a simple yet effective automatic process for creating speech-text pair data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
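A loosely hedged sketch of one way such an automatic pairing process could work: combine a transcript with paralinguistic metadata into a prompt and let a text-only LLM produce the target response. The prompt template, metadata fields, and `llm` callable are all assumptions, not the paper's published recipe.

```python
# Build (prompt, response) training pairs from speech metadata using a text
# LLM; no speech instruction-tuning data is required in this sketch.
from typing import Callable

def make_speech_text_pair(
    transcript: str,
    metadata: dict,                 # assumed, e.g. {"emotion": "happy"}
    llm: Callable[[str], str],      # text-only LLM backend (assumed)
) -> tuple[str, str]:
    description = ", ".join(f"{k}: {v}" for k, v in metadata.items())
    prompt = f'[Speech: "{transcript}" ({description})] Respond to the speaker.'
    return prompt, llm(prompt)
```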
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
- EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions [152.41217651729738]
We propose EMOVA (EMotionally Omni-present Voice Assistant) to enable Large Language Models with end-to-end speech abilities.
With a semantic-acoustic disentangled speech tokenizer, we surprisingly notice that omni-modal alignment can further enhance vision-language and speech abilities.
For the first time, EMOVA achieves state-of-the-art performance on both the vision-language and speech benchmarks.
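A minimal sketch of what a semantic-acoustic disentangled tokenizer, as named in the summary, could look like: one codebook quantizes content and another quantizes style. The two-encoder split and all dimensions are illustrative assumptions.

```python
# Separate content and style encoders feed separate codebooks, yielding two
# discrete token streams (semantic vs. acoustic) per frame.
import torch
import torch.nn as nn

class DisentangledTokenizer(nn.Module):
    def __init__(self, feat_dim=80, dim=256, n_semantic=512, n_acoustic=64):
        super().__init__()
        self.content_enc = nn.Linear(feat_dim, dim)
        self.style_enc = nn.Linear(feat_dim, dim)
        self.semantic_codes = nn.Embedding(n_semantic, dim)
        self.acoustic_codes = nn.Embedding(n_acoustic, dim)

    @staticmethod
    def quantize(x, codebook):
        # Nearest-code lookup via squared Euclidean distance.
        d = ((x.unsqueeze(-2) - codebook.weight) ** 2).sum(dim=-1)
        return d.argmin(dim=-1)  # discrete token ids

    def forward(self, feats):  # feats: (batch, time, feat_dim)
        return (
            self.quantize(self.content_enc(feats), self.semantic_codes),
            self.quantize(self.style_enc(feats), self.acoustic_codes),
        )
```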
arXiv Detail & Related papers (2024-09-26T16:44:02Z)
- Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions [68.98811048970963]
We present a pioneering effort to investigate the capability of large language models (LLMs) in transcribing speech in multi-talker environments.
Our approach utilizes WavLM and Whisper encoders to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context.
Comprehensive experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios.
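A sketch of the feature-fusion wiring suggested by the summary: concatenate speaker-sensitive (WavLM-style) and semantic (Whisper-style) representations, then project into the LLM's embedding space. The concatenate-and-project design is an assumption, not necessarily MT-LLM's exact architecture.

```python
# Fuse two complementary speech representations before handing them to an LLM.
import torch
import torch.nn as nn

class FusedSpeechFrontend(nn.Module):
    def __init__(self, wavlm_dim=1024, whisper_dim=1280, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(wavlm_dim + whisper_dim, llm_dim)

    def forward(self, wavlm_feats, whisper_feats):
        # Both: (batch, time, dim); assumed pre-aligned to one frame rate.
        fused = torch.cat([wavlm_feats, whisper_feats], dim=-1)
        return self.proj(fused)  # embeddings consumed by the LLM
```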
arXiv Detail & Related papers (2024-09-13T07:28:28Z)
- Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation [46.93969003104427]
This paper introduces an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM).
USDM is designed to generate coherent spoken responses with naturally occurring prosodic features relevant to the given input speech.
Our approach effectively generates natural-sounding spoken responses, surpassing previous and cascaded baselines.
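A heavily hedged sketch of one common way to realize a unified speech-text model: place discrete speech tokens and text tokens in a single sequence over a shared vocabulary, so one LLM models both modalities. The token layout is an assumption, not USDM's published scheme.

```python
# Interleave text and (offset) speech token ids into one sequence so a
# single embedding table covers both modalities.
def build_interleaved_sequence(text_ids: list[int],
                               speech_ids: list[int],
                               speech_offset: int = 50000) -> list[int]:
    # Shift speech ids into a reserved vocabulary range, then concatenate.
    return text_ids + [i + speech_offset for i in speech_ids]
```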
arXiv Detail & Related papers (2024-02-08T14:35:09Z)
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
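The recipe itself is concise enough to sketch: synthesize target-language speech from text with a single-speaker TTS, then voice-convert each utterance into multiple voices to diversify the ASR training data. The `tts` and `convert_voice` backends below are assumed placeholders, not tools named by the paper.

```python
# Generate (audio, transcript) ASR training pairs from one real speaker by
# chaining speech synthesis with voice conversion.
from typing import Callable, Iterable

def augment_asr_data(
    texts: Iterable[str],
    voices: Iterable[str],
    tts: Callable[[str], bytes],                  # single-speaker TTS (assumed)
    convert_voice: Callable[[bytes, str], bytes], # voice conversion (assumed)
) -> list[tuple[bytes, str]]:
    pairs = []
    for text in texts:
        base = tts(text)
        for voice in voices:
            pairs.append((convert_voice(base, voice), text))
    return pairs
```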
arXiv Detail & Related papers (2022-03-29T11:55:30Z)
- Voice Privacy with Smart Digital Assistants in Educational Settings [1.8369974607582578]
We design and evaluate a practical and efficient framework for voice privacy at the source.
The approach combines speaker identification (SID) and speech conversion methods to randomly disguise the identity of users right on the device that records the speech.
We evaluate the ASR performance of the conversion in terms of word error rate and show the promise of this framework in preserving the content of the input speech.
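A minimal sketch of the on-device flow described: identify the speaker, then convert the voice to a randomly chosen different target before the audio leaves the device. The component callables are assumptions about interfaces, not the paper's code.

```python
# Disguise speaker identity at the source: SID picks out who is speaking,
# voice conversion maps them to a random other voice.
import random
from typing import Callable, Sequence

def disguise_on_device(
    audio: bytes,
    identify: Callable[[bytes], str],        # speaker identification (assumed)
    convert: Callable[[bytes, str], bytes],  # voice conversion (assumed)
    target_voices: Sequence[str],
) -> bytes:
    speaker = identify(audio)
    # Prefer any target voice other than the detected speaker.
    candidates = [v for v in target_voices if v != speaker]
    return convert(audio, random.choice(candidates or list(target_voices)))
```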
arXiv Detail & Related papers (2021-03-24T19:58:45Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
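A PyTorch sketch of the self-adaptation idea: pool a speaker embedding directly from the (noisy) test utterance and condition a masking network on it, here via FiLM-style scaling rather than the paper's multi-head self-attention. All shapes and the conditioning mechanism are illustrative assumptions.

```python
# Condition a spectral-mask enhancer on a speaker embedding pooled from the
# test utterance itself (no enrollment audio needed in this sketch).
import torch
import torch.nn as nn

class SelfAdaptiveEnhancer(nn.Module):
    def __init__(self, feat_dim=257, spk_dim=128):
        super().__init__()
        self.spk_pool = nn.Linear(feat_dim, spk_dim)
        self.film = nn.Linear(spk_dim, 2 * feat_dim)   # scale and shift
        self.mask_net = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.Sigmoid())

    def forward(self, noisy):  # noisy: (batch, time, feat_dim) spectrogram
        spk = self.spk_pool(noisy.mean(dim=1))          # utterance-level embedding
        scale, shift = self.film(spk).chunk(2, dim=-1)
        conditioned = noisy * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return noisy * self.mask_net(conditioned)       # masked (enhanced) output
```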
arXiv Detail & Related papers (2020-02-14T05:05:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and accepts no responsibility for any consequences of its use.