No More Mumbles: Enhancing Robot Intelligibility through Speech Adaptation
- URL: http://arxiv.org/abs/2405.09708v1
- Date: Wed, 15 May 2024 21:28:55 GMT
- Title: No More Mumbles: Enhancing Robot Intelligibility through Speech Adaptation
- Authors: Qiaoqiao Ren, Yuanbo Hou, Dick Botteldooren, Tony Belpaeme
- Abstract summary: We conduct a speech comprehension study involving 39 participants.
The experiment's primary outcome shows that spaces with good acoustic quality positively correlate with intelligibility and user experience.
We develop a convolutional neural network model to adapt the robot's speech parameters to different users and spaces.
- Score: 7.675340768192281
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Spoken language interaction is at the heart of interpersonal communication, and people flexibly adapt their speech to different individuals and environments. It is surprising that robots, and by extension other digital devices, are not equipped to adapt their speech and instead rely on fixed speech parameters, which often hinder comprehension by the user. We conducted a speech comprehension study involving 39 participants who were exposed to different environmental and contextual conditions. During the experiment, the robot articulated words using different vocal parameters, and the participants were tasked with both recognising the spoken words and rating their subjective impression of the robot's speech. The experiment's primary outcome shows that spaces with good acoustic quality positively correlate with intelligibility and user experience. However, increasing the distance between the user and the robot exacerbated the user experience, while distracting background sounds significantly reduced speech recognition accuracy and user satisfaction. We next built an adaptive voice for the robot. For this, the robot needs to know how difficult it is for a user to understand spoken language in a particular setting. We present a prediction model that rates how annoying the ambient acoustic environment is and, consequentially, how hard it is to understand someone in this setting. Then, we develop a convolutional neural network model to adapt the robot's speech parameters to different users and spaces, while taking into account the influence of ambient acoustics on intelligibility. Finally, we present an evaluation with 27 users, demonstrating superior intelligibility and user experience with adaptive voice parameters compared to fixed voice.
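The abstract describes a convolutional model that maps the ambient acoustic environment to adapted speech parameters. As a minimal sketch of that idea, the toy model below maps an ambient spectrogram to three bounded speech parameters (volume, pitch, speaking rate) via a 1-D convolution over time, ReLU, global average pooling, and a linear head. The architecture, shapes, parameter names, and random weights are all illustrative assumptions, not the authors' actual network.

```python
import numpy as np

rng = np.random.default_rng(0)

N_MELS, N_FRAMES, N_FILTERS, KERNEL = 40, 100, 8, 5
W_conv = rng.normal(0, 0.1, (N_FILTERS, N_MELS, KERNEL))  # conv kernels over time
W_out = rng.normal(0, 0.1, (3, N_FILTERS))                # linear head -> 3 params

def predict_speech_params(spectrogram: np.ndarray) -> np.ndarray:
    """Map an ambient (n_mels, n_frames) spectrogram to [volume, pitch, rate]."""
    n_frames = spectrogram.shape[1]
    # Valid 1-D convolution along the time axis, one response per filter.
    conv = np.stack([
        [np.sum(W_conv[f] * spectrogram[:, t:t + KERNEL])
         for t in range(n_frames - KERNEL + 1)]
        for f in range(N_FILTERS)
    ])
    conv = np.maximum(conv, 0.0)          # ReLU
    pooled = conv.mean(axis=1)            # global average pooling over time
    logits = W_out @ pooled
    return 1.0 / (1.0 + np.exp(-logits))  # squash each parameter into (0, 1)

params = predict_speech_params(rng.normal(size=(N_MELS, N_FRAMES)))
```

In a real system the (0, 1) outputs would be rescaled to concrete TTS settings (e.g. dB, Hz, words per minute); the training signal described in the paper combines per-user intelligibility with a model of how annoying the ambient acoustics are.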
Related papers
- Moshi: a speech-text foundation model for real-time dialogue [78.88479749811376]
Current systems for spoken dialogue rely on pipelines of independent components, such as voice activity detection, speech recognition, and text-to-speech.
We show how Moshi can provide streaming speech recognition and text-to-speech.
Our resulting model is the first real-time full-duplex spoken large language model.
arXiv Detail & Related papers (2024-09-17T17:55:39Z) - SIFToM: Robust Spoken Instruction Following through Theory of Mind [51.326266354164716]
We present a cognitively inspired model, Speech Instruction Following through Theory of Mind (SIFToM), to enable robots to pragmatically follow human instructions under diverse speech conditions.
Results show that the SIFToM model outperforms state-of-the-art speech and language models, approaching human-level accuracy on challenging speech instruction following tasks.
arXiv Detail & Related papers (2024-09-17T02:36:10Z) - Humane Speech Synthesis through Zero-Shot Emotion and Disfluency Generation [0.6964027823688135]
Modern conversational systems lack the emotional depth and disfluent characteristics of human interactions.
To address this shortcoming, we have designed an innovative speech synthesis pipeline.
Within this framework, a cutting-edge language model introduces both human-like emotion and disfluencies in a zero-shot setting.
arXiv Detail & Related papers (2024-03-31T00:38:02Z) - Speech-Gesture GAN: Gesture Generation for Robots and Embodied Agents [5.244401764969407]
Embodied agents, in the form of virtual agents or social robots, are rapidly becoming more widespread.
We propose a novel framework that can generate sequences of joint angles from speech text and speech audio utterances.
arXiv Detail & Related papers (2023-09-17T18:46:25Z) - Read the Room: Adapting a Robot's Voice to Ambient and Social Contexts [1.0732907121422146]
We describe a process and results toward selecting robot voice styles for perceived social appropriateness and ambiance awareness.
Our results with N=120 participants provide evidence that the choice of voice style in different ambiances impacted a robot's perceived intelligence.
arXiv Detail & Related papers (2022-05-10T15:10:23Z) - Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement
by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z) - Whither the Priors for (Vocal) Interactivity? [6.709659274527638]
Speech-based communication is often cited as one of the most 'natural' ways in which humans and robots might interact.
Despite this, the resulting interactions are anything but 'natural'.
It is argued here that such communication failures are indicative of a deeper malaise.
arXiv Detail & Related papers (2022-03-16T12:06:46Z) - Textless Speech Emotion Conversion using Decomposed and Discrete
Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
arXiv Detail & Related papers (2021-11-14T18:16:42Z) - Self-supervised reinforcement learning for speaker localisation with the
iCub humanoid robot [58.2026611111328]
Looking at a person's face is one of the mechanisms that humans rely on when it comes to filtering speech in noisy environments.
Having a robot that can look toward a speaker could benefit ASR performance in challenging environments.
We propose a self-supervised reinforcement learning-based framework inspired by the early development of humans.
arXiv Detail & Related papers (2020-11-12T18:02:15Z) - Learning Explicit Prosody Models and Deep Speaker Embeddings for
Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.