Evaluating and Personalizing User-Perceived Quality of Text-to-Speech
Voices for Delivering Mindfulness Meditation with Different Physical
Embodiments
- URL: http://arxiv.org/abs/2401.03581v1
- Date: Sun, 7 Jan 2024 21:14:32 GMT
- Title: Evaluating and Personalizing User-Perceived Quality of Text-to-Speech
Voices for Delivering Mindfulness Meditation with Different Physical
Embodiments
- Authors: Zhonghao Shi, Han Chen, Anna-Maria Velentza, Siqi Liu, Nathaniel
Dennler, Allison O'Connell, and Maja Matarić
- Abstract summary: We evaluated the user-perceived quality of state-of-the-art text-to-speech voices for administering mindfulness meditation.
We found that the best-rated human voice was perceived as better than all TTS voices.
By allowing users to fine-tune TTS voice features, the user-personalized TTS voices could perform almost as well as human voices.
- Score: 5.413055126487447
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Mindfulness-based therapies have been shown to be effective in improving
mental health, and technology-based methods have the potential to expand the
accessibility of these therapies. To enable real-time personalized content
generation for mindfulness practice in these methods, high-quality
computer-synthesized text-to-speech (TTS) voices are needed to provide verbal
guidance and respond to user performance and preferences. However, the
user-perceived quality of state-of-the-art TTS voices has not yet been
evaluated for administering mindfulness meditation, which requires emotional
expressiveness. In addition, the effects of physical embodiment and
personalization on the user-perceived quality of TTS voices for mindfulness
have not yet been studied. To that end, we designed a two-phase human subject
study. In Phase 1, an online Mechanical Turk between-subject study (N=471)
evaluated 3 (feminine, masculine, child-like) state-of-the-art TTS voices
alongside 2 (feminine, masculine) human therapists' voices in 3 different
physical embodiment settings (no agent, conversational agent, socially assistive robot)
with remote participants. Building on findings from Phase 1, in Phase 2 we
conducted an in-person within-subject study (N=94) that used a novel
framework we developed for personalizing TTS voices based on user
preferences, and evaluated user-perceived quality against the best-rated
non-personalized voices from Phase 1. We found that the best-rated human
voice was perceived as better than all TTS voices; the emotional
expressiveness and naturalness of TTS voices were rated poorly, while users
were satisfied with their clarity. Surprisingly, by allowing users to
fine-tune TTS voice features, the user-personalized TTS voices could perform
almost as well as human voices, suggesting that user personalization could be
a simple and very effective tool for improving the user-perceived quality of
TTS voices.
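The abstract does not detail the personalization framework itself. As a loose illustration of the general idea of letting users fine-tune TTS voice features, here is a minimal sketch of an interactive tuning loop over speaking rate and volume; the engine (pyttsx3), the sample prompt, and all parameter ranges and step sizes are stand-ins, not the authors' implementation.

```python
# Minimal sketch of user-driven TTS voice personalization: the user previews a
# meditation prompt and nudges voice features until satisfied. This is NOT the
# paper's framework; pyttsx3, the prompt, and all ranges/steps are stand-ins.
import pyttsx3

PROMPT = "Take a slow, deep breath, and gently bring your attention to the present."

def preview(engine, rate, volume):
    """Speak the prompt with the current feature settings."""
    engine.setProperty("rate", rate)      # speaking rate in words per minute
    engine.setProperty("volume", volume)  # loudness, 0.0 to 1.0
    engine.say(PROMPT)
    engine.runAndWait()

def personalize():
    engine = pyttsx3.init()
    rate, volume = 150, 0.8               # hypothetical starting values
    while True:
        preview(engine, rate, volume)
        choice = input("[s]lower / [f]aster / [q]uieter / [l]ouder / [d]one: ").lower()
        if choice == "s":
            rate = max(80, rate - 15)
        elif choice == "f":
            rate = min(250, rate + 15)
        elif choice == "q":
            volume = max(0.2, round(volume - 0.1, 1))
        elif choice == "l":
            volume = min(1.0, round(volume + 0.1, 1))
        elif choice == "d":
            return {"rate": rate, "volume": volume}

if __name__ == "__main__":
    print("Saved preferences:", personalize())
```

A real system would persist the returned settings and reuse them when generating spoken guidance in later sessions.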
Related papers
- Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech [0.13654846342364302]
FEIM-TTS is a zero-shot text-to-speech model that synthesizes emotionally expressive speech aligned with facial images.
The model is trained using LRS3, CREMA-D, and MELD datasets, demonstrating its adaptability.
By integrating emotional nuances into TTS, our model enables dynamic and engaging auditory experiences for webcomics, allowing visually impaired users to enjoy these narratives more fully.
arXiv Detail & Related papers (2024-09-24T16:01:12Z)
- Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation [71.31331402404662]
This paper proposes two novel data-efficient methods to learn dysarthric and elderly speaker-level features.
Speaker-level variance-regularized spectral basis embedding (VR-SBE) features that exploit a special regularization term to enforce speaker feature homogeneity during adaptation.
Feature-based learning hidden unit contributions (f-LHUC) conditioned on VR-SBE features, shown to be insensitive to the amount of speaker-level data in test-time adaptation.
arXiv Detail & Related papers (2024-07-08T18:20:24Z)
- Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training [14.323313455208183]
Inclusive speech technology aims to erase biases towards specific groups, such as people with certain accents.
We propose a TTS model that utilizes a Multi-Level Variational Autoencoder with adversarial learning to address accented speech synthesis and conversion.
arXiv Detail & Related papers (2024-06-03T05:56:02Z)
- Creating New Voices using Normalizing Flows [16.747198180269127]
We investigate the ability of normalizing flows in text-to-speech (TTS) and voice conversion (VC) modes to extrapolate from speakers observed during training to create unseen speaker identities.
We use both objective and subjective metrics to benchmark our techniques on 2 evaluation tasks: zero-shot and new voice speech synthesis (a toy sketch of the flow-sampling idea appears after this list).
arXiv Detail & Related papers (2023-12-22T10:00:24Z)
- ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models [83.07390037152963]
ZET-Speech is a zero-shot adaptive emotion-controllable TTS model.
It allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label.
Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers.
arXiv Detail & Related papers (2023-05-23T08:52:00Z)
- Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
arXiv Detail & Related papers (2021-04-03T13:52:47Z)
- Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training [91.95855310211176]
Emotional voice conversion aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity.
We propose a novel 2-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data.
The proposed framework can perform both spectrum and prosody conversion and achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluation.
arXiv Detail & Related papers (2021-03-31T04:56:14Z)
- AdaSpeech: Adaptive Text to Speech for Custom Voice [104.69219752194863]
We propose AdaSpeech, an adaptive TTS system for high-quality and efficient customization of new voices.
Experiment results show that AdaSpeech achieves much better adaptation quality than baseline methods, with only about 5K specific parameters for each speaker.
arXiv Detail & Related papers (2021-03-01T13:28:59Z)
- I-vector Based Within Speaker Voice Quality Identification on connected speech [3.2116198597240846]
Voice disorders affect a large portion of the population, especially heavy voice users such as teachers or call-center workers.
Most voice disorders can be treated with behavioral voice therapy, which teaches patients to replace problematic, habituated voice production mechanics.
We built two systems that automatically differentiate various voice qualities produced by the same individual.
arXiv Detail & Related papers (2021-02-15T02:26:32Z)
- VoiceCoach: Interactive Evidence-based Training for Voice Modulation Skills in Public Speaking [55.366941476863644]
The modulation of voice properties, such as pitch, volume, and speed, is crucial for delivering a successful public speech.
We present VoiceCoach, an interactive evidence-based approach to facilitate the effective training of voice modulation skills.
arXiv Detail & Related papers (2020-01-22T04:52:06Z)
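As a toy illustration of the "Creating New Voices using Normalizing Flows" entry above: fit an invertible flow on known speaker embeddings, then sample the Gaussian base distribution and invert the flow to obtain embeddings for unseen voices. The embedding dimensionality, architecture, and training data below are hypothetical placeholders, not the paper's models.

```python
# Toy RealNVP-style flow over speaker embeddings: train by maximum likelihood
# on known speakers, then invert samples from the base Gaussian to create
# "new" speaker embeddings. All sizes and data here are placeholders.
import math
import torch
import torch.nn as nn

DIM = 16  # hypothetical speaker-embedding dimensionality

class AffineCoupling(nn.Module):
    """Transform half the dims with a scale/shift conditioned on the other half."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):                      # x -> z, plus log|det J|
        xa, xb = x[:, :self.half], x[:, self.half:]
        s, t = self.net(xa).chunk(2, dim=1)
        s = torch.tanh(s)                      # bound scales for stability
        return torch.cat([xa, xb * torch.exp(s) + t], dim=1), s.sum(dim=1)

    def inverse(self, z):                      # z -> x
        za, zb = z[:, :self.half], z[:, self.half:]
        s, t = self.net(za).chunk(2, dim=1)
        s = torch.tanh(s)
        return torch.cat([za, (zb - t) * torch.exp(-s)], dim=1)

class Flow(nn.Module):
    def __init__(self, dim, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(AffineCoupling(dim) for _ in range(n_layers))

    def forward(self, x):
        logdet = 0.0
        for layer in self.layers:
            x, ld = layer(x)
            logdet = logdet + ld
            x = x.flip(1)                      # mix dims between couplings
        return x, logdet

    def inverse(self, z):
        for layer in reversed(self.layers):
            z = layer.inverse(z.flip(1))
        return z

def nll(flow, x):
    """Negative log-likelihood under a standard normal base distribution."""
    z, logdet = flow(x)
    log_pz = -0.5 * (z ** 2).sum(dim=1) - 0.5 * DIM * math.log(2 * math.pi)
    return -(log_pz + logdet).mean()

flow = Flow(DIM)
opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
known = torch.randn(256, DIM) * 0.5 + 1.0      # placeholder for real embeddings
for _ in range(200):
    opt.zero_grad()
    loss = nll(flow, known)
    loss.backward()
    opt.step()

with torch.no_grad():
    new_speakers = flow.inverse(torch.randn(8, DIM))  # embeddings for unseen voices
```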