VoiceCoach: Interactive Evidence-based Training for Voice Modulation Skills in Public Speaking
- URL: http://arxiv.org/abs/2001.07876v1
- Date: Wed, 22 Jan 2020 04:52:06 GMT
- Title: VoiceCoach: Interactive Evidence-based Training for Voice Modulation Skills in Public Speaking
- Authors: Xingbo Wang, Haipeng Zeng, Yong Wang, Aoyu Wu, Zhida Sun, Xiaojuan Ma, Huamin Qu
- Abstract summary: The modulation of voice properties, such as pitch, volume, and speed, is crucial for delivering a successful public speech.
We present VoiceCoach, an interactive evidence-based approach to facilitate the effective training of voice modulation skills.
- Score: 55.366941476863644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The modulation of voice properties, such as pitch, volume, and speed, is
crucial for delivering a successful public speech. However, it is challenging
to master different voice modulation skills. Though many guidelines are
available, they are often not practical enough to be applied in different
public speaking situations, especially for novice speakers. We present
VoiceCoach, an interactive evidence-based approach to facilitate the effective
training of voice modulation skills. Specifically, we have analyzed the voice
modulation skills from 2623 high-quality speeches (i.e., TED Talks) and use
them as the benchmark dataset. Given a voice input, VoiceCoach automatically
recommends good voice modulation examples from the dataset based on the
similarity of both sentence structures and voice modulation skills. Immediate
and quantitative visual feedback is provided to guide further improvement.
Expert interviews and a user study support the effectiveness and usability of
VoiceCoach.
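To make the recommendation step concrete, below is a minimal sketch of the recommend-by-similarity idea: summarize a clip by pitch, volume, and speed statistics, then retrieve the closest reference clip. The feature set, the librosa/scikit-learn tooling, and the file names are assumptions for illustration, not the paper's actual pipeline (which also matches on sentence structure).

```python
# Minimal sketch, assuming librosa and scikit-learn; feature choices and
# file names are illustrative placeholders, not VoiceCoach's real pipeline.
import numpy as np
import librosa
from sklearn.neighbors import NearestNeighbors

def modulation_features(path: str) -> np.ndarray:
    """Summarize a clip by pitch, volume, and speed statistics."""
    y, sr = librosa.load(path, sr=16000)
    # Pitch: fundamental-frequency track (NaN on unvoiced frames).
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"), sr=sr)
    f0 = f0[~np.isnan(f0)]
    # Volume: frame-wise root-mean-square energy.
    rms = librosa.feature.rms(y=y)[0]
    # Speed: onset rate as a crude speaking-rate proxy.
    onsets = librosa.onset.onset_detect(y=y, sr=sr)
    duration = len(y) / sr
    return np.array([
        f0.mean() if f0.size else 0.0,  # mean pitch (Hz)
        f0.std() if f0.size else 0.0,   # pitch variability
        rms.mean(),                     # mean loudness
        rms.std(),                      # loudness variability
        len(onsets) / duration,         # onsets per second
    ])

# Index the benchmark clips (hypothetical TED excerpts on disk).
reference_paths = ["ted_0001.wav", "ted_0002.wav"]
index = NearestNeighbors(n_neighbors=1, metric="cosine")
index.fit(np.stack([modulation_features(p) for p in reference_paths]))

# Recommend the reference whose modulation profile best matches the input.
query = modulation_features("novice_take.wav").reshape(1, -1)
_, nearest = index.kneighbors(query)
print("Closest reference example:", reference_paths[nearest[0][0]])
```

In the full system, the retrieved example would then drive the visual feedback step, e.g., contrasting the novice's pitch contour with the reference's.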
Related papers
- VoiceBench: Benchmarking LLM-Based Voice Assistants [58.84144494938931]
We introduce VoiceBench, the first benchmark to evaluate voice assistants based on large language models (LLMs).
VoiceBench includes both real and synthetic spoken instructions that incorporate key real-world variations.
Extensive experiments reveal the limitations of current LLM-based voice assistant models and offer valuable insights for future research and development in this field.
arXiv Detail & Related papers (2024-10-22T17:15:20Z)
- Character-aware audio-visual subtitling in context [58.95580154761008]
This paper presents an improved framework for character-aware audio-visual subtitling in TV shows.
Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues.
We validate the method on a dataset with 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches.
arXiv Detail & Related papers (2024-10-14T20:27:34Z)
- Creating New Voices using Normalizing Flows [16.747198180269127]
We investigate the ability of normalizing flows in text-to-speech (TTS) and voice conversion (VC) modes to extrapolate from speakers observed during training to create unseen speaker identities.
We use both objective and subjective metrics to benchmark our techniques on two evaluation tasks: zero-shot and new-voice speech synthesis.
arXiv Detail & Related papers (2023-12-22T10:00:24Z)
- PerMod: Perceptually Grounded Voice Modification with Latent Diffusion Models [5.588733538696248]
PerMod is a conditional latent diffusion model that takes in an input voice and a perceptual qualities vector.
Unlike prior work, PerMod generates a new voice corresponding to specific perceptual modifications.
We demonstrate that PerMod produces voices with the desired perceptual qualities for typical voices, but performs poorly on atypical voices.
arXiv Detail & Related papers (2023-12-13T20:14:27Z)
- Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z)
- ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations [12.20522794248598]
We propose a zero-shot voice conversion method using speech representations trained with self-supervised learning.
We develop a multi-task model to decompose a speech utterance into features such as linguistic content, speaker characteristics, and speaking style.
Next, we develop a synthesis model with pitch and duration predictors that can effectively reconstruct the speech signal from its representation.
arXiv Detail & Related papers (2023-02-16T08:10:41Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- HiFi-VC: High Quality ASR-Based Voice Conversion [0.0]
We propose a new any-to-any voice conversion pipeline.
Our approach uses automated speech recognition features, pitch tracking, and a state-of-the-art waveform prediction model.
arXiv Detail & Related papers (2022-03-31T10:45:32Z)
- Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z)
- JukeBox: A Multilingual Singer Recognition Dataset [17.33151600403503]
JukeBox is a speaker recognition dataset with multilingual singing voice audio annotated with singer identity, gender, and language labels.
We use the current state-of-the-art methods to demonstrate the difficulty of performing speaker recognition on singing voice using models trained on spoken voice alone.
arXiv Detail & Related papers (2020-08-08T12:22:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.