Speak Like a Dog: Human to Non-human creature Voice Conversion
- URL: http://arxiv.org/abs/2206.04780v1
- Date: Thu, 9 Jun 2022 22:10:43 GMT
- Title: Speak Like a Dog: Human to Non-human creature Voice Conversion
- Authors: Kohei Suzuki, Shoki Sakamoto, Tadahiro Taniguchi, Hirokazu Kameoka
- Abstract summary: H2NH-VC aims to convert human speech into non-human creature-like speech.
To clarify the possibilities and characteristics of the "speak like a dog" task, we conducted a comparative experiment.
The converted voices were evaluated using mean opinion scores for dog-likeness, sound quality, and intelligibility, as well as the character error rate (CER).
- Score: 19.703397078178
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a new voice conversion (VC) task from human speech to
dog-like speech while preserving linguistic information as an example of human
to non-human creature voice conversion (H2NH-VC) tasks. Although most VC
studies deal with human to human VC, H2NH-VC aims to convert human speech into
non-human creature-like speech. Non-parallel VC allows us to develop H2NH-VC,
because we cannot collect a parallel dataset in which non-human creatures speak
human language. In this study, we propose to use dogs as an example of a
non-human creature target domain and define the "speak like a dog" task. To
clarify the possibilities and characteristics of the "speak like a dog" task,
we conducted a comparative experiment using existing representative
non-parallel VC methods in acoustic features (Mel-cepstral coefficients and
Mel-spectrograms), network architectures (five different kernel-size settings),
and training criteria (variational autoencoder (VAE)-based and generative
adversarial network-based). Finally, the converted voices were evaluated using
mean opinion scores for dog-likeness, sound quality, and intelligibility, as well
as the character error rate (CER). The experiment showed that using the
Mel-spectrogram improved the dog-likeness of the converted speech, while
preserving linguistic information remains challenging. Challenges and limitations of
the current VC methods for H2NH-VC are highlighted.
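To make the evaluation setup above concrete, the sketch below extracts the two kinds of acoustic features compared in the experiment (a Mel-spectrogram and Mel-cepstral-style coefficients, approximated here with librosa MFCCs rather than the WORLD/SPTK extractors typically used in VC work) and computes a character error rate between a reference transcript and an ASR hypothesis. The file name, sampling rate, and feature dimensions are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch (not the paper's exact pipeline): extract the two acoustic
# feature types compared in the experiment and compute a character error rate.
# Assumptions: 16 kHz audio, 80 mel bands, 24 cepstral coefficients; MFCCs are
# used as a stand-in for Mel-cepstral coefficients.
import librosa
import numpy as np

def extract_features(wav_path: str, sr: int = 16000):
    y, _ = librosa.load(wav_path, sr=sr)
    # Log Mel-spectrogram, shape: (n_mels, frames)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=80)
    log_mel = librosa.power_to_db(mel)
    # Cepstral features as a rough proxy for Mel-cepstral coefficients
    mcep_like = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=24,
                                     n_fft=1024, hop_length=256)
    return log_mel, mcep_like

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate = character-level edit distance / reference length."""
    r, h = list(reference), list(hypothesis)
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=np.int32)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,        # deletion
                          d[i, j - 1] + 1,        # insertion
                          d[i - 1, j - 1] + cost) # substitution
    return d[len(r), len(h)] / max(len(r), 1)

if __name__ == "__main__":
    log_mel, mcep_like = extract_features("converted_utterance.wav")  # hypothetical file
    print(log_mel.shape, mcep_like.shape)
    print(cer("speak like a dog", "speak like a dock"))  # 2 edits / 16 chars = 0.125
```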
Related papers
- Towards Dog Bark Decoding: Leveraging Human Speech Processing for Automated Bark Classification [23.974783158267428]
We explore the use of self-supervised speech representation models pre-trained on human speech to address dog bark classification tasks.
We show that using speech embedding representations significantly improves over simpler classification baselines.
We also find that models pre-trained on large human speech acoustics can provide additional performance boosts on several tasks.
arXiv Detail & Related papers (2024-04-29T14:41:59Z)
- SpeechAlign: Aligning Speech Generation to Human Preferences [51.684183257809075]
We introduce SpeechAlign, an iterative self-improvement strategy that aligns speech language models to human preferences.
We show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech language model.
arXiv Detail & Related papers (2024-04-08T15:21:17Z)
- Can Language Models Learn to Listen? [96.01685069483025]
We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words.
Our approach autoregressively predicts a response of a listener: a sequence of listener facial gestures, quantized using a VQ-VAE.
We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study.
arXiv Detail & Related papers (2023-08-21T17:59:02Z)
- Time out of Mind: Generating Rate of Speech conditioned on emotion and speaker [0.0]
We train a GAN conditioned on emotion to generate word lengths for a given input text.
These word lengths are relative to neutral speech and can be provided to a text-to-speech system to generate more expressive speech.
We were able to achieve better performance on objective measures for neutral speech, and better time alignment for happy speech, when compared to an out-of-the-box model.
arXiv Detail & Related papers (2023-01-29T02:58:01Z)
- Toward a realistic model of speech processing in the brain with self-supervised learning [67.7130239674153]
Self-supervised algorithms trained on the raw waveform constitute a promising candidate.
We show that Wav2Vec 2.0 learns brain-like representations with as little as 600 hours of unlabelled speech.
arXiv Detail & Related papers (2022-06-03T17:01:46Z)
- Training Robust Zero-Shot Voice Conversion Models with Self-supervised Features [24.182732872327183]
Unsupervised Zero-Shot Voice Conversion (VC) aims to modify the speaker characteristic of an utterance to match an unseen target speaker.
We show that high-quality audio samples can be achieved by using a length resampling decoder.
arXiv Detail & Related papers (2021-12-08T17:27:39Z)
- V2C: Visual Voice Cloning [55.55301826567474]
We propose a new task named Visual Voice Cloning (V2C)
V2C seeks to convert a paragraph of text to a speech with both desired voice specified by a reference audio and desired emotion specified by a reference video.
Our dataset contains 10,217 animated movie clips covering a large variety of genres.
arXiv Detail & Related papers (2021-11-25T03:35:18Z)
- StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion [19.74933410443264]
We present an unsupervised many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2.
Our model is trained only with 20 English speakers.
It generalizes to a variety of voice conversion tasks, such as any-to-many, cross-lingual, and singing conversion.
arXiv Detail & Related papers (2021-07-21T23:44:17Z)
- An Adaptive Learning based Generative Adversarial Network for One-To-One Voice Conversion [9.703390665821463]
We propose an adaptive learning-based GAN model called ALGAN-VC for an efficient one-to-one VC of speakers.
The model is tested on Voice Conversion Challenge (VCC) 2016, 2018, and 2020 datasets as well as on our self-prepared speech dataset.
A subjective and objective evaluation of the generated speech samples indicated that the proposed model elegantly performed the voice conversion task.
arXiv Detail & Related papers (2021-04-25T13:44:32Z)
- Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z)
- VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net architecture [71.45920122349628]
Auto-encoder-based VC methods disentangle the speaker and the content in input speech without being given the speaker's identity.
We use the U-Net architecture within an auto-encoder-based VC system to improve audio quality.
arXiv Detail & Related papers (2020-06-07T14:01:16Z)
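As a rough illustration of the auto-encoder-based disentanglement described in the VQVC+ entry above, the following PyTorch sketch quantizes the content encoder's output against a learned codebook (the discrete codes carry linguistic content) and uses the time-averaged quantization residual of a reference utterance as a speaker vector; the decoder reconstructs mel-features from both. Layer sizes, the residual-as-speaker heuristic, and all hyperparameters are assumptions for illustration, not the VQVC+ architecture itself.

```python
# Minimal sketch of auto-encoder VC with a vector-quantization bottleneck
# (in the spirit of VQVC-style models; sizes and details are assumed).
import torch
import torch.nn as nn

class VQBottleneckVC(nn.Module):
    def __init__(self, n_mels: int = 80, dim: int = 128, codebook_size: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2),
        )
        self.codebook = nn.Embedding(codebook_size, dim)
        self.decoder = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(dim, n_mels, kernel_size=5, padding=2),
        )

    def quantize(self, z):
        """Map each frame to its nearest codebook vector.
        Returns straight-through quantized codes and the quantization residual."""
        zt = z.transpose(1, 2)                               # (batch, frames, dim)
        dist = (zt.pow(2).sum(-1, keepdim=True)
                - 2 * zt @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(-1))       # (batch, frames, codes)
        idx = dist.argmin(dim=-1)
        q = self.codebook(idx).transpose(1, 2)               # (batch, dim, frames)
        return z + (q - z).detach(), z - q

    def forward(self, src_mel, ref_mel):
        # Content: discrete codes of the source utterance
        content, _ = self.quantize(self.encoder(src_mel))
        # Speaker: time-averaged quantization residual of the reference utterance
        _, residual = self.quantize(self.encoder(ref_mel))
        speaker = residual.mean(dim=2, keepdim=True)
        return self.decoder(content + speaker)

if __name__ == "__main__":
    model = VQBottleneckVC()
    src = torch.randn(1, 80, 120)   # source mel-spectrogram (content to keep)
    ref = torch.randn(1, 80, 150)   # reference mel-spectrogram (target speaker)
    print(model(src, ref).shape)    # torch.Size([1, 80, 120])
```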