Many-to-Many Voice Transformer Network
- URL: http://arxiv.org/abs/2005.08445v4
- Date: Fri, 6 Nov 2020 22:46:52 GMT
- Title: Many-to-Many Voice Transformer Network
- Authors: Hirokazu Kameoka, Wen-Chin Huang, Kou Tanaka, Takuhiro Kaneko,
Nobukatsu Hojo, and Tomoki Toda
- Abstract summary: This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
- Score: 55.17770019619078
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a voice conversion (VC) method based on a
sequence-to-sequence (S2S) learning framework, which enables simultaneous
conversion of the voice characteristics, pitch contour, and duration of input
speech. We previously proposed an S2S-based VC method using a transformer
network architecture called the voice transformer network (VTN). The original
VTN was designed to learn only a mapping of speech feature sequences from one
speaker to another. The main idea we propose is an extension of the original
VTN that can simultaneously learn mappings among multiple speakers. This
extension, called the many-to-many VTN, makes it possible to fully use available
training data collected from multiple speakers by capturing common latent
features that can be shared across different speakers. It also allows us to
introduce a training loss called the identity mapping loss to ensure that the
input feature sequence will remain unchanged when the source and target speaker
indices are the same. Using this particular loss for model training proved
extremely effective in improving the performance of the model at test time. In
speaker identity conversion experiments, our model achieved higher sound
quality and speaker similarity than baseline methods. We also found that, with
a slight modification to its architecture, our model could handle any-to-many
conversion tasks reasonably well.
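To make the identity mapping loss concrete, here is a minimal PyTorch-style sketch. The model interface, the `lambda_id` weight, and the use of an L1 distance are all assumptions for illustration, not the authors' actual implementation:

```python
import torch.nn.functional as F

def identity_mapping_loss(model, feats, speaker_idx):
    """Sketch: with source == target speaker, the converted sequence
    should reproduce the input feature sequence."""
    # Hypothetical interface: model(feats, src_speaker=..., tgt_speaker=...).
    reconstructed = model(feats, src_speaker=speaker_idx, tgt_speaker=speaker_idx)
    # L1 distance is an assumption; the paper may use a different metric.
    return F.l1_loss(reconstructed, feats)

# During many-to-many training this term would be added to the usual
# S2S conversion loss, e.g.:
#   loss = conversion_loss + lambda_id * identity_mapping_loss(model, x, s)
```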
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
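As a rough illustration of the frame-level joint-space idea in VQ-CTAP above, here is a hedged InfoNCE-style sketch; the exact objective, temperature, and pairing scheme are assumptions, not the paper's actual loss:

```python
import torch
import torch.nn.functional as F

def frame_contrastive_loss(text_emb, speech_emb, temperature=0.1):
    """Sketch: pull time-aligned text/speech frame embeddings together in a
    joint space and push mismatched frames apart.
    text_emb, speech_emb: (num_frames, dim); row i of each is a positive pair."""
    text_emb = F.normalize(text_emb, dim=-1)
    speech_emb = F.normalize(speech_emb, dim=-1)
    logits = text_emb @ speech_emb.T / temperature   # all-pairs similarities
    targets = torch.arange(text_emb.size(0))         # positives on the diagonal
    # Symmetric cross-entropy over both matching directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))
```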
- VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion [77.50171525265056]
This paper proposes a novel multi-speaker Video-to-Speech (VTS) system based on cross-modal knowledge transfer from voice conversion (VC).
The Lip2Ind network can substitute the content encoder of VC to form a multi-speaker VTS system to convert silent video to acoustic units for reconstructing accurate spoken content.
arXiv Detail & Related papers (2022-02-18T08:58:45Z)
- Training Robust Zero-Shot Voice Conversion Models with Self-supervised Features [24.182732872327183]
Unsupervised zero-shot voice conversion (VC) aims to modify the speaker characteristic of an utterance to match an unseen target speaker.
We show that high-quality audio samples can be achieved by using a length resampling decoder.
arXiv Detail & Related papers (2021-12-08T17:27:39Z)
- StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion [19.74933410443264]
We present an unsupervised many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2.
Our model is trained only with 20 English speakers.
It generalizes to a variety of voice conversion tasks, such as any-to-many, cross-lingual, and singing conversion.
arXiv Detail & Related papers (2021-07-21T23:44:17Z)
- An Adaptive Learning based Generative Adversarial Network for One-To-One Voice Conversion [9.703390665821463]
We propose an adaptive learning-based GAN model called ALGAN-VC for an efficient one-to-one VC of speakers.
The model is tested on Voice Conversion Challenge (VCC) 2016, 2018, and 2020 datasets as well as on our self-prepared speech dataset.
A subjective and objective evaluation of the generated speech samples indicated that the proposed model performed the voice conversion task effectively.
arXiv Detail & Related papers (2021-04-25T13:44:32Z)
- Streaming end-to-end multi-talker speech recognition [34.76106500736099]
We propose the Streaming Unmixing and Recognition Transducer (SURT) for end-to-end multi-talker speech recognition.
Our model employs the Recurrent Neural Network Transducer (RNN-T) as the backbone that can meet various latency constraints.
Based on experiments on the publicly available LibriSpeechMix dataset, we show that heuristic error assignment training (HEAT) achieves better accuracy than permutation invariant training (PIT).
arXiv Detail & Related papers (2020-11-26T06:28:04Z)
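For context on the HEAT-versus-PIT comparison in the SURT entry above, the sketch below shows toy permutation invariant training; the HEAT comment describes the heuristic idea only, and none of this is the SURT implementation:

```python
from itertools import permutations
import torch

def pit_loss(outputs, references, pair_loss):
    """Toy permutation invariant training (PIT): score every assignment of
    model output channels to reference sequences and keep the cheapest."""
    totals = [
        sum(pair_loss(outputs[i], references[p]) for i, p in enumerate(perm))
        for perm in permutations(range(len(references)))
    ]
    return torch.stack(totals).min()

# A HEAT-style alternative fixes the channel assignment up front with a
# heuristic (e.g., ordering reference utterances by start time), avoiding
# the factorial cost of scanning all permutations.
```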
- Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations [49.55361944105796]
We present a novel approach to any-to-one (A2O) voice conversion (VC) in a sequence-to-sequence framework.
A2O VC aims to convert any speaker, including those unseen during training, to a fixed target speaker.
arXiv Detail & Related papers (2020-10-23T08:34:52Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
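The two-stage design in the any-to-many entry above can be caricatured as a recognition-synthesis pipeline: the bottleneck extractor strips speaker identity, and the seq2seq module re-renders the content as a chosen target speaker. The module names and interfaces below are invented for illustration:

```python
import torch.nn as nn

class RecognitionSynthesisVC(nn.Module):
    """Sketch of a BNE + seq2seq pipeline; all names are hypothetical."""
    def __init__(self, bne, synthesizer, speaker_table):
        super().__init__()
        self.bne = bne                      # bottleneck feature extractor
        self.synthesizer = synthesizer      # seq2seq synthesis module
        self.speaker_table = speaker_table  # nn.Embedding of target speakers

    def forward(self, source_mel, target_speaker_id):
        content = self.bne(source_mel)                # speaker-agnostic features
        speaker = self.speaker_table(target_speaker_id)
        return self.synthesizer(content, speaker)     # converted features
```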
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
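The transfer recipe above amounts to warm-starting a seq2seq VC model from TTS/ASR checkpoints before fine-tuning. A hedged sketch; the checkpoint layout and the encoder/decoder attribute names are assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

def init_from_pretrained(vc_model: nn.Module,
                         asr_ckpt_path: str,
                         tts_ckpt_path: str) -> nn.Module:
    """Sketch: initialize a seq2seq VC model from pretrained checkpoints.
    Assumes each checkpoint stores 'encoder'/'decoder' state dicts."""
    asr_state = torch.load(asr_ckpt_path)
    tts_state = torch.load(tts_ckpt_path)
    # ASR encoders learn robust linguistic representations of speech ...
    vc_model.encoder.load_state_dict(asr_state["encoder"], strict=False)
    # ... while TTS decoders learn to render intelligible speech features.
    vc_model.decoder.load_state_dict(tts_state["decoder"], strict=False)
    return vc_model  # then fine-tune end-to-end on the VC data
```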
- Many-to-Many Voice Conversion using Conditional Cycle-Consistent Adversarial Networks [3.1317409221921144]
We extend the CycleGAN by conditioning the network on speakers.
The proposed method can perform many-to-many voice conversion among multiple speakers using a single generative adversarial network (GAN).
Compared to building multiple CycleGANs for each pair of speakers, the proposed method reduces the computational and spatial cost significantly without compromising the sound quality of the converted voice.
arXiv Detail & Related papers (2020-02-15T06:03:36Z)
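The conditioning idea in the last entry can be sketched as concatenating source and target speaker embeddings to every input frame, so a single generator serves all speaker pairs; the layer sizes and structure below are illustrative, not the paper's network:

```python
import torch
import torch.nn as nn

class SpeakerConditionedGenerator(nn.Module):
    """Toy speaker-conditioned generator: one network covers all speaker
    pairs by conditioning on source/target speaker embeddings."""
    def __init__(self, feat_dim=80, spk_dim=16, num_speakers=10, hidden=256):
        super().__init__()
        self.spk_emb = nn.Embedding(num_speakers, spk_dim)
        self.net = nn.Sequential(
            nn.Linear(feat_dim + 2 * spk_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, feats, src_id, tgt_id):
        # feats: (batch, time, feat_dim); broadcast speaker codes over time.
        batch, time, _ = feats.shape
        cond = torch.cat([self.spk_emb(src_id), self.spk_emb(tgt_id)], dim=-1)
        cond = cond.unsqueeze(1).expand(batch, time, cond.size(-1))
        return self.net(torch.cat([feats, cond], dim=-1))
```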
This list is automatically generated from the titles and abstracts of the papers in this site.