Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling
- URL: http://arxiv.org/abs/2009.02725v3
- Date: Sun, 23 May 2021 09:14:05 GMT
- Title: Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling
- Authors: Songxiang Liu, Yuewen Cao, Disong Wang, Xixin Wu, Xunying Liu, Helen Meng
- Abstract summary: This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
- Score: 61.351967629600594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes an any-to-many location-relative, sequence-to-sequence
(seq2seq), non-parallel voice conversion approach, which utilizes text
supervision during training. In this approach, we combine a bottle-neck feature
extractor (BNE) with a seq2seq synthesis module. During the training stage, an
encoder-decoder-based hybrid connectionist-temporal-classification-attention
(CTC-attention) phoneme recognizer is trained, whose encoder has a bottle-neck
layer. A BNE is obtained from the phoneme recognizer and is utilized to extract
speaker-independent, dense and rich spoken content representations from
spectral features. Then a multi-speaker, location-relative attention-based
seq2seq synthesis model is trained to reconstruct spectral features from the
bottle-neck features, conditioning on speaker representations for speaker
identity control in the generated speech. To mitigate the difficulties of using
seq2seq models to align long sequences, we down-sample the input spectral
features along the temporal dimension and equip the synthesis model with a
discretized mixture of logistic (MoL) attention mechanism. Since the phoneme
recognizer is trained on a large speech recognition corpus, the proposed
approach can conduct any-to-many voice conversion. Objective and subjective
evaluations show that the proposed any-to-many approach has superior voice
conversion performance in terms of both naturalness and speaker similarity.
Ablation studies are conducted to confirm the effectiveness of feature
selection and model design strategies in the proposed approach. The proposed VC
approach can readily be extended to support any-to-any VC (also known as
one/few-shot VC), achieving high performance according to objective and
subjective evaluations.
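As a concrete illustration of the alignment device named in the abstract, the sketch below shows one way a discretized mixture-of-logistics (MoL) attention step can be written in PyTorch. It is a minimal sketch under stated assumptions: the class name, layer sizes, and the softplus parameterization of the forward shift are illustrative choices, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoLAttention(nn.Module):
    """Location-relative attention with a discretized mixture of logistics.

    Sketch only: the parameterization and sizes are assumptions, not the
    paper's released code.
    """

    def __init__(self, query_dim: int, n_mix: int = 5):
        super().__init__()
        self.n_mix = n_mix
        # Predict per-component forward shift, log-scale, and mixture logit
        # from the current decoder state.
        self.proj = nn.Sequential(
            nn.Linear(query_dim, 128), nn.Tanh(), nn.Linear(128, 3 * n_mix)
        )

    def forward(self, query, memory, prev_mu):
        # query:   (B, query_dim)  current decoder state
        # memory:  (B, T, D)       encoder outputs (bottle-neck features)
        # prev_mu: (B, n_mix)      component means from the previous step
        B, T, _ = memory.shape
        delta, log_scale, logit_w = self.proj(query).chunk(3, dim=-1)

        # Means can only move forward -> monotonic, location-relative alignment.
        mu = prev_mu + F.softplus(delta)                      # (B, n_mix)
        scale = log_scale.exp().clamp(min=1e-3)               # (B, n_mix)
        w = F.softmax(logit_w, dim=-1)                        # (B, n_mix)

        # Discretize each logistic: the mass it assigns to the unit interval
        # around memory index j becomes the attention weight at j.
        j = torch.arange(T, device=memory.device, dtype=memory.dtype)
        j = j.view(1, T, 1)
        mu_, s_ = mu.unsqueeze(1), scale.unsqueeze(1)         # (B, 1, n_mix)
        cdf_hi = torch.sigmoid((j + 0.5 - mu_) / s_)
        cdf_lo = torch.sigmoid((j - 0.5 - mu_) / s_)
        align = ((cdf_hi - cdf_lo) * w.unsqueeze(1)).sum(-1)  # (B, T)

        context = torch.bmm(align.unsqueeze(1), memory).squeeze(1)  # (B, D)
        return context, align, mu
```

In use, prev_mu would start at zeros for each utterance and the returned mu would be fed back at the next decoder step, so the attention window can only advance through the down-sampled bottle-neck features; this forward-only movement is what stabilizes alignment over long sequences.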
Related papers
- Takin-VC: Zero-shot Voice Conversion via Jointly Hybrid Content and Memory-Augmented Context-Aware Timbre Modeling [14.98368067290024]
Takin-VC is a novel zero-shot VC framework based on jointly hybrid content and memory-augmented context-aware timbre modeling.
Experimental results demonstrate that the proposed Takin-VC method surpasses state-of-the-art zero-shot VC systems.
arXiv Detail & Related papers (2024-10-02T09:07:33Z)
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- Non-autoregressive real-time Accent Conversion model with voice cloning [0.0]
We have developed a non-autoregressive model for real-time accent conversion with voice cloning.
The model generates native-sounding L1 speech with minimal latency based on input L2 speech.
The model has the ability to save, clone and change the timbre, gender and accent of the speaker's voice in real time.
arXiv Detail & Related papers (2024-05-21T19:07:26Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Streaming end-to-end multi-talker speech recognition [34.76106500736099]
We propose the Streaming Unmixing and Recognition Transducer (SURT) for end-to-end multi-talker speech recognition.
Our model employs the Recurrent Neural Network Transducer (RNN-T) as the backbone that can meet various latency constraints.
Based on experiments on the publicly available LibriSpeechMix dataset, we show that HEAT (heuristic error assignment training) achieves better accuracy than PIT (permutation invariant training).
arXiv Detail & Related papers (2020-11-26T06:28:04Z)
- FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention [66.77490220410249]
We propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0.
FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance.
This approach is trained with reconstruction loss only without any disentanglement considerations between content and speaker information.
arXiv Detail & Related papers (2020-10-27T09:21:03Z)
- Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations [49.55361944105796]
We present a novel approach to any-to-one (A2O) voice conversion (VC) in a sequence-to-sequence framework.
A2O VC aims to convert any speaker, including those unseen during training, to a fixed target speaker.
arXiv Detail & Related papers (2020-10-23T08:34:52Z)
- NAUTILUS: a Versatile Voice Cloning System [44.700803634034486]
NAUTILUS can generate speech with a target voice either from a text input or a reference utterance of an arbitrary source speaker.
It can clone unseen voices using untranscribed speech of target speakers on the basis of the backpropagation algorithm.
It achieves comparable quality with state-of-the-art TTS and VC systems when cloning with just five minutes of untranscribed speech.
arXiv Detail & Related papers (2020-05-22T05:00:20Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.