Vocoder-free End-to-End Voice Conversion with Transformer Network
- URL: http://arxiv.org/abs/2002.03808v1
- Date: Wed, 5 Feb 2020 06:19:24 GMT
- Title: Vocoder-free End-to-End Voice Conversion with Transformer Network
- Authors: June-Woo Kim, Ho-Young Jung, Minho Lee
- Abstract summary: Mel-frequency filter bank (MFB) based approaches have the advantage of easier speech learning compared to the raw spectrum, since MFB features have a smaller size.
It is possible to use only the raw spectrum along with the phase to generate different styles of voices with clear pronunciation.
In this paper, we introduce a vocoder-free end-to-end voice conversion method using a transformer network.
- Score: 5.5792083698526405
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mel-frequency filter bank (MFB) based approaches have the advantage of
easier speech learning compared to the raw spectrum, since MFB features have a
smaller size. However, speech generators built on MFB features require an
additional vocoder, whose training incurs a large computational expense. Such
additional pre/post-processing (MFB extraction and vocoding) is not essential
for converting real human speech: the raw spectrum together with its phase is
sufficient to generate voices in different styles with clear pronunciation. In
this regard, we propose a fast and effective approach that converts realistic
voices from the raw spectrum in a parallel manner. Our transformer-based model
architecture, which contains no CNN or RNN layers, learns quickly and avoids
the sequential-computation limitation of conventional RNNs. In this paper, we
introduce a vocoder-free end-to-end voice conversion method using a transformer
network. The presented conversion model can also be used for speaker adaptation
in speech recognition. Our approach converts the source voice to a target voice
without using an MFB representation or a vocoder, and an adapted MFB for speech
recognition can be obtained by multiplying the converted magnitude with the
phase. We evaluate our voice conversion method on the TIDIGITS dataset with
mean opinion scores for naturalness, similarity, and clarity.
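The abstract's vocoder-free step (combine a converted magnitude spectrum with the source phase, then project through a mel filter bank to get adapted MFB features for ASR) can be sketched in plain numpy. This is a hypothetical illustration, not the paper's implementation: the parameter values (`n_fft`, `sr`, `n_mels`) and the simple triangular-filter construction are assumptions.

```python
import numpy as np

def mel_filter_bank(sr, n_fft, n_mels):
    """Build a simple triangular mel filter bank of shape (n_mels, n_fft//2 + 1)."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Filter center frequencies, equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)

    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

n_fft, sr, n_mels, frames = 512, 8000, 40, 10
rng = np.random.default_rng(0)

# Stand-ins for the model's converted magnitude and the source-utterance phase.
converted_mag = np.abs(rng.standard_normal((n_fft // 2 + 1, frames)))
source_phase = rng.uniform(-np.pi, np.pi, (n_fft // 2 + 1, frames))

# Complex spectrum = magnitude * e^{j*phase}; an inverse STFT of this would
# yield the converted waveform directly, with no neural vocoder involved.
complex_spec = converted_mag * np.exp(1j * source_phase)

# Adapted MFB features for speech recognition: mel filter bank applied to the
# magnitude of the recombined spectrum (log-compressed for stability).
fb = mel_filter_bank(sr, n_fft, n_mels)
adapted_mfb = np.log(fb @ np.abs(complex_spec) + 1e-8)
print(adapted_mfb.shape)  # (40, 10): n_mels x frames
```

In a real pipeline the magnitude would come from the conversion model and the phase from an STFT of the input utterance; the point of the sketch is only that no separately trained vocoder sits between the spectrum and the output features.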
Related papers
- Voice Conversion for Stuttered Speech, Instruments, Unseen Languages and Textually Described Voices [28.998590651956153]
We look at four non-standard applications: stuttered voice conversion, cross-lingual voice conversion, musical instrument conversion, and text-to-voice conversion.
We find that kNN-VC retains high performance in stuttered and cross-lingual voice conversion.
Results are more mixed for the musical instrument and text-to-voice conversion tasks.
arXiv Detail & Related papers (2023-10-12T08:00:25Z)
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion [19.74933410443264]
We present an unsupervised many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2.
Our model is trained only with 20 English speakers.
It generalizes to a variety of voice conversion tasks, such as any-to-many, cross-lingual, and singing conversion.
arXiv Detail & Related papers (2021-07-21T23:44:17Z)
- End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented transformer (Conformer).
In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
arXiv Detail & Related papers (2021-02-12T18:00:08Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
- End-to-End Whisper to Natural Speech Conversion using Modified Transformer Network [0.8399688944263843]
We introduce whisper-to-natural-speech conversion using a sequence-to-sequence approach.
We investigate different features like Mel frequency cepstral coefficients and smoothed spectral features.
The proposed networks are trained end-to-end using a supervised approach for feature-to-feature transformation.
arXiv Detail & Related papers (2020-04-20T14:47:46Z)
- Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss [14.755108017449295]
We present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system.
Transformer computation blocks based on self-attention are used to encode both audio and label sequences independently.
We present results on the LibriSpeech dataset showing that limiting the left context for self-attention makes decoding computationally tractable for streaming.
arXiv Detail & Related papers (2020-02-07T00:04:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.