SpeechSplit 2.0: Unsupervised Speech Disentanglement for Voice Conversion Without Tuning Autoencoder Bottlenecks
- URL: http://arxiv.org/abs/2203.14156v1
- Date: Sat, 26 Mar 2022 21:01:26 GMT
- Title: SpeechSplit 2.0: Unsupervised Speech Disentanglement for Voice Conversion Without Tuning Autoencoder Bottlenecks
- Authors: Chak Ho Chan, Kaizhi Qian, Yang Zhang, Mark Hasegawa-Johnson
- Abstract summary: SpeechSplit can perform aspect-specific voice conversion by disentangling speech into content, rhythm, pitch, and timbre using multiple autoencoders.
This paper proposes SpeechSplit 2.0, which constrains the information flow of each speech component to be disentangled at the autoencoder input.
- Score: 39.67320815230375
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: SpeechSplit can perform aspect-specific voice conversion by disentangling
speech into content, rhythm, pitch, and timbre using multiple autoencoders in
an unsupervised manner. However, SpeechSplit requires careful tuning of the
autoencoder bottlenecks, which can be time-consuming and less robust. This
paper proposes SpeechSplit 2.0, which constrains the information flow of each
speech component to be disentangled at the autoencoder input using efficient
signal processing methods instead of bottleneck tuning. Evaluation results show
that SpeechSplit 2.0 achieves performance comparable to SpeechSplit in speech
disentanglement and superior robustness to bottleneck size variations. Our
code is available at https://github.com/biggytruck/SpeechSplit2.
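To make the core idea concrete, here is a minimal sketch of one such signal-processing constraint: flattening the F0 contour of a waveform with the WORLD vocoder (via the pyworld package) so that the resulting encoder input carries essentially no pitch information, removing the need to squeeze pitch out through a carefully tuned bottleneck. The use of pyworld and the constant-pitch strategy are illustrative assumptions; the paper's exact transforms may differ.

```python
import numpy as np
import pyworld as pw  # WORLD vocoder bindings (illustrative choice, an assumption)

def flatten_pitch(wav: np.ndarray, fs: int) -> np.ndarray:
    """Resynthesize `wav` with a constant F0 so the output carries almost
    no pitch information while timing and spectral envelope survive."""
    x = wav.astype(np.float64)
    f0, t = pw.harvest(x, fs)          # estimate the F0 contour
    sp = pw.cheaptrick(x, f0, t, fs)   # smoothed spectral envelope
    ap = pw.d4c(x, f0, t, fs)          # aperiodicity
    voiced = f0 > 0
    mean_f0 = f0[voiced].mean() if voiced.any() else 0.0
    f0_flat = np.where(voiced, mean_f0, 0.0)  # constant pitch in voiced frames
    return pw.synthesize(f0_flat, sp, ap, fs)
```

An encoder fed `flatten_pitch(wav, fs)` simply has no pitch left to encode, so its bottleneck size no longer needs tuning to force pitch out.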
Related papers
- Efficient Streaming LLM for Speech Recognition [23.151980358518102]
SpeechLLM-XL is a linear scaling decoder-only model for streaming speech recognition.
It shows no quality degradation on long-form utterances 10x longer than the training utterances.
arXiv Detail & Related papers (2024-10-02T01:54:35Z)
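The summary above does not spell out how linear scaling is obtained; one common way to get it in a decoder-only streaming model is chunk-wise attention with a bounded left context, sketched below. The mask construction is an assumption for illustration, not SpeechLLM-XL's published recipe.

```python
import torch

def chunked_causal_mask(seq_len: int, chunk: int, left_chunks: int = 1) -> torch.Tensor:
    """Boolean attention mask (True = may attend). Each query attends
    causally within its own chunk plus `left_chunks` earlier chunks, so
    per-token cost is bounded and total cost grows linearly with sequence
    length instead of quadratically."""
    idx = torch.arange(seq_len)
    q_chunk = (idx // chunk).unsqueeze(1)   # chunk index of each query
    k_chunk = (idx // chunk).unsqueeze(0)   # chunk index of each key
    causal = idx.unsqueeze(1) >= idx.unsqueeze(0)
    return causal & (q_chunk - k_chunk <= left_chunks)

# e.g. pass as attn_mask to torch.nn.functional.scaled_dot_product_attention
mask = chunked_causal_mask(seq_len=12, chunk=4)
```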
- vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders [26.00129172101188]
We propose a new speech discrete token vocoder, vec2wav 2.0, which advances voice conversion (VC).
We use discrete tokens from speech self-supervised models as the content features of source speech, and treat VC as a prompted vocoding task.
We show vec2wav 2.0 achieves competitive cross-lingual VC even only trained on monolingual corpus.
arXiv Detail & Related papers (2024-09-03T15:41:07Z)
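A minimal sketch of the tokenization step the vec2wav 2.0 summary refers to: clustering self-supervised features into discrete content tokens. The k-means tokenizer and the stand-in features below are assumptions; vec2wav 2.0's actual SSL model, tokenizer, and prompted vocoder are not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for frame-level features from a self-supervised speech model
# (e.g. hidden states of a HuBERT-like encoder; extraction not shown).
train_feats = rng.normal(size=(5_000, 768))

# Single init keeps the sketch fast; real tokenizers train on much more data.
kmeans = KMeans(n_clusters=200, n_init=1, random_state=0).fit(train_feats)

def content_tokens(feats: np.ndarray) -> np.ndarray:
    """Map SSL feature frames to discrete token IDs. The token sequence
    keeps mostly linguistic content; a prompted vocoder can then re-voice
    it using a short target-speaker prompt."""
    return kmeans.predict(feats)

tokens = content_tokens(rng.normal(size=(120, 768)))  # 120 frames -> 120 IDs
```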
- TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation [54.155138561698514]
Direct speech-to-speech translation achieves high-quality results through the introduction of discrete units obtained from self-supervised learning.
Existing methods invariably rely on cascading, synthesizing via both audio and text, resulting in delays and cascading errors.
We propose a model for talking head translation, TransFace, which can directly translate audio-visual speech into audio-visual speech in other languages.
arXiv Detail & Related papers (2023-12-23T08:45:57Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio system that can efficiently compress speech, music and general audio.
SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end.
We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
arXiv Detail & Related papers (2021-07-07T15:45:42Z)
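The residual vector quantizer named in the SoundStream summary is easy to sketch: each stage quantizes whatever the previous stages left unexplained. The toy codebooks below are random; in SoundStream they are learned jointly with the encoder and decoder.

```python
import numpy as np

def rvq_encode(x: np.ndarray, codebooks: list) -> tuple:
    """Residual vector quantization: stage i quantizes the residual left
    by stages 0..i-1, and the sum of chosen codewords approximates x.
    More stages mean a higher bitrate and a lower reconstruction error."""
    codes, quantized = [], np.zeros_like(x)
    for cb in codebooks:                                   # cb: (K, d)
        residual = x - quantized
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)                         # nearest codeword
        codes.append(idx)
        quantized = quantized + cb[idx]
    return np.stack(codes), quantized

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 8)) for _ in range(4)]  # 4 toy stages
x = rng.normal(size=(10, 8))                               # 10 latent frames
codes, x_hat = rvq_encode(x, codebooks)
print(codes.shape, float(((x - x_hat) ** 2).mean()))       # (4, 10), shrinking MSE
```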
- NVC-Net: End-to-End Adversarial Voice Conversion [7.14505983271756]
NVC-Net is an end-to-end adversarial network that performs voice conversion directly on the raw audio waveform of arbitrary length.
Our model is capable of producing samples at a rate of more than 3600 kHz on an NVIDIA V100 GPU, being orders of magnitude faster than state-of-the-art methods.
arXiv Detail & Related papers (2021-06-02T07:19:58Z)
- VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net Architecture [71.45920122349628]
Auto-encoder-based VC methods disentangle the speaker and the content in input speech without being given the speaker's identity.
We use the U-Net architecture within an auto-encoder-based VC system to improve audio quality.
arXiv Detail & Related papers (2020-06-07T14:01:16Z)
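A sketch of the vector-quantization split the VQVC family relies on: the nearest-codeword sequence approximates content, and the time-averaged quantization residual approximates the speaker. The codebook here is random and the decoder is only hinted at in a comment; VQVC+ additionally adds a U-Net decoder for audio quality.

```python
import torch

def vq_split(z: torch.Tensor, codebook: torch.Tensor):
    """z: (T, d) encoder outputs; codebook: (K, d) codewords (learned in
    practice, random here). Quantized codes approximate content; the
    global residual approximates the speaker, since the codewords are
    shared across speakers."""
    nearest = torch.cdist(z, codebook).argmin(dim=1)  # (T,) codeword IDs
    content = codebook[nearest]                       # (T, d) content codes
    speaker = (z - content).mean(dim=0)               # (d,) speaker embedding
    return content, speaker

z_src = torch.randn(120, 64)        # source-utterance encoder outputs (toy)
z_tgt = torch.randn(90, 64)         # target-speaker encoder outputs (toy)
codebook = torch.randn(128, 64)
c_src, _ = vq_split(z_src, codebook)
_, s_tgt = vq_split(z_tgt, codebook)
converted = c_src + s_tgt  # decoder input: source content, target voice
```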
- Unsupervised Speech Decomposition via Triple Information Bottleneck [63.55007056410914]
Speech information can be roughly decomposed into four components: language content, timbre, pitch, and rhythm.
We propose SpeechSplit, which can blindly decompose speech into its four components by introducing three carefully designed information bottlenecks.
arXiv Detail & Related papers (2020-04-23T16:12:42Z)
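For contrast with SpeechSplit 2.0 above, here is a minimal sketch of the original SpeechSplit layout whose bottlenecks need tuning: three deliberately narrow encoders for content, rhythm, and pitch, with timbre supplied separately as a speaker embedding. All layer types and widths below are illustrative assumptions, not the paper's exact values.

```python
import torch
import torch.nn as nn

class TripleBottleneckEncoder(nn.Module):
    """Three encoders whose narrow output widths act as information
    bottlenecks. Those widths are hyperparameters that SpeechSplit must
    tune carefully per component, which is exactly the burden that
    SpeechSplit 2.0's input-side signal processing removes."""
    def __init__(self, n_mels=80, d_content=8, d_rhythm=2, d_pitch=32):
        super().__init__()
        self.content = nn.GRU(n_mels, d_content, batch_first=True)
        self.rhythm = nn.GRU(n_mels, d_rhythm, batch_first=True)
        self.pitch = nn.GRU(257, d_pitch, batch_first=True)  # one-hot F0 frames

    def forward(self, mel, f0_onehot):
        c, _ = self.content(mel)      # content codes, width d_content
        r, _ = self.rhythm(mel)       # rhythm codes, width d_rhythm
        p, _ = self.pitch(f0_onehot)  # pitch codes, width d_pitch
        return c, r, p

enc = TripleBottleneckEncoder()
mel = torch.randn(1, 128, 80)         # (batch, frames, mel bins)
f0 = torch.randn(1, 128, 257)         # stand-in for quantized F0 frames
c, r, p = enc(mel, f0)
```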
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.