Unsupervised Speech Decomposition via Triple Information Bottleneck
- URL: http://arxiv.org/abs/2004.11284v6
- Date: Sat, 13 Mar 2021 15:31:35 GMT
- Title: Unsupervised Speech Decomposition via Triple Information Bottleneck
- Authors: Kaizhi Qian, Yang Zhang, Shiyu Chang, David Cox, Mark Hasegawa-Johnson
- Abstract summary: Speech information can be roughly decomposed into four components: language content, timbre, pitch, and rhythm.
We propose SpeechSplit, which can blindly decompose speech into its four components by introducing three carefully designed information bottlenecks.
- Score: 63.55007056410914
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech information can be roughly decomposed into four components: language
content, timbre, pitch, and rhythm. Obtaining disentangled representations of
these components is useful in many speech analysis and generation applications.
Recently, state-of-the-art voice conversion systems have led to speech
representations that can disentangle speaker-dependent and independent
information. However, these systems can only disentangle timbre, while
information about pitch, rhythm and content is still mixed together. Further
disentangling the remaining speech components is an under-determined problem in
the absence of explicit annotations for each component, which are difficult and
expensive to obtain. In this paper, we propose SpeechSplit, which can blindly
decompose speech into its four components by introducing three carefully
designed information bottlenecks. SpeechSplit is among the first algorithms
that can separately perform style transfer on timbre, pitch and rhythm without
text labels. Our code is publicly available at
https://github.com/auspicious3000/SpeechSplit.
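For a concrete picture of how three information bottlenecks can separate four components, the snippet below gives a minimal PyTorch sketch: three deliberately narrow encoders (content, rhythm, pitch) plus a speaker embedding for timbre feed a single decoder, so each encoder can only retain the factor that its input and tiny capacity force it to keep. This is an illustrative simplification, not the released SpeechSplit code; the layer choices, bottleneck widths, and the preprocessing hinted at in the comments are assumptions.

```python
import torch
import torch.nn as nn

class BottleneckEncoder(nn.Module):
    """A deliberately narrow BLSTM; its small hidden size acts as the information bottleneck."""
    def __init__(self, in_dim: int, bottleneck_dim: int):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, bottleneck_dim, batch_first=True, bidirectional=True)

    def forward(self, x):                          # x: (batch, time, in_dim)
        out, _ = self.rnn(x)
        return out                                 # (batch, time, 2 * bottleneck_dim)


class TripleBottleneckAutoencoder(nn.Module):
    """Three bottlenecked encoders (content, rhythm, pitch) plus a timbre embedding feed one decoder."""
    def __init__(self, n_mels: int = 80, f0_dim: int = 257, spk_dim: int = 256):
        super().__init__()
        # Bottleneck widths below are illustrative guesses, not the paper's tuned values.
        self.content_enc = BottleneckEncoder(n_mels, 8)    # fed a randomly resampled, pitch-normalized mel
        self.rhythm_enc  = BottleneckEncoder(n_mels, 1)    # fed the original mel-spectrogram
        self.pitch_enc   = BottleneckEncoder(f0_dim, 32)   # fed a speaker-normalized, quantized F0 contour
        dec_in = 2 * (8 + 1 + 32) + spk_dim
        self.decoder  = nn.LSTM(dec_in, 512, batch_first=True)
        self.out_proj = nn.Linear(512, n_mels)

    def forward(self, mel_resampled, mel, f0_onehot, spk_emb):
        # For simplicity, mel_resampled and f0_onehot are assumed to be stretched
        # back to the same number of frames as mel before being passed in.
        t = mel.size(1)
        z = torch.cat([
            self.content_enc(mel_resampled),                # language content
            self.rhythm_enc(mel),                           # rhythm
            self.pitch_enc(f0_onehot),                      # pitch contour
            spk_emb.unsqueeze(1).expand(-1, t, -1),         # timbre, broadcast over time
        ], dim=-1)
        h, _ = self.decoder(z)
        return self.out_proj(h)                             # reconstructed mel-spectrogram
```

At conversion time, only the input tied to one encoder (or the speaker embedding) is swapped for the target utterance's, which is what allows style transfer on a single aspect (timbre, pitch, or rhythm) at a time.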
Related papers
- DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage [7.096838107088313]
DisfluencySpeech is a studio-quality labeled English speech dataset with paralanguage.
A single speaker recreates nearly 10 hours of expressive utterances from the Switchboard-1 Telephone Speech Corpus (Switchboard).
arXiv Detail & Related papers (2024-06-13T05:23:22Z)
- Removing Speaker Information from Speech Representation using Variable-Length Soft Pooling [16.73336092521471]
This paper aims to remove speaker information by exploiting the structured nature of speech.
A neural network predicts these boundaries, enabling variable-length pooling for event-based representation extraction (a minimal sketch of this pooling step appears after the related-papers list).
To confirm that the learned representation includes contents information but is independent of speaker information, the model was evaluated with libri-light's phonetic ABX task and SUPERB's speaker identification task.
arXiv Detail & Related papers (2024-04-01T01:49:09Z)
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
- Automatic Speech Disentanglement for Voice Conversion using Rank Module and Speech Augmentation [4.961389445237138]
Voice Conversion (VC) converts the voice of a source utterance to that of a target speaker while preserving the source's linguistic content.
We propose a VC model that can automatically disentangle speech into four components using only two augmentation functions.
arXiv Detail & Related papers (2023-06-21T13:28:06Z)
- UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion [63.346825713704625]
Text-to-speech (TTS) and voice conversion (VC) are two different tasks that aim to generate high-quality speech from different input modalities.
This paper proposes UnifySpeech, which brings TTS and VC into a unified framework for the first time.
arXiv Detail & Related papers (2023-01-10T06:06:57Z)
- A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z)
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z)
- SpeechSplit 2.0: Unsupervised Speech Disentanglement for Voice Conversion Without Tuning Autoencoder Bottlenecks [39.67320815230375]
SpeechSplit can perform aspect-specific voice conversion by disentangling speech into content, rhythm, pitch, and timbre using multiple autoencoders.
This paper proposes SpeechSplit 2.0, which constrains the information flow of the component to be disentangled at the autoencoder input, removing the need to tune the bottleneck sizes.
arXiv Detail & Related papers (2022-03-26T21:01:26Z)
- Textless Speech Emotion Conversion using Decomposed and Discrete Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
arXiv Detail & Related papers (2021-11-14T18:16:42Z)
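As a rough illustration of the boundary-driven pooling mentioned in the "Removing Speaker Information from Speech Representation using Variable-Length Soft Pooling" entry above, the toy function below mean-pools frame-level features between predicted boundaries. It is an assumption-laden sketch, not that paper's code: the paper learns a differentiable soft pooling, whereas this shows a hard-threshold analogue for readability.

```python
import torch

def boundary_pooling(frames: torch.Tensor, boundary_prob: torch.Tensor,
                     threshold: float = 0.5) -> torch.Tensor:
    """Mean-pool frame features between predicted boundaries.

    frames:        (time, dim) frame-level representations.
    boundary_prob: (time,) predicted probability that a segment ends at each frame.
    Returns a (num_segments, dim) tensor of event-level representations.
    """
    pooled, start = [], 0
    for t in range(frames.size(0)):
        # Close a segment at a predicted boundary, or at the final frame.
        if boundary_prob[t] > threshold or t == frames.size(0) - 1:
            pooled.append(frames[start:t + 1].mean(dim=0))
            start = t + 1
    return torch.stack(pooled)
```

Replacing the hard threshold with a differentiable weighting over frames is what would make such pooling "soft" and trainable end to end.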
This list is automatically generated from the titles and abstracts of the papers on this site.