Global Rhythm Style Transfer Without Text Transcriptions
- URL: http://arxiv.org/abs/2106.08519v1
- Date: Wed, 16 Jun 2021 02:21:00 GMT
- Title: Global Rhythm Style Transfer Without Text Transcriptions
- Authors: Kaizhi Qian, Yang Zhang, Shiyu Chang, Jinjun Xiong, Chuang Gan, David
Cox, Mark Hasegawa-Johnson
- Abstract summary: Prosody plays an important role in characterizing the style of a speaker or an emotion.
Most non-parallel voice or emotion style transfer algorithms do not convert any prosody information.
We propose AutoPST, which can disentangle global prosody style from speech without relying on any text transcriptions.
- Score: 98.09972075975976
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prosody plays an important role in characterizing the style of a speaker or
an emotion, but most non-parallel voice or emotion style transfer algorithms do
not convert any prosody information. Two major components of prosody are pitch
and rhythm. Disentangling the prosody information, particularly the rhythm
component, from the speech is challenging because it involves breaking the
synchrony between the input speech and the disentangled speech representation.
As a result, most existing prosody style transfer algorithms would need to rely
on some form of text transcriptions to identify the content information, which
confines their application to high-resource languages only. Recently,
SpeechSplit has made sizeable progress towards unsupervised prosody style
transfer, but it is unable to extract high-level global prosody style in an
unsupervised manner. In this paper, we propose AutoPST, which can disentangle
global prosody style from speech without relying on any text transcriptions.
AutoPST is an Autoencoder-based Prosody Style Transfer framework with a
thorough rhythm removal module guided by self-expressive representation
learning. Experiments on different style transfer tasks show that AutoPST can
effectively convert prosody that correctly reflects the styles of the target
domains.
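The core difficulty named above, breaking the synchrony between the input speech and its representation, can be illustrated with similarity-based downsampling: runs of near-identical frames are collapsed into single segments, so the output length no longer encodes the input speaking rate. The sketch below shows that general idea only; the function name, threshold, and averaging rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of rhythm removal via similarity-based downsampling.
# Illustrative only: consecutive frames that are highly self-similar are
# merged, so the output length stops tracking the input speaking rate.
import numpy as np

def downsample_by_self_similarity(frames: np.ndarray, threshold: float = 0.95) -> np.ndarray:
    """Merge runs of consecutive frames (T, D) whose cosine similarity to the
    running segment mean exceeds `threshold`. Returns (T', D) with T' <= T."""
    segments, current = [], [frames[0]]
    for f in frames[1:]:
        anchor = np.mean(current, axis=0)
        cos = float(f @ anchor) / (np.linalg.norm(f) * np.linalg.norm(anchor) + 1e-8)
        if cos >= threshold:
            current.append(f)                          # same segment: absorb the frame
        else:
            segments.append(np.mean(current, axis=0))  # close the segment
            current = [f]
    segments.append(np.mean(current, axis=0))
    return np.stack(segments)

# A slowly held sound (many near-identical frames) collapses to one segment,
# while rapidly changing frames are kept, removing rhythm but not content.
rng = np.random.default_rng(0)
steady = np.tile(rng.normal(size=(1, 16)), (20, 1)) + 0.01 * rng.normal(size=(20, 16))
changing = rng.normal(size=(5, 16))
print(downsample_by_self_similarity(np.concatenate([steady, changing])).shape)
```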
Related papers
- MSSRNet: Manipulating Sequential Style Representation for Unsupervised
Text Style Transfer [82.37710853235535]
The unsupervised text style transfer task aims to rewrite a text in a target style while preserving its main content.
Traditional methods rely on a fixed-size vector to regulate text style, which makes it difficult to accurately convey the style strength of each individual token.
Our proposed method addresses this issue by assigning an individual style vector to each token in a text, allowing for fine-grained control and manipulation of the style strength.
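The per-token idea can be sketched as a module that predicts a style vector for every token and fuses it back into that token's representation. Everything below (module name, dimensions) is an assumption for illustration, not MSSRNet's architecture:

```python
# Sketch of per-token style vectors vs. one fixed-size vector per sentence.
# Illustrative PyTorch module; names and sizes are not from the paper.
import torch
import torch.nn as nn

class TokenStyleModulator(nn.Module):
    def __init__(self, d_model: int = 64, d_style: int = 16):
        super().__init__()
        self.to_style = nn.Linear(d_model, d_style)        # one style vector per token
        self.fuse = nn.Linear(d_model + d_style, d_model)  # inject it back per token

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, d_model); style varies along seq_len,
        # unlike a single sentence-level vector broadcast to every position.
        style = self.to_style(tokens)
        return self.fuse(torch.cat([tokens, style], dim=-1))

print(TokenStyleModulator()(torch.randn(2, 10, 64)).shape)  # (2, 10, 64)
```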
arXiv Detail & Related papers (2023-06-12T13:12:29Z)
- Discrete Cross-Modal Alignment Enables Zero-Shot Speech Translation [71.35243644890537]
End-to-end Speech Translation (ST) aims to translate source-language speech into target-language text without generating intermediate transcriptions.
Existing zero-shot methods fail to align the two modalities of speech and text into a shared semantic space.
We propose a novel Discrete Cross-Modal Alignment (DCMA) method that employs a shared discrete vocabulary space to accommodate and match both modalities of speech and text.
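A shared discrete vocabulary can be sketched as vector quantization of both modalities against a single codebook, so speech and text land in the same discrete space. The codebook size and stand-in encoder outputs below are assumptions for illustration, not DCMA's model:

```python
# Sketch of a shared discrete vocabulary across modalities: both speech and
# text features are snapped to the nearest codeword of the SAME codebook.
import torch

def quantize(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """features: (T, D), codebook: (K, D) -> (T,) discrete code indices."""
    return torch.cdist(features, codebook).argmin(dim=-1)

codebook = torch.randn(256, 32)      # one shared vocabulary of 256 codes
speech_repr = torch.randn(50, 32)    # stand-in for a speech encoder's output
text_repr = torch.randn(12, 32)      # stand-in for a text encoder's output
print(quantize(speech_repr, codebook)[:5])  # codes from the same index space
print(quantize(text_repr, codebook)[:5])
```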
arXiv Detail & Related papers (2022-10-18T03:06:47Z)
- StoryTrans: Non-Parallel Story Author-Style Transfer with Discourse Representations and Content Enhancing [73.81778485157234]
Compared with single sentences, long texts usually involve more complicated authorial linguistic preferences, such as discourse structures.
We formulate the task of non-parallel story author-style transfer, which requires transferring an input story into a specified author style.
We use an additional training objective to disentangle stylistic features from the learned discourse representation to prevent the model from degenerating to an auto-encoder.
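One generic way to realize such a disentanglement objective is to penalize how much style information a probe can read off the content/discourse representation. This is a sketch of that general pattern, not StoryTrans's actual loss:

```python
# Sketch of an auxiliary "style leakage" penalty: push a style probe's
# prediction on the content representation toward uniform (uninformative).
import torch
import torch.nn.functional as F

def style_leakage_loss(content_repr: torch.Tensor, probe: torch.nn.Module) -> torch.Tensor:
    log_probs = F.log_softmax(probe(content_repr), dim=-1)  # (batch, n_styles)
    uniform = torch.full_like(log_probs, 1.0 / log_probs.size(-1))
    return F.kl_div(log_probs, uniform, reduction="batchmean")

probe = torch.nn.Linear(32, 4)   # hypothetical probe over 4 author styles
print(float(style_leakage_loss(torch.randn(8, 32), probe)))
```

Minimizing a term like this alongside reconstruction discourages the encoder from smuggling style into the content code, which is what lets a plain auto-encoder "cheat".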
arXiv Detail & Related papers (2022-08-29T08:47:49Z)
- Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS [7.384726530165295]
Style control of synthetic speech is often restricted to discrete emotion categories.
We propose a text-based interface for emotional style control and cross-speaker style transfer in multi-speaker TTS.
arXiv Detail & Related papers (2022-07-13T07:05:44Z)
- Self-supervised Context-aware Style Representation for Expressive Speech Synthesis [23.460258571431414]
We propose a novel framework for learning style representation from plain text in a self-supervised manner.
It leverages an emotion lexicon and uses contrastive learning and deep clustering.
Our method achieves improved results according to subjective evaluations on both in-domain and out-of-domain test sets in audiobook speech.
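The contrastive component can be sketched with a standard InfoNCE objective; in this hypothetical setup, the i-th anchor and i-th positive would be two texts sharing the same emotion-lexicon label (the pairing and sizes are illustrative, not the paper's pipeline):

```python
# Generic InfoNCE contrastive loss over style embeddings (sketch only).
import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor, positives: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.t() / temperature        # (N, N) pairwise similarities
    targets = torch.arange(a.size(0))       # i-th anchor matches i-th positive
    return F.cross_entropy(logits, targets)

print(float(info_nce(torch.randn(16, 32), torch.randn(16, 32))))
```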
arXiv Detail & Related papers (2022-06-25T05:29:48Z)
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model for high-fidelity zero-shot style transfer of out-of-domain (OOD) custom voice.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
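The stated decomposition can be sketched as two parallel encoders whose outputs carry the style-agnostic and style-specific parts, respectively; the layers and sizes below are assumptions, not GenerSpeech's actual components:

```python
# Sketch of splitting speech variation into two parallel branches.
import torch
import torch.nn as nn

class TwoPartEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, d_hid: int = 64):
        super().__init__()
        self.agnostic = nn.Linear(n_mels, d_hid)  # content shared across styles
        self.specific = nn.Linear(n_mels, d_hid)  # speaker/style-dependent residual

    def forward(self, mel: torch.Tensor):
        # mel: (batch, frames, n_mels) -> two same-shaped representations
        return self.agnostic(mel), self.specific(mel)

content, style = TwoPartEncoder()(torch.randn(2, 100, 80))
print(content.shape, style.shape)
```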
arXiv Detail & Related papers (2022-05-15T08:16:02Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data from the target speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- AlloST: Low-resource Speech Translation without Source Transcription [17.53382405899421]
We propose a learning framework that utilizes a language-independent universal phone recognizer.
The framework is based on an attention-based sequence-to-sequence model.
Experiments conducted on the Fisher Spanish-English and Taigi-Mandarin drama corpora show that our method outperforms the conformer-based baseline.
arXiv Detail & Related papers (2021-05-01T05:30:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.