Automatic Prosody Annotation with Pre-Trained Text-Speech Model
- URL: http://arxiv.org/abs/2206.07956v1
- Date: Thu, 16 Jun 2022 06:54:16 GMT
- Title: Automatic Prosody Annotation with Pre-Trained Text-Speech Model
- Authors: Ziqian Dai, Jianwei Yu, Yan Wang, Nuo Chen, Yanyao Bian, Guangzhi Li,
Deng Cai, Dong Yu
- Abstract summary: We propose to automatically extract prosodic boundary labels from text-audio data via a neural text-speech model with pre-trained audio encoders.
This model is pre-trained on text and speech data separately and jointly fine-tuned on TTS data in a triplet format: {speech, text, prosody}.
- Score: 48.47706377700962
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prosodic boundary plays an important role in text-to-speech synthesis (TTS)
in terms of naturalness and readability. However, the acquisition of prosodic
boundary labels relies on manual annotation, which is costly and
time-consuming. In this paper, we propose to automatically extract prosodic
boundary labels from text-audio data via a neural text-speech model with
pre-trained audio encoders. This model is pre-trained on text and speech data
separately and jointly fine-tuned on TTS data in a triplet format: {speech,
text, prosody}. The experimental results on both automatic evaluation and human
evaluation demonstrate that: 1) the proposed text-speech prosody annotation
framework significantly outperforms text-only baselines; 2) the quality of
automatic prosodic boundary annotations is comparable to human annotations; 3)
TTS systems trained with model-annotated boundaries are slightly better than
systems that use manual ones.
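As a rough illustration of the {speech, text, prosody} triplet format and of the automatic evaluation mentioned in the abstract, the sketch below stores one triplet and scores predicted boundary labels against reference labels with a per-label F1. Everything named here is an assumption made for illustration: the `ProsodyTriplet` class, the `boundary_f1` helper, the file name, and the label inventory ("O" for no boundary, "PW" and "PPH" for two boundary levels) do not come from the abstract, which does not specify the label set or the metric.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

# Hypothetical label for "no boundary after this token"; the abstract
# does not name the label inventory the authors use.
NO_BOUNDARY = "O"


@dataclass
class ProsodyTriplet:
    """One example in an assumed {speech, text, prosody} layout."""
    speech_path: str       # hypothetical path to the audio recording
    tokens: List[str]      # the text, split into tokens
    boundaries: List[str]  # one boundary label following each token


def boundary_f1(reference: Sequence[str],
                predicted: Sequence[str]) -> Dict[str, float]:
    """Return the F1 score of each boundary label, ignoring NO_BOUNDARY."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have equal length")
    labels = sorted((set(reference) | set(predicted)) - {NO_BOUNDARY})
    scores: Dict[str, float] = {}
    for label in labels:
        pairs = list(zip(reference, predicted))
        tp = sum(1 for r, p in pairs if r == label and p == label)
        fp = sum(1 for r, p in pairs if r != label and p == label)
        fn = sum(1 for r, p in pairs if r == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


# A manually labelled triplet and a model prediction for the same tokens.
manual = ProsodyTriplet(
    speech_path="utt_0001.wav",  # hypothetical file name
    tokens=["a", "b", "c", "d"],
    boundaries=["O", "PW", "O", "PPH"],
)
model_labels = ["O", "PW", "PW", "PPH"]

scores = boundary_f1(manual.boundaries, model_labels)
```

In this toy case the model adds one spurious "PW" boundary, so the "PW" score drops below 1 while the "PPH" score stays perfect. A per-label score of this kind is only one plausible reading of "automatic evaluation"; the paper itself should be consulted for the actual metric.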
Related papers
- Scaling Speech-Text Pre-training with Synthetic Interleaved Data [31.77653849518526]
Speech language models (SpeechLMs) accept speech input and produce speech output, allowing for more natural human-computer interaction.
Traditional approaches for developing SpeechLMs are constrained by the limited availability of unsupervised speech data and parallel speech-text data.
We propose a novel approach to scaling speech-text pre-training by leveraging large-scale synthetic interleaved data derived from text corpora.
arXiv Detail & Related papers (2024-11-26T17:19:09Z)
- Textless Dependency Parsing by Labeled Sequence Prediction [18.32371054754222]
"Textless" methods process speech representations without automatic speech recognition systems.
Our proposed method predicts a dependency tree from a speech signal without transcribing, representing the tree as a labeled sequence.
Our findings highlight the importance of fusing word-level representations and sentence-level prosody for enhanced parsing performance.
arXiv Detail & Related papers (2024-07-14T08:38:14Z)
- Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP [18.90593650641679]
A two-stage automatic annotation pipeline is proposed in this paper.
In the first stage, we use contrastive pretraining of Speech-Silence and Word-Punctuation pairs to enhance prosodic information in latent representations.
In the second stage, we build a multi-modal prosody annotator, comprising pretrained encoders, a text-speech fusing scheme, and a sequence classifier.
Experiments on English prosodic boundaries demonstrate that our method achieves state-of-the-art (SOTA) performance, with F1 scores of 0.72 and 0.93 for Prosodic Word and Prosodic Phrase boundaries, respectively.
arXiv Detail & Related papers (2023-09-11T12:50:28Z)
- token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text [65.04385919645395]
token2vec is a novel joint pre-training framework for unpaired speech and text based on discrete representations of speech.
Experiments show that token2vec is significantly superior to various speech-only pre-training baselines, with up to 17.7% relative WER reduction.
arXiv Detail & Related papers (2022-10-30T06:38:19Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks, including speech recognition, speech translation, and the universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z)
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
arXiv Detail & Related papers (2022-06-05T10:50:34Z)
- Guided-TTS: Text-to-Speech with Untranscribed Speech [22.548875263927396]
We present Guided-TTS, a high-quality TTS model that learns to generate speech from untranscribed speech data.
For text-to-speech synthesis, we guide the generative process of the unconditional DDPM via phoneme classification to produce mel-spectrograms.
arXiv Detail & Related papers (2021-11-23T10:05:05Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)