Related papers: An investigation of phrase break prediction in an End-to-End TTS system

Related papers

Word-wise intonation model for cross-language TTS systems [0.0]
The proposed model is suitable for automatic data markup and its extended application to text-to-speech systems. The key idea is a partial elimination of the variability connected with different placements of a stressed syllable in a word. The proposed model could be used as a tool for intonation research or as a backbone for prosody description in text-to-speech systems.
arXiv Detail & Related papers (2024-09-30T15:09:42Z)
Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling [13.757256085713571]
We present TAP-FM, a novel two-stage prediction pipeline. Specifically, we present a Multi-scale Contrastive Text-audio Pre-training protocol (MC-TAP), which aims to acquire richer insights via multi-granularity contrastive pre-training in an unsupervised manner. Our framework demonstrates the ability to delve deep into both global and local text-audio semantic and acoustic representations.
arXiv Detail & Related papers (2024-04-14T08:56:19Z)
Prosody in Cascade and Direct Speech-to-Text Translation: a case study on Korean Wh-Phrases [79.07111754406841]
This work proposes using contrastive evaluation to measure the ability of direct S2TT systems to disambiguate utterances where prosody plays a crucial role. Our results clearly demonstrate the value of direct translation systems over cascade translation models.
arXiv Detail & Related papers (2024-02-01T14:46:35Z)
Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction [15.72317249204736]
We propose a novel text-to-speech (TTS) framework centered around a neural transducer. Our approach divides the whole TTS pipeline into semantic-level sequence-to-sequence (seq2seq) modeling and fine-grained acoustic modeling stages. Our experimental results on zero-shot adaptive TTS demonstrate that our model surpasses the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-01-03T02:03:36Z)
Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic Token Prediction [14.661123738628772]
We introduce a text-to-speech (TTS) framework based on a neural transducer. We use discretized semantic tokens acquired from wav2vec2.0 embeddings, which makes it easy to adopt a neural transducer for the TTS framework and benefit from its monotonic alignment constraints.
arXiv Detail & Related papers (2023-11-06T06:13:39Z)
High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations. A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-based Prosody Modeling [25.966328901566815]
We propose PauseSpeech, a speech synthesis system with a pre-trained language model and pause-based prosody modeling. Experimental results show PauseSpeech outperforms previous models in terms of naturalness.
arXiv Detail & Related papers (2023-06-13T01:36:55Z)
Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech [40.65850332919397]
We propose more powerful pause insertion frameworks based on a pre-trained language model. Our approach uses bidirectional encoder representations from transformers (BERT) pre-trained on a large-scale text corpus. We also leverage duration-aware pause insertion for more natural multi-speaker TTS.
arXiv Detail & Related papers (2023-02-27T10:40:41Z)
ParaTTS: Learning Linguistic and Prosodic Cross-sentence Information in Paragraph-based TTS [19.988974534582205]
We propose to model linguistic and prosodic information by considering cross-sentence, embedded structure in training. We trained on a storytelling audio-book corpus (4.08 hours), recorded by a female Mandarin Chinese speaker. The proposed TTS model demonstrates that it can produce rather natural and good-quality speech paragraph-wise.
arXiv Detail & Related papers (2022-09-14T08:34:16Z)
BERT, can HE predict contrastive focus? Predicting and controlling prominence in neural TTS using a language model [29.188684861193092]
We evaluate the accuracy of a BERT model, finetuned to predict quantized acoustic prominence features, on utterances containing contrastive focus. We also evaluate the controllability of pronoun prominence in a TTS model conditioned on acoustic prominence features.
arXiv Detail & Related papers (2022-07-04T20:43:41Z)
Revisiting End-to-End Speech-to-Text Translation From Scratch [48.203394370942505]
End-to-end (E2E) speech-to-text translation (ST) often depends on pretraining its encoder and/or decoder using source transcripts via speech recognition or text translation tasks. In this paper, we explore the extent to which the quality of E2E ST trained on speech-translation pairs alone can be improved.
arXiv Detail & Related papers (2022-06-09T15:39:19Z)
A Complementary Joint Training Approach Using Unpaired Speech and Text for Low-Resource Automatic Speech Recognition [25.473191378558138]
We leverage unpaired data to train a general sequence-to-sequence model. Inspired by the complementarity of the speech-PseudoLabel pair and the SynthesizedAudio-text pair, we propose a complementary joint training (CJT) method.
arXiv Detail & Related papers (2022-04-05T07:02:53Z)
LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction. We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
A study on the efficacy of model pre-training in developing neural text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance. It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z)
On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis [102.80458458550999]
We investigate the tradeoffs between sparsity and its subsequent effects on synthetic speech. Our findings suggest that not only are end-to-end TTS models highly prunable, but also, perhaps surprisingly, pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility.
arXiv Detail & Related papers (2021-10-04T02:03:28Z)
Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation. We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead. When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
Infusing Finetuning with Semantic Dependencies [62.37697048781823]
We show that, unlike syntax, semantics is not brought to the surface by today's pretrained models. We then use convolutional graph encoders to explicitly incorporate semantic parses into task-specific finetuning.
arXiv Detail & Related papers (2020-12-10T01:27:24Z)
Introducing Syntactic Structures into Target Opinion Word Extraction with Deep Learning [89.64620296557177]
We propose to incorporate the syntactic structures of the sentences into the deep learning models for targeted opinion word extraction. We also introduce a novel regularization technique to improve the performance of the deep learning models. The proposed model is extensively analyzed and achieves state-of-the-art performance on four benchmark datasets.
arXiv Detail & Related papers (2020-10-26T07:13:17Z)
SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze the input acoustic signal to understand its linguistic content and make predictions. Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text. We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS). A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation. We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)

This list is automatically generated from the titles and abstracts of the papers on this site.