Phrase break prediction with bidirectional encoder representations in Japanese text-to-speech synthesis
- URL: http://arxiv.org/abs/2104.12395v1
- Date: Mon, 26 Apr 2021 08:29:29 GMT
- Title: Phrase break prediction with bidirectional encoder representations in Japanese text-to-speech synthesis
- Authors: Kosuke Futamata, Byeongseon Park, Ryuichi Yamamoto, Kentaro Tachibana
- Abstract summary: We propose a novel phrase break prediction method that combines implicit features extracted from a pre-trained large language model, a.k.a. BERT, and explicit features extracted from a BiLSTM with linguistic features.
The proposed method takes both representations into account to extract latent semantics that cannot be captured by previous methods.
- Score: 8.391631335854457
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel phrase break prediction method that combines implicit
features extracted from a pre-trained large language model, a.k.a. BERT, and
explicit features extracted from a BiLSTM with linguistic features. In
conventional BiLSTM-based methods, word representations and/or sentence
representations are used as independent components. The proposed method takes
both representations into account to extract latent semantics that cannot
be captured by previous methods. The objective evaluation results show that the
proposed method obtains an absolute improvement of 3.2 points in F1 score
over conventional BiLSTM-based methods using linguistic features.
Moreover, the perceptual listening test results verify that a TTS system
using our proposed method achieved a mean opinion score of 4.39 in prosody
naturalness, which is highly competitive with the score of 4.37 for speech
synthesized with ground-truth phrase breaks.
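The abstract describes fusing implicit BERT representations with explicit, BiLSTM-encoded linguistic features for token-level break prediction. The sketch below illustrates that general idea in PyTorch; the checkpoint name, feature dimensions, concatenation-based fusion, binary break/no-break labels, and the assumption that the linguistic features are pre-aligned to BERT tokens are all illustrative choices, not details confirmed by the abstract.

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class PhraseBreakPredictor(nn.Module):
    """Minimal sketch: fuse implicit BERT features with explicit
    linguistic features encoded by a BiLSTM, then classify each token
    as break / no-break. Architecture details are assumptions."""

    def __init__(self, bert_name="cl-tohoku/bert-base-japanese",
                 ling_dim=32, lstm_dim=256, num_labels=2):
        super().__init__()
        # Implicit features: contextual embeddings from a pre-trained
        # (here, hypothetically Japanese) BERT checkpoint.
        self.bert = AutoModel.from_pretrained(bert_name)
        bert_dim = self.bert.config.hidden_size
        # Explicit features: hand-crafted linguistic features
        # (e.g. POS or accent-phrase cues) encoded by a BiLSTM.
        self.bilstm = nn.LSTM(ling_dim, lstm_dim, batch_first=True,
                              bidirectional=True)
        # Token-level classifier over the concatenated representations.
        self.classifier = nn.Linear(bert_dim + 2 * lstm_dim, num_labels)

    def forward(self, input_ids, attention_mask, ling_feats):
        # (batch, seq_len, bert_dim) contextual word representations
        implicit = self.bert(input_ids=input_ids,
                             attention_mask=attention_mask).last_hidden_state
        # (batch, seq_len, 2 * lstm_dim); assumes ling_feats is aligned
        # one-to-one with the BERT token sequence.
        explicit, _ = self.bilstm(ling_feats)
        # Fuse both views and predict a break label per token.
        fused = torch.cat([implicit, explicit], dim=-1)
        return self.classifier(fused)
```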
Related papers
- Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT [81.99600765234285]
We propose an end-to-end framework to predict the pronunciation of a polyphonic character.
The proposed method consists of a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model and a neural network (NN)-based classifier.
arXiv Detail & Related papers (2025-01-02T06:51:52Z) - Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback [50.84142264245052]
This work introduces the Align-SLM framework to enhance the semantic understanding of textless Spoken Language Models (SLMs).
Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO).
We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT4-o score and human evaluation.
arXiv Detail & Related papers (2024-11-04T06:07:53Z) - DenoSent: A Denoising Objective for Self-Supervised Sentence Representation Learning [59.4644086610381]
We propose a novel denoising objective that works from a different perspective, i.e., the intra-sentence perspective.
By introducing both discrete and continuous noise, we generate noisy sentences and then train our model to restore them to their original form.
Our empirical evaluations demonstrate that this approach delivers competitive results on both semantic textual similarity (STS) and a wide range of transfer tasks.
arXiv Detail & Related papers (2024-01-24T17:48:45Z) - Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model [13.572330725278066]
A novel point of the proposed method is the direct use of the self-supervised learning (SSL) model to obtain embedding vectors from speech representations trained on a large amount of data.
The disentangled embeddings enable better reproduction performance for unseen speakers and rhythm transfer conditioned on different speech samples.
arXiv Detail & Related papers (2023-04-24T10:15:58Z) - An investigation of phrase break prediction in an End-to-End TTS system [0.0]
This work explores the use of external phrase break prediction models to enhance listener comprehension in End-to-End Text-to-Speech (TTS) systems.
The effectiveness of these models is evaluated based on listener preferences in subjective tests.
arXiv Detail & Related papers (2023-04-09T04:26:58Z) - Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z) - Discrete representations in neural models of spoken language [56.29049879393466]
We compare the merits of four commonly used metrics in the context of weakly supervised models of spoken language.
We find that the different evaluation metrics can give inconsistent results.
arXiv Detail & Related papers (2021-05-12T11:02:02Z) - Language Models for Lexical Inference in Context [4.581468205348204]
Lexical inference in context (LIiC) is the task of recognizing textual entailment between two very similar sentences.
We formulate and evaluate the first approaches based on pretrained language models (LMs) for this task.
All our approaches outperform the previous state of the art, showing the potential of pretrained LMs for LIiC.
arXiv Detail & Related papers (2021-02-10T09:08:22Z) - Translate Reverberated Speech to Anechoic Ones: Speech Dereverberation with BERT [6.876734825043823]
Single-channel speech dereverberation is considered in this work.
Inspired by the recent success of the Bidirectional Encoder Representations from Transformers (BERT) model in the domain of Natural Language Processing (NLP), we investigate its applicability as a backbone sequence model for enhancing reverberated speech signals.
arXiv Detail & Related papers (2020-07-16T00:45:27Z) - An Effective Contextual Language Modeling Framework for Speech Summarization with Augmented Features [13.97006782398121]
The Bidirectional Encoder Representations from Transformers (BERT) model was proposed and has achieved record-breaking success on many natural language processing tasks.
We explore the incorporation of confidence scores into sentence representations to see if such an attempt could help alleviate the negative effects caused by imperfect automatic speech recognition.
We validate the effectiveness of our proposed method on a benchmark dataset.
arXiv Detail & Related papers (2020-06-01T18:27:48Z) - Syntactic Structure Distillation Pretraining For Bidirectional Encoders [49.483357228441434]
We introduce a knowledge distillation strategy for injecting syntactic biases into BERT pretraining.
We distill the approximate marginal distribution over words in context from the syntactic LM.
Our findings demonstrate the benefits of syntactic biases, even in representation learners that exploit large amounts of data.
arXiv Detail & Related papers (2020-05-27T16:44:01Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.