Statistical Context-Dependent Units Boundary Correction for Corpus-based
Unit-Selection Text-to-Speech
- URL: http://arxiv.org/abs/2003.02837v2
- Date: Wed, 29 Apr 2020 15:52:05 GMT
- Title: Statistical Context-Dependent Units Boundary Correction for Corpus-based
Unit-Selection Text-to-Speech
- Authors: Claudio Zito, Fabio Tesser, Mauro Nicolao, Piero Cosi
- Abstract summary: We present an innovative technique for speaker adaptation in order to improve the accuracy of segmentation with application to unit-selection Text-To-Speech (TTS) systems.
Unlike conventional techniques for speaker adaptation, we aim to use only context dependent characteristics extrapolated with linguistic analysis techniques.
- Score: 1.4337588659482519
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: In this study, we present an innovative technique for speaker adaptation in
order to improve the accuracy of segmentation with application to
unit-selection Text-To-Speech (TTS) systems. Unlike conventional techniques for
speaker adaptation, which attempt to improve the accuracy of the segmentation
using acoustic models that are more robust in the face of the speaker's
characteristics, we aim to use only context dependent characteristics
extrapolated with linguistic analysis techniques. In simple terms, we use the
intuitive idea that context dependent information is tightly correlated with
the related acoustic waveform. We propose a statistical model, which predicts
correcting values to reduce the systematic error produced by a state-of-the-art
Hidden Markov Model (HMM) based speech segmentation. Our approach consists of
two phases: (1) identifying context-dependent phonetic unit classes (for
instance, the class which identifies vowels as being the nucleus of
monosyllabic words); and (2) building a regression model that associates the
mean error value made by the ASR during the segmentation of a single speaker
corpus to each class. The success of the approach is evaluated by comparing the
corrected boundaries of units and the state-of-the-art HHM segmentation against
a reference alignment, which is supposed to be the optimal solution. In
conclusion, our work supplies a first analysis of a model sensitive to
speaker-dependent characteristics, robust to defective and noisy information,
and a very simple implementation which could be utilized as an alternative to
either more expensive speaker-adaptation systems or of numerous manual
correction sessions.
Related papers
- Listenable Maps for Zero-Shot Audio Classifiers [12.446324804274628]
We introduce LMAC-Z (Listenable Maps for Audio) for the first time in the Zero-Shot context.
We show that our method produces meaningful explanations that correlate well with different text prompts.
arXiv Detail & Related papers (2024-05-27T19:25:42Z) - Learning Disentangled Speech Representations [0.412484724941528]
SynSpeech is a novel large-scale synthetic speech dataset designed to enable research on disentangled speech representations.
We present a framework to evaluate disentangled representation learning techniques, applying both linear probing and established supervised disentanglement metrics.
We find that SynSpeech facilitates benchmarking across a range of factors, achieving promising disentanglement of simpler features like gender and speaking style, while highlighting challenges in isolating complex attributes like speaker identity.
arXiv Detail & Related papers (2023-11-04T04:54:17Z) - Disentangling Voice and Content with Self-Supervision for Speaker
Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments conducted on the VoxCeleb and SITW datasets with 9.56% and 8.24% average reductions in EER and minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z) - Robust Acoustic and Semantic Contextual Biasing in Neural Transducers
for Speech Recognition [14.744220870243932]
We propose to use lightweight character representations to encode fine-grained pronunciation features to improve contextual biasing.
We further integrate pretrained neural language model (NLM) based encoders to encode the utterance's semantic context.
Experiments using a Conformer Transducer model on the Librispeech dataset show a 4.62% - 9.26% relative WER improvement on different biasing list sizes.
arXiv Detail & Related papers (2023-05-09T08:51:44Z) - Zero-shot text-to-speech synthesis conditioned using self-supervised
speech representation model [13.572330725278066]
A novel point of the proposed method is the direct use of the SSL model to obtain embedding vectors from speech representations trained with a large amount of data.
The disentangled embeddings will enable us to achieve better reproduction performance for unseen speakers and rhythm transfer conditioned by different speeches.
arXiv Detail & Related papers (2023-04-24T10:15:58Z) - Speaker Embedding-aware Neural Diarization: a Novel Framework for
Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z) - Discretization and Re-synthesis: an alternative method to solve the
Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
By utilizing the synthesis model with the input of discrete symbols, after the prediction of discrete symbol sequence, each target speech could be re-synthesized.
arXiv Detail & Related papers (2021-12-17T08:35:40Z) - Using multiple reference audios and style embedding constraints for
speech synthesis [68.62945852651383]
The proposed model can improve the speech naturalness and content quality with multiple reference audios.
The model can also outperform the baseline model in ABX preference tests of style similarity.
arXiv Detail & Related papers (2021-10-09T04:24:29Z) - VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised
Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z) - Segmenting Subtitles for Correcting ASR Segmentation Errors [11.854481771567503]
We propose a model for correcting the acoustic segmentation of ASR models for low-resource languages.
We train a neural tagging model for correcting ASR acoustic segmentation and show that it improves downstream performance.
arXiv Detail & Related papers (2021-04-16T03:04:10Z) - Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence
Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.