Multi-Scale Spectrogram Modelling for Neural Text-to-Speech
- URL: http://arxiv.org/abs/2106.15649v1
- Date: Tue, 29 Jun 2021 18:01:34 GMT
- Title: Multi-Scale Spectrogram Modelling for Neural Text-to-Speech
- Authors: Ammar Abbas, Bajibabu Bollepalli, Alexis Moinet, Arnaud Joly, Penny
Karanasou, Peter Makarov, Simon Slangens, Sri Karlapati, Thomas Drugman
- Abstract summary: We propose a novel Multi-Scale Spectrogram (MSS) modelling approach to synthesise speech with an improved coarse and fine-grained prosody.
We present details for two specific versions of MSS called Word-level MSS and Sentence-level MSS.
- Score: 19.42517284981061
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel Multi-Scale Spectrogram (MSS) modelling approach to
synthesise speech with an improved coarse and fine-grained prosody. We present
a generic multi-scale spectrogram prediction mechanism where the system first
predicts coarser scale mel-spectrograms that capture the suprasegmental
information in speech, and later uses these coarser scale mel-spectrograms to
predict finer scale mel-spectrograms capturing fine-grained prosody.
We present details for two specific versions of MSS, called Word-level MSS and
Sentence-level MSS, where the scales in our system are motivated by linguistic
units. Word-level MSS models word-, phoneme-, and frame-level spectrograms,
while Sentence-level MSS additionally models a sentence-level spectrogram.
Subjective evaluations show that Word-level MSS performs statistically
significantly better than the baseline on two voices.
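The coarse-to-fine idea in the abstract can be sketched in a few lines. The following is a minimal, hypothetical illustration, not the paper's implementation: the function names, shapes, and the simple mean-pooling and broadcasting are assumptions standing in for learned coarse- and fine-scale predictors.

```python
# Hypothetical sketch of the multi-scale idea: pool a frame-level
# mel-spectrogram to a coarser (word-level) scale, then broadcast the
# coarse scale back to frame rate, as a conditioning signal for
# fine-scale prediction. In MSS the coarse scale would be predicted by
# a model, not computed from ground truth; this only shows the shapes.

def pool_to_word_level(mel, word_boundaries):
    """Average frame-level mel vectors within each word's frame span.

    mel: list of frame vectors (n_frames x n_mels).
    word_boundaries: list of (start, end) frame indices per word.
    Returns a word-level spectrogram (n_words x n_mels).
    """
    coarse = []
    for start, end in word_boundaries:
        frames = mel[start:end]
        n_mels = len(frames[0])
        coarse.append([sum(f[b] for f in frames) / len(frames)
                       for b in range(n_mels)])
    return coarse

def broadcast_to_frames(word_mel, word_boundaries, n_frames):
    """Repeat each word-level vector across its frames (conditioning)."""
    out = [None] * n_frames
    for vec, (start, end) in zip(word_mel, word_boundaries):
        for t in range(start, end):
            out[t] = vec
    return out

# Toy example: 6 frames, 2 mel bins, two words spanning 4 and 2 frames.
mel = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0],
       [6.0, 7.0], [8.0, 9.0], [10.0, 11.0]]
bounds = [(0, 4), (4, 6)]
coarse = pool_to_word_level(mel, bounds)
cond = broadcast_to_frames(coarse, bounds, len(mel))
print(coarse)  # [[3.0, 4.0], [9.0, 10.0]]
```

The Sentence-level variant would simply add one more pooling step over the whole utterance before the word level.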
Related papers
- On combining acoustic and modulation spectrograms in an attention LSTM-based system for speech intelligibility level classification [0.0]
We present a non-intrusive system based on LSTM networks with attention mechanism designed for speech intelligibility prediction.
Two different strategies for the combination of per-frame acoustic log-mel and modulation spectrograms into the LSTM framework are explored.
The proposed models are evaluated with the UA-Speech database that contains dysarthric speech with different degrees of severity.
arXiv Detail & Related papers (2024-02-05T10:26:28Z)
- SSHR: Leveraging Self-supervised Hierarchical Representations for Multilingual Automatic Speech Recognition [9.853451215277346]
We propose a novel method that leverages self-supervised hierarchical representations (SSHR) to fine-tune the MMS model.
We evaluate SSHR on two multilingual datasets, Common Voice and ML-SUPERB, and the experimental results demonstrate that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-09-29T02:35:36Z)
- Towards Robust FastSpeech 2 by Modelling Residual Multimodality [4.4904382374090765]
State-of-the-art non-autoregressive text-to-speech models based on FastSpeech 2 can efficiently synthesise high-fidelity and natural speech.
We observe characteristic audio distortions in expressive speech datasets.
TVC-GMM reduces spectrogram smoothness and improves perceptual audio quality in particular for expressive datasets.
arXiv Detail & Related papers (2023-06-02T11:03:26Z)
- SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
- MAST: Multiscale Audio Spectrogram Transformers [53.06337011259031]
We present Multiscale Audio Spectrogram Transformer (MAST) for audio classification, which brings the concept of multiscale feature hierarchies to the Audio Spectrogram Transformer (AST).
In practice, MAST significantly outperforms AST by an average accuracy of 3.4% across 8 speech and non-speech tasks from the LAPE Benchmark.
arXiv Detail & Related papers (2022-11-02T23:34:12Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
- General-Purpose Speech Representation Learning through a Self-Supervised Multi-Granularity Framework [114.63823178097402]
This paper presents a self-supervised learning framework, named MGF, for general-purpose speech representation learning.
Specifically, we propose to use generative learning approaches to capture fine-grained information at small time scales and use discriminative learning approaches to distill coarse-grained or semantic information at large time scales.
arXiv Detail & Related papers (2021-02-03T08:13:21Z)
- Language Through a Prism: A Spectral Approach for Multiscale Language Representations [30.224517199646993]
We show that signal processing provides a natural framework for separating structure across scales.
We apply spectral filters to the activations of a neuron across an input, producing filtered embeddings that perform well on part of speech tagging.
We also present a prism layer for training models, which uses spectral filters to constrain different neurons to model structure at different scales.
arXiv Detail & Related papers (2020-11-09T23:17:43Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.