RyanSpeech: A Corpus for Conversational Text-to-Speech Synthesis
- URL: http://arxiv.org/abs/2106.08468v1
- Date: Tue, 15 Jun 2021 22:24:38 GMT
- Title: RyanSpeech: A Corpus for Conversational Text-to-Speech Synthesis
- Authors: Rohola Zandie, Mohammad H. Mahoor, Julia Madsen, and Eshrat S. Emamian
- Abstract summary: RyanSpeech is a new speech corpus for research on automated text-to-speech (TTS) systems.
It contains over 10 hours of a professional male voice actor's speech recorded at 44.1 kHz.
- Score: 3.6406488220483317
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces RyanSpeech, a new speech corpus for research on
automated text-to-speech (TTS) systems. Publicly available TTS corpora are
often noisy, recorded with multiple speakers, or lack quality male speech data.
In order to meet the need for a high quality, publicly available male speech
corpus within the field of speech recognition, we have designed and created
RyanSpeech which contains textual materials from real-world conversational
settings. These materials contain over 10 hours of a professional male voice
actor's speech recorded at 44.1 kHz. This corpus's design and pipeline make
RyanSpeech ideal for developing TTS systems in real-world applications. To
provide a baseline for future research, protocols, and benchmarks, we trained 4
state-of-the-art speech models and a vocoder on RyanSpeech. The results show
a mean opinion score (MOS) of 3.36 for our best model. We have made both the
corpus and trained models available for public use.
Related papers
- HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis [39.892633589217326]
Large language model (LLM)-based speech synthesis has been widely adopted in zero-shot speech synthesis.
This paper proposes HierSpeech++, a fast and strong zero-shot speech synthesizer for text-to-speech (TTS) and voice conversion (VC).
arXiv Detail & Related papers (2023-11-21T09:07:11Z)
- SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models [58.996653700982556]
Existing speech tokens are not specifically designed for speech language modeling.
We propose SpeechTokenizer, a unified speech tokenizer for speech large language models.
Experiments show that SpeechTokenizer performs comparably to EnCodec in speech reconstruction and demonstrates strong performance on the SLMTokBench benchmark.
arXiv Detail & Related papers (2023-08-31T12:53:09Z)
- ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading [65.88161811719353]
This work develops a lightweight yet effective Text-to-Speech system, ContextSpeech.
We first design a memory-cached recurrence mechanism to incorporate global text and speech context into sentence encoding.
We construct hierarchically-structured textual semantics to broaden the scope for global context enhancement.
Experiments show that ContextSpeech significantly improves the voice quality and prosody in paragraph reading with competitive model efficiency.
arXiv Detail & Related papers (2023-07-03T06:55:03Z)
- PolyVoice: Language Models for Speech to Speech Translation [50.31000706309143]
PolyVoice is a language model-based framework for speech-to-speech translation (S2ST).
We use discretized speech units, which are generated in a fully unsupervised way.
For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model.
arXiv Detail & Related papers (2023-06-05T15:53:15Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to get the quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, synthesis, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus [3.1925030748447747]
We present a speech corpus for Classical Arabic Text-to-Speech (ClArTTS) to support the development of end-to-end TTS systems for Arabic.
The speech is extracted from a LibriVox audiobook, which is then processed, segmented, and manually transcribed and annotated.
The final ClArTTS corpus contains about 12 hours of speech from a single male speaker sampled at 40100 Hz.
arXiv Detail & Related papers (2023-02-28T20:18:59Z)
- IMaSC -- ICFOSS Malayalam Speech Corpus [0.0]
We present IMaSC, a Malayalam text and speech corpus containing approximately 50 hours of recorded speech.
With 8 speakers and a total of 34,473 text-audio pairs, IMaSC is larger than every other publicly available alternative.
We show that our models perform significantly better in terms of naturalness compared to previous studies and publicly available models, with an average mean opinion score of 4.50.
arXiv Detail & Related papers (2022-11-23T09:21:01Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- LibriS2S: A German-English Speech-to-Speech Translation Corpus [12.376309678270275]
We create the first publicly available speech-to-speech training corpus between German and English.
This allows the creation of a new text-to-speech and speech-to-speech translation model.
We propose Text-to-Speech models based on the example of the recently proposed FastSpeech 2 model.
arXiv Detail & Related papers (2022-04-22T09:33:31Z)
- WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition [25.31180901037065]
WenetSpeech is a multi-domain Mandarin corpus consisting of 10000+ hours of high-quality labeled speech.
We collect the data from YouTube and Podcast, which covers a variety of speaking styles, scenarios, domains, topics, and noisy conditions.
arXiv Detail & Related papers (2021-10-07T12:05:29Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)