SPGISpeech: 5,000 hours of transcribed financial audio for fully
formatted end-to-end speech recognition
- URL: http://arxiv.org/abs/2104.02014v2
- Date: Tue, 6 Apr 2021 04:22:48 GMT
- Title: SPGISpeech: 5,000 hours of transcribed financial audio for fully
formatted end-to-end speech recognition
- Authors: Patrick K. O'Neill, Vitaly Lavrukhin, Somshubra Majumdar, Vahid
Noroozi, Yuekai Zhang, Oleksii Kuchaiev, Jagadeesh Balam, Yuliya Dovzhenko,
Keenan Freyberg, Michael D. Shulman, Boris Ginsburg, Shinji Watanabe, and
Georg Kucsko
- Abstract summary: In the English speech-to-text (STT) machine learning task, acoustic models are conventionally trained on uncased Latin characters.
Here we propose a new STT task: end-to-end neural transcription with fully formatted text for target labels.
We present baseline Conformer-based models trained on a corpus of 5,000 hours of professionally transcribed earnings calls, achieving a CER of 1.7.
- Score: 38.96077127913159
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the English speech-to-text (STT) machine learning task, acoustic models
are conventionally trained on uncased Latin characters, and any necessary
orthography (such as capitalization, punctuation, and denormalization of
non-standard words) is imputed by separate post-processing models. This adds
complexity and limits performance, as many formatting tasks benefit from
semantic information present in the acoustic signal but absent in
transcription. Here we propose a new STT task: end-to-end neural transcription
with fully formatted text for target labels. We present baseline
Conformer-based models trained on a corpus of 5,000 hours of professionally
transcribed earnings calls, achieving a CER of 1.7. As a contribution to the
STT research community, we release the corpus free for non-commercial use at
https://datasets.kensho.com/datasets/scribe.
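Because the target labels are fully formatted text, the reported character error rate (CER) penalizes casing, punctuation, and number-formatting mistakes alongside ordinary misrecognitions. A minimal sketch of that metric (standard Levenshtein distance over characters, normalized by reference length; the example strings are illustrative, not drawn from the corpus):

```python
def char_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level Levenshtein distance, normalized by reference length."""
    m, n = len(reference), len(hypothesis)
    # prev[j] = edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution / match
        prev = curr
    return prev[n] / m if m else 0.0

# A fully formatted reference vs. a hypothesis that drops casing and punctuation:
ref = "Revenue grew 12% year-over-year, to $1.3 billion."
hyp = "revenue grew 12% year over year to $1.3 billion."
# Four character edits (one case change, two hyphens, one comma) on a
# 49-character reference, even though every word was recognized correctly.
print(char_error_rate(ref, hyp))
```

This is why formatting errors that a conventional uncased pipeline would never see contribute directly to the 1.7 CER figure reported for the baselines.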
Related papers
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive
Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- T5lephone: Bridging Speech and Text Self-supervised Models for Spoken
Language Understanding via Phoneme level T5 [65.32642587901903]
We conduct extensive studies on how PLMs with different tokenization strategies affect spoken language understanding task.
We extend the idea to create T5lephone, a variant of T5 that is pretrained using phonemicized text.
arXiv Detail & Related papers (2022-11-01T17:00:23Z)
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for
Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable Text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
arXiv Detail & Related papers (2022-06-05T10:50:34Z)
- Transfer Learning Framework for Low-Resource Text-to-Speech using a
Large-Scale Unlabeled Speech Corpus [10.158584616360669]
Training a text-to-speech (TTS) model requires a large-scale text-labeled speech corpus.
We propose a transfer learning framework for TTS that utilizes a large amount of unlabeled speech dataset for pre-training.
arXiv Detail & Related papers (2022-03-29T11:26:56Z)
- Guided-TTS: Text-to-Speech with Untranscribed Speech [22.548875263927396]
We present Guided-TTS, a high-quality TTS model that learns to generate speech from untranscribed speech data.
For text-to-speech synthesis, we guide the generative process of the unconditional DDPM via phoneme classification to produce mel-spectrograms.
arXiv Detail & Related papers (2021-11-23T10:05:05Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- Proteno: Text Normalization with Limited Data for Fast Deployment in
Text to Speech Systems [15.401574286479546]
Building Text Normalization (TN) systems for Text-to-Speech (TTS) in new languages is hard.
We propose a novel architecture that facilitates TN for multiple languages while using less than 3% of the data used by the state-of-the-art results on English.
We publish the first results on TN for TTS in Spanish and Tamil, and demonstrate that the approach performs comparably to previous work on English.
arXiv Detail & Related papers (2021-04-15T21:14:28Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.