SPGISpeech: 5,000 hours of transcribed financial audio for fully
formatted end-to-end speech recognition
- URL: http://arxiv.org/abs/2104.02014v2
- Date: Tue, 6 Apr 2021 04:22:48 GMT
- Title: SPGISpeech: 5,000 hours of transcribed financial audio for fully
formatted end-to-end speech recognition
- Authors: Patrick K. O'Neill, Vitaly Lavrukhin, Somshubra Majumdar, Vahid
Noroozi, Yuekai Zhang, Oleksii Kuchaiev, Jagadeesh Balam, Yuliya Dovzhenko,
Keenan Freyberg, Michael D. Shulman, Boris Ginsburg, Shinji Watanabe, and
Georg Kucsko
- Abstract summary: In the English speech-to-text (STT) machine learning task, acoustic models are conventionally trained on uncased Latin characters.
Here we propose a new STT task: end-to-end neural transcription with fully formatted text for target labels.
We present baseline Conformer-based models trained on a corpus of 5,000 hours of professionally transcribed earnings calls, achieving a CER of 1.7.
- Score: 38.96077127913159
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the English speech-to-text (STT) machine learning task, acoustic models
are conventionally trained on uncased Latin characters, and any necessary
orthography (such as capitalization, punctuation, and denormalization of
non-standard words) is imputed by separate post-processing models. This adds
complexity and limits performance, as many formatting tasks benefit from
semantic information present in the acoustic signal but absent in
transcription. Here we propose a new STT task: end-to-end neural transcription
with fully formatted text for target labels. We present baseline
Conformer-based models trained on a corpus of 5,000 hours of professionally
transcribed earnings calls, achieving a CER of 1.7. As a contribution to the
STT research community, we release the corpus free for non-commercial use at
https://datasets.kensho.com/datasets/scribe.
Related papers
- SyllableLM: Learning Coarse Semantic Units for Speech Language Models [21.762112843104028]
We introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units.
Our method produces controllable-rate semantic units at as low as 5Hz and 60bps and SotA inc segmentation and clustering.
SyllableLM achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
arXiv Detail & Related papers (2024-10-05T04:29:55Z) - Algorithms For Automatic Accentuation And Transcription Of Russian Texts In Speech Recognition Systems [0.0]
This paper presents an overview of rule-based system for automatic accentuation and phonemic transcription of Russian texts.
Two parts of the developed system, accentuation and transcription, use different approaches to achieve correct phonemic representations of input phrases.
The developed toolkit is written in the Python language and is accessible on GitHub for any researcher interested.
arXiv Detail & Related papers (2024-10-03T14:43:43Z) - Text-To-Speech Synthesis In The Wild [76.71096751337888]
Text-to-speech (TTS) systems are traditionally trained using modest databases of studio-quality, prompted or read speech collected in benign acoustic environments such as anechoic rooms.
We introduce the TTS In the Wild (TITW) dataset, the result of a fully automated pipeline, applied to the VoxCeleb1 dataset commonly used for speaker recognition.
We show that a number of recent TTS models can be trained successfully using TITW-Easy, but that it remains extremely challenging to produce similar results using TITW-Hard.
arXiv Detail & Related papers (2024-09-13T10:58:55Z) - Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive
Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z) - T5lephone: Bridging Speech and Text Self-supervised Models for Spoken
Language Understanding via Phoneme level T5 [65.32642587901903]
We conduct extensive studies on how PLMs with different tokenization strategies affect spoken language understanding task.
We extend the idea to create T5lephone, a variant of T5 that is pretrained using phonemicized text.
arXiv Detail & Related papers (2022-11-01T17:00:23Z) - Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for
Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable Text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
arXiv Detail & Related papers (2022-06-05T10:50:34Z) - Transfer Learning Framework for Low-Resource Text-to-Speech using a
Large-Scale Unlabeled Speech Corpus [10.158584616360669]
Training a text-to-speech (TTS) model requires a large scale text labeled speech corpus.
We propose a transfer learning framework for TTS that utilizes a large amount of unlabeled speech dataset for pre-training.
arXiv Detail & Related papers (2022-03-29T11:26:56Z) - Guided-TTS:Text-to-Speech with Untranscribed Speech [22.548875263927396]
We present Guided-TTS, a high-quality TTS model that learns to generate speech from untranscribed speech data.
For text-to-speech synthesis, we guide the generative process of the unconditional DDPM via phoneme classification to produce mel-spectrograms.
arXiv Detail & Related papers (2021-11-23T10:05:05Z) - Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.