What shall we do with an hour of data? Speech recognition for the un-
and under-served languages of Common Voice
- URL: http://arxiv.org/abs/2105.04674v1
- Date: Mon, 10 May 2021 21:16:28 GMT
- Title: What shall we do with an hour of data? Speech recognition for the un-
and under-served languages of Common Voice
- Authors: Francis M. Tyers and Josh Meyer
- Abstract summary: This report describes the methods and results of a three-week sprint to produce deployable speech recognition models for 31 under-served languages of the Common Voice project.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This technical report describes the methods and results of a three-week
sprint to produce deployable speech recognition models for 31 under-served
languages of the Common Voice project. We outline the preprocessing steps,
hyperparameter selection, and resulting accuracy on official testing sets. In
addition to this we evaluate the models on multiple tasks: closed-vocabulary
speech recognition, pre-transcription, forced alignment, and key-word spotting.
The following experiments use Coqui STT, a toolkit for training and deployment
of neural Speech-to-Text models.
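The accuracy reported on official testing sets in work like this is conventionally measured as word error rate (WER), the word-level edit distance between reference and hypothesis transcripts divided by the reference length. As background, a minimal WER implementation (an illustrative sketch, not code from the report or from Coqui STT):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j]: minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

A perfect transcript scores 0.0; one substituted word in a three-word reference scores 1/3. Production evaluations typically also normalize case and punctuation before scoring, which this sketch omits.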
Related papers
- Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition
We introduce the generative speech transcription error correction (GenSEC) challenge.
This challenge comprises three post-ASR language modeling tasks: (i) post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion recognition.
We discuss insights from baseline evaluations, as well as lessons learned for designing future evaluations.
arXiv Detail & Related papers (2024-09-15T16:32:49Z)
- GRASS: Unified Generation Model for Speech-to-Semantic Tasks
We introduce a unified end-to-end (E2E) framework that generates target text conditioned on a task-related prompt for audio data.
Our proposed model achieves state-of-the-art (SOTA) results on many benchmarks covering speech named entity recognition, speech sentiment analysis, speech question answering, and more.
To facilitate future work on instruction fine-tuning for speech-to-semantic tasks, we release our instruction dataset and code.
arXiv Detail & Related papers (2023-09-06T06:44:26Z)
- Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization
We investigate the emergent abilities of the recently proposed web-scale speech model Whisper, by adapting it to unseen tasks with prompt engineering.
We design task-specific prompts, by either leveraging another large-scale model, or simply manipulating the special tokens in the default prompts.
Experiments show that our proposed prompts improve performance by 10% to 45% on the three zero-shot tasks, and even outperform SotA supervised models on some datasets.
arXiv Detail & Related papers (2023-05-18T16:32:58Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning
We propose VATLM (Visual-Audio-Text Language Model), a unified cross-modal representation learning framework.
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows substantial improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks
We report the first exploration of the prompt tuning paradigm for speech processing tasks based on the Generative Spoken Language Model (GSLM).
Experiment results show that the prompt tuning technique achieves competitive performance in speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models.
arXiv Detail & Related papers (2022-03-31T03:26:55Z)
- Generative Spoken Language Modeling from Raw Audio
Generative spoken language modeling involves jointly learning the acoustic and linguistic characteristics of a language from raw audio alone, without text or labels.
We introduce metrics to automatically evaluate the generated output in terms of acoustic and linguistic quality in two associated end-to-end tasks.
We test baseline systems consisting of a discrete speech encoder (returning discrete, low-bitrate pseudo-text units), a generative language model (trained on pseudo-text units), and a speech decoder.
arXiv Detail & Related papers (2021-02-01T21:41:40Z)
- Multitask Training with Text Data for End-to-End Speech Recognition
We propose a multitask training method for attention-based end-to-end speech recognition models.
We regularize the decoder in a listen, attend, and spell model by multitask training it on both audio-text and text-only data.
arXiv Detail & Related papers (2020-10-27T14:29:28Z)
- Contextualized Spoken Word Representations from Convolutional Autoencoders
This paper proposes a Convolutional Autoencoder based neural architecture to model syntactically and semantically adequate contextualized representations of varying length spoken words.
The proposed model demonstrated its robustness when compared with the two language-based baseline models.
arXiv Detail & Related papers (2020-07-06T16:48:11Z)
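The multitask-training entry above regularizes an attention-based decoder by mixing a paired audio-text objective with a text-only language-model objective. A toy numeric sketch of that mixing (all function names and the 0.3 weight are our own illustrative choices, not from the cited paper):

```python
import math

def cross_entropy(probs, target_idx):
    """Negative log-likelihood of the target symbol under a predicted distribution."""
    return -math.log(probs[target_idx])

def multitask_step(asr_probs, asr_target, lm_probs, lm_target, lm_weight=0.3):
    """Combine a paired audio-text ASR loss with a text-only LM loss.

    The LM term is scaled by a tunable mixing weight, so text-only data
    regularizes the decoder without dominating the ASR objective.
    """
    asr_loss = cross_entropy(asr_probs, asr_target)
    lm_loss = cross_entropy(lm_probs, lm_target)
    return asr_loss + lm_weight * lm_loss
```

In practice both terms are computed over full token sequences by the same decoder, and the weight is tuned on a development set; the sketch only shows the scalar combination.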
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.