Multilingual Zero Resource Speech Recognition Based on Self-Supervised
Pre-Trained Acoustic Models
- URL: http://arxiv.org/abs/2210.06936v1
- Date: Thu, 13 Oct 2022 12:11:18 GMT
- Title: Multilingual Zero Resource Speech Recognition Based on Self-Supervised
Pre-Trained Acoustic Models
- Authors: Haoyu Wang, Wei-Qiang Zhang, Hongbin Suo, Yulong Wan
- Abstract summary: This paper is the first attempt to extend the use of pre-trained models into word-level zero-resource speech recognition.
This is done by fine-tuning the pre-trained models on IPA phoneme transcriptions and decoding with a language model trained on extra texts.
Experiments on Wav2vec 2.0 and HuBERT models show that this method can achieve a word error rate below 20% on some languages.
- Score: 14.887781621924255
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Labeled audio data is insufficient to build satisfactory speech
recognition systems for most of the world's languages. Some zero-resource
methods attempt phoneme- or word-level speech recognition without labeled
audio in the target language, but their error rates are usually too high for
real-world use. Recently, the representation ability of self-supervised
pre-trained models has been found to be extremely beneficial in zero-resource
phoneme recognition. To the best of our knowledge, this paper is the first
attempt to extend the use of pre-trained models to word-level zero-resource
speech recognition. This is done by fine-tuning the pre-trained models on IPA
phoneme transcriptions and decoding with a language model trained on extra
texts. Experiments on Wav2vec 2.0 and HuBERT models show that this method can
achieve a word error rate below 20% on some languages, with an average of
33.77% across 8 languages.
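A minimal sketch of the recipe the abstract describes, assuming the HuggingFace
transformers API; the checkpoint name, IPA vocabulary, and training loop below
are illustrative assumptions, not the authors' released code:

```python
# Hedged sketch: CTC fine-tuning of a pre-trained wav2vec 2.0 checkpoint
# on IPA phoneme transcriptions. Checkpoint, vocabulary, and hyperparameters
# are illustrative assumptions, not the paper's exact setup.
import torch
from transformers import Wav2Vec2ForCTC

# Assumed IPA inventory; in practice it comes from phonemizing the
# training transcripts with a G2P tool.
ipa_vocab = {"<pad>": 0, "a": 1, "i": 2, "k": 3, "s": 4, "t": 5}  # truncated

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",      # any wav2vec 2.0 / HuBERT checkpoint
    vocab_size=len(ipa_vocab),           # fresh CTC head over IPA phonemes
    ctc_loss_reduction="mean",
    pad_token_id=ipa_vocab["<pad>"],
)
model.freeze_feature_encoder()           # common practice for CTC fine-tuning
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def training_step(waveform: torch.Tensor, phoneme_ids: torch.Tensor) -> float:
    """One fine-tuning step on a (raw audio, IPA phoneme ids) pair."""
    out = model(input_values=waveform, labels=phoneme_ids)  # CTC loss inside
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```

Word-level output would then come from decoding the phoneme posteriors with a
pronunciation lexicon and a language model trained on the extra texts (e.g., an
n-gram LM with a WFST or beam-search decoder); the abstract does not specify
the exact decoder.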
Related papers
- Multilingual self-supervised speech representations improve the speech
recognition of low-resource African languages with codeswitching [65.74653592668743]
Finetuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
When training data is limited, finetuning self-supervised representations is a better-performing and viable solution.
arXiv Detail & Related papers (2023-11-25T17:05:21Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task.
The main ingredients are a new dataset based on readings of publicly available religious texts.
We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
arXiv Detail & Related papers (2023-05-22T22:09:41Z)
- Simple and Effective Zero-shot Cross-lingual Phoneme Recognition [46.76787843369816]
This paper extends previous work on zero-shot cross-lingual transfer learning by fine-tuning a multilingually pretrained wav2vec 2.0 model to transcribe unseen languages.
Experiments show that this simple method significantly outperforms prior work which introduced task-specific architectures.
arXiv Detail & Related papers (2021-09-23T22:50:32Z)
- Improved Language Identification Through Cross-Lingual Self-Supervised Learning [37.32193095549614]
We extend previous self-supervised work on language identification by experimenting with pre-trained models.
Results on a 25-language setup show that, with only 10 minutes of labeled data per language, a cross-lingually pre-trained model can achieve over 93% accuracy.
arXiv Detail & Related papers (2021-07-08T19:37:06Z)
- Unsupervised Speech Recognition [55.864459085947345]
wav2vec-U, short for wav2vec Unsupervised, is a method to train speech recognition models without any labeled data.
We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training (a schematic sketch follows this entry).
On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems trained on 960 hours of labeled data from only two years ago.
arXiv Detail & Related papers (2021-05-24T04:10:47Z)
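To make the adversarial idea above concrete, here is a heavily simplified,
assumption-laden schematic; dimensions, architectures, segmentation, and
wav2vec-U's auxiliary losses are omitted, and this is not the released fairseq
implementation. A generator maps segment-level speech features to phoneme
distributions, and a discriminator tries to distinguish them from one-hot
sequences of phonemized, unpaired text:

```python
# Toy GAN schematic of unsupervised phoneme mapping (assumed sizes/archs).
import torch
import torch.nn as nn

FEAT_DIM, N_PHONES = 512, 40      # assumed sizes

# Generator: segment features -> phoneme distribution per step.
generator = nn.Conv1d(FEAT_DIM, N_PHONES, kernel_size=4, padding=2)
# Discriminator: phoneme-distribution sequence -> real/fake score per step.
discriminator = nn.Sequential(
    nn.Conv1d(N_PHONES, 64, kernel_size=4, padding=2), nn.GELU(),
    nn.Conv1d(64, 1, kernel_size=4, padding=2),
)
bce = nn.BCEWithLogitsLoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

def gan_step(speech_feats: torch.Tensor, text_onehots: torch.Tensor):
    """speech_feats: (B, FEAT_DIM, T) segmented self-supervised features;
    text_onehots: (B, N_PHONES, T') one-hot phonemized unpaired text."""
    fake = generator(speech_feats).softmax(dim=1)

    # Discriminator update: real text -> 1, generated phonemes -> 0.
    d_real = discriminator(text_onehots)
    d_fake = discriminator(fake.detach())
    d_loss = (bce(d_real, torch.ones_like(d_real))
              + bce(d_fake, torch.zeros_like(d_fake)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator update: fool the discriminator.
    g_score = discriminator(fake)
    g_loss = bce(g_score, torch.ones_like(g_score))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```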
- Applying Wav2vec2.0 to Speech Recognition in Various Low-resource Languages [16.001329145018687]
In the speech domain, wav2vec 2.0 has begun to show its powerful representation ability and the feasibility of ultra-low-resource speech recognition on the Librispeech corpus.
However, wav2vec 2.0 has not been examined in real spoken scenarios or on languages other than English.
We apply pre-trained models to solve low-resource speech recognition tasks in various spoken languages.
arXiv Detail & Related papers (2020-12-22T15:59:44Z)
- Towards Zero-shot Learning for Automatic Phonemic Transcription [82.9910512414173]
A more challenging problem is to build phonemic transcribers for languages with zero training data.
Our model is able to recognize unseen phonemes in the target language without any training data.
It achieves a phoneme error rate that is 7.7% better on average than a standard multilingual model.
arXiv Detail & Related papers (2020-02-26T20:38:42Z)
- Multilingual acoustic word embedding models for processing zero-resource languages [37.78342106714364]
We train a single supervised embedding model on labelled data from multiple well-resourced languages.
We then apply it to unseen zero-resource languages (a toy usage sketch follows below).
arXiv Detail & Related papers (2020-02-06T05:53:41Z)
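As a toy illustration of how such acoustic word embeddings get used on a
zero-resource language (all names and shapes here are assumptions, not the
paper's code): embed spoken-word segments with the trained multilingual
encoder, then match a query against candidates by cosine similarity, as in
query-by-example search.

```python
# Hypothetical query-by-example search over acoustic word embeddings.
import torch
import torch.nn.functional as F

def embed(encoder: torch.nn.Module, segment: torch.Tensor) -> torch.Tensor:
    """Map a (T, D) acoustic feature sequence to one fixed-size vector.
    `encoder` stands in for the trained multilingual embedding model."""
    with torch.no_grad():
        return encoder(segment.unsqueeze(0)).squeeze(0)

def query_by_example(encoder, query_seg, candidate_segs):
    """Rank candidate segments by cosine similarity to the query segment."""
    q = embed(encoder, query_seg)
    sims = [F.cosine_similarity(q, embed(encoder, c), dim=0).item()
            for c in candidate_segs]
    return sorted(range(len(sims)), key=sims.__getitem__, reverse=True)
```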