LSTM Acoustic Models Learn to Align and Pronounce with Graphemes
- URL: http://arxiv.org/abs/2008.06121v1
- Date: Thu, 13 Aug 2020 21:38:36 GMT
- Title: LSTM Acoustic Models Learn to Align and Pronounce with Graphemes
- Authors: Arindrima Datta, Guanlong Zhao, Bhuvana Ramabhadran, Eugene Weinstein
- Abstract summary: We propose a grapheme-based speech recognizer that can be trained in a purely data-driven fashion.
We show that the grapheme models are competitive in WER with their phoneme-output counterparts when trained on large datasets.
- Score: 22.453756228457017
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Automated speech recognition coverage of the world's languages continues to
expand. However, standard phoneme-based systems require handcrafted lexicons
that are difficult and expensive to obtain. To address this problem, we propose
a training methodology for a grapheme-based speech recognizer that can be
trained in a purely data-driven fashion. Built with LSTM networks and trained
with the cross-entropy loss, the grapheme-output acoustic models we study are
also extremely practical for real-world applications as they can be decoded
with conventional ASR stack components such as language models and FST
decoders, and produce good quality audio-to-grapheme alignments that are useful
in many speech applications. We show that the grapheme models are competitive
in WER with their phoneme-output counterparts when trained on large datasets,
with the advantage that grapheme models do not require explicit linguistic
knowledge as an input. We further compare the alignments generated by the
phoneme and grapheme models to demonstrate the quality of the pronunciations
learnt by them using four Indian languages that vary linguistically in spoken
and written forms.
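To make the setup concrete, below is a minimal, hypothetical sketch of a grapheme-output LSTM acoustic model trained with frame-level cross-entropy, in the spirit of the abstract. All dimensions, layer counts, learning rate, and the toy data are illustrative assumptions, not the paper's actual configuration.
```python
# Minimal sketch of a grapheme-output LSTM acoustic model trained with
# frame-level cross-entropy. All dimensions, layer counts, and the toy
# data are illustrative assumptions, not the paper's actual configuration.
import torch
import torch.nn as nn

FEATURE_DIM = 80      # e.g. log-mel filterbank features (assumed)
NUM_GRAPHEMES = 30    # e.g. letters plus silence/word-boundary symbols (assumed)

class GraphemeLSTM(nn.Module):
    def __init__(self, feature_dim: int, num_graphemes: int, hidden: int = 512):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, num_graphemes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feature_dim) -> per-frame grapheme logits
        out, _ = self.lstm(feats)
        return self.proj(out)

model = GraphemeLSTM(FEATURE_DIM, NUM_GRAPHEMES)
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Toy batch: random features and a random frame-level grapheme alignment
# (cross-entropy training assumes a per-frame target label).
feats = torch.randn(4, 100, FEATURE_DIM)
targets = torch.randint(0, NUM_GRAPHEMES, (4, 100))

logits = model(feats)  # (4, 100, NUM_GRAPHEMES)
loss = loss_fn(logits.reshape(-1, NUM_GRAPHEMES), targets.reshape(-1))
opt.zero_grad()
loss.backward()
opt.step()

# At inference, per-frame argmax over grapheme posteriors gives an
# audio-to-grapheme alignment; in a full ASR stack these posteriors
# would be decoded with a language model and an FST decoder.
alignment = logits.argmax(dim=-1)
```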
Related papers
- TIPAA-SSL: Text Independent Phone-to-Audio Alignment based on Self-Supervised Learning and Knowledge Transfer [3.9981390090442694]
We present a novel approach for text-independent phone-to-audio alignment based on phoneme recognition, representation learning and knowledge transfer.
We evaluate our model using synthetic native data from the TIMIT dataset and the SCRIBE dataset for American and British English.
Our proposed model outperforms the state-of-the-art (charsiu) in statistical metrics and has applications in language learning and speech processing systems.
arXiv Detail & Related papers (2024-05-03T14:25:21Z)
- Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy: pre-training with discretized visual speech units.
We set new state-of-the-art multilingual VSR performance, achieving results comparable to previous language-specific VSR models.
arXiv Detail & Related papers (2024-01-18T08:46:02Z)
- On decoder-only architecture for speech-to-text and large language model integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models.
We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
arXiv Detail & Related papers (2023-07-08T06:47:58Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Language-agnostic Code-Switching in Sequence-To-Sequence Speech Recognition [62.997667081978825]
Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels of different source languages are concatenated.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training by 5.03% WER (a minimal sketch of the augmentation follows this entry).
arXiv Detail & Related papers (2022-10-17T12:15:57Z)
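As referenced above, a minimal sketch of concatenation-style code-switching augmentation, assuming equal sample rates across utterances; the function name and all data are hypothetical, not from the paper.
```python
# Minimal sketch of concatenation-style code-switching augmentation:
# splice audio and the corresponding label sequences from utterances in
# different source languages into one training example. Equal sample
# rates are assumed; the function name and all data are hypothetical.
import numpy as np

def concat_augment(utterances):
    """utterances: list of (waveform: np.ndarray, labels: list[str])."""
    audio = np.concatenate([wav for wav, _ in utterances])
    labels = [tok for _, toks in utterances for tok in toks]
    return audio, labels

# Toy example: one "English" and one "German" utterance (random stand-ins).
english = (np.random.randn(16000), ["hello", "world"])
german = (np.random.randn(16000), ["guten", "tag"])
mixed_audio, mixed_labels = concat_augment([english, german])
print(len(mixed_audio), mixed_labels)  # 32000 ['hello', 'world', 'guten', 'tag']
```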
- Classification of Phonological Parameters in Sign Languages [0.0]
Linguistic research often breaks down signs into constituent parts to study sign languages.
We show how a single model can be used to recognise the individual phonological parameters within sign languages.
arXiv Detail & Related papers (2022-05-24T13:40:45Z)
- Learning to pronounce as measuring cross lingual joint orthography-phonology complexity [0.0]
We investigate what makes a language "hard to pronounce" by modelling the task of grapheme-to-phoneme (g2p) transliteration.
We show that certain characteristics emerge that separate easier and harder languages with respect to learning to pronounce (a toy complexity proxy is sketched after this entry).
arXiv Detail & Related papers (2022-01-29T14:44:39Z)
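As referenced in the entry above, the sketch below shows one crude proxy for joint orthography-phonology complexity: the conditional entropy of a phoneme given its aligned grapheme, estimated from a toy lexicon. This is an illustrative assumption, not the paper's actual measure; higher entropy suggests spelling predicts pronunciation less reliably.
```python
# Crude, hypothetical proxy for joint orthography-phonology complexity
# (NOT the paper's method): conditional entropy H(phoneme | grapheme)
# estimated from one-to-one aligned grapheme-phoneme pairs in a toy lexicon.
from collections import Counter
from math import log2

def g2p_conditional_entropy(pairs):
    """pairs: iterable of (grapheme, phoneme) tuples, aligned one-to-one."""
    pairs = list(pairs)
    joint = Counter(pairs)
    grapheme_totals = Counter(g for g, _ in pairs)
    n = len(pairs)
    h = 0.0
    for (g, _), c in joint.items():
        h -= (c / n) * log2(c / grapheme_totals[g])
    return h

# Toy data: a "transparent" orthography vs. an ambiguous one.
transparent = [("a", "a"), ("b", "b"), ("a", "a"), ("b", "b")]
ambiguous = [("a", "a"), ("a", "e"), ("b", "b"), ("b", "p")]
print(g2p_conditional_entropy(transparent))  # 0.0
print(g2p_conditional_entropy(ambiguous))    # 1.0
```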
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Differentiable Allophone Graphs for Language-Universal Speech Recognition [77.2981317283029]
Building language-universal speech recognition systems entails producing phonological units of spoken sound that can be shared across languages.
We present a general framework to derive phone-level supervision from only phonemic transcriptions and phone-to-phoneme mappings.
We build a universal phone-based speech recognition model with interpretable probabilistic phone-to-phoneme mappings for each language (a toy illustration of such a mapping follows this entry).
arXiv Detail & Related papers (2021-07-24T15:09:32Z)
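As referenced in the entry above, a toy illustration of a probabilistic phone-to-phoneme mapping: universal phone posteriors are marginalized into one language's phoneme posteriors through a mapping matrix. The inventories and probabilities here are invented for illustration; the paper's differentiable allophone graphs are more general than this.
```python
# Toy illustration of a probabilistic phone-to-phoneme mapping: universal
# phone posteriors are marginalized into one language's phoneme posteriors
# through a mapping matrix. Inventories and probabilities are invented.
import numpy as np

phones = ["p", "ph", "t"]    # universal phone inventory (toy)
phonemes = ["/p/", "/t/"]    # one language's phoneme inventory (toy)

# map_matrix[i, j] = P(phoneme_j | phone_i); each row sums to 1.
# In this toy language, plain [p] and aspirated [ph] are allophones of /p/.
map_matrix = np.array([
    [1.0, 0.0],   # p  -> /p/
    [1.0, 0.0],   # ph -> /p/
    [0.0, 1.0],   # t  -> /t/
])

phone_posterior = np.array([0.6, 0.3, 0.1])    # from a universal phone model
phoneme_posterior = phone_posterior @ map_matrix
print(dict(zip(phonemes, phoneme_posterior)))  # {'/p/': 0.9, '/t/': 0.1}
```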
- A systematic comparison of grapheme-based vs. phoneme-based label units for encoder-decoder-attention models [42.761409598613845]
We do a systematic comparison between grapheme- and phoneme-based output labels for an encoder-decoder-attention ASR model.
Experiments performed on the Switchboard 300h and LibriSpeech benchmarks show that phoneme-based modeling is competitive with grapheme-based encoder-decoder-attention modeling.
arXiv Detail & Related papers (2020-05-19T09:54:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.