Universal Automatic Phonetic Transcription into the International
Phonetic Alphabet
- URL: http://arxiv.org/abs/2308.03917v1
- Date: Mon, 7 Aug 2023 21:29:51 GMT
- Title: Universal Automatic Phonetic Transcription into the International
Phonetic Alphabet
- Authors: Chihiro Taguchi, Yusuke Sakai, Parisa Haghani, David Chiang
- Abstract summary: We present a state-of-the-art model for transcribing speech in any language into the International Phonetic Alphabet (IPA).
Our model is based on wav2vec 2.0 and is fine-tuned to predict IPA from audio input.
We show that the quality of our universal speech-to-IPA models is close to that of human annotators.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a state-of-the-art model for transcribing speech in any
language into the International Phonetic Alphabet (IPA). Transcription of
spoken languages into IPA is an essential yet time-consuming process in
language documentation, and even partially automating this process has the
potential to drastically speed up the documentation of endangered languages.
Like the previous best speech-to-IPA model (Wav2Vec2Phoneme), our model is
based on wav2vec 2.0 and is fine-tuned to predict IPA from audio input. We use
training data from seven languages from CommonVoice 11.0, transcribed into IPA
semi-automatically. Although this training dataset is much smaller than
Wav2Vec2Phoneme's, its higher quality lets our model achieve comparable or
better results. Furthermore, we show that the quality of our universal
speech-to-IPA models is close to that of human annotators.
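As a rough sketch of this recipe (not the authors' released code), a multilingual wav2vec 2.0 checkpoint can be fine-tuned with a CTC head whose output symbols are IPA; the checkpoint name and toy vocabulary below are illustrative assumptions, not the paper's configuration.
```python
# Minimal sketch: fine-tune wav2vec 2.0 with a CTC head over IPA symbols.
import torch
from transformers import Wav2Vec2ForCTC

# Toy IPA symbol inventory; the real model uses a much larger set.
vocab = {"<pad>": 0, "|": 1, "a": 2, "i": 3, "u": 4,
         "p": 5, "t": 6, "k": 7, "ŋ": 8, "ʃ": 9}

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",   # multilingual backbone (assumed choice)
    vocab_size=len(vocab),               # newly initialized IPA output head
    pad_token_id=vocab["<pad>"],         # CTC blank defaults to the pad id
    ctc_loss_reduction="mean",
)

audio = torch.randn(1, 16000)            # 1 s of dummy 16 kHz mono audio
labels = torch.tensor([[2, 5, 3, 6, 4]]) # target IPA ids, e.g. /a p i t u/
out = model(input_values=audio, labels=labels)  # CTC aligns frames to IPA
out.loss.backward()                      # one fine-tuning step (optimizer omitted)
```
CTC removes the need for frame-level alignments, which is what makes semi-automatically IPA-transcribed data usable as training targets.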
Related papers
- The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language [7.0944623704102625]
We show that phoneme-based models for speech processing can achieve strong crosslinguistic generalizability to unseen languages.
We propose CLAP-IPA, a multi-lingual phoneme-speech contrastive embedding model capable of open-vocabulary matching between arbitrary speech signals and phonemic sequences.
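The matching objective is contrastive in spirit; a minimal CLIP-style sketch, with stand-in encoder outputs and an assumed embedding size, might look like:
```python
# Symmetric InfoNCE between paired speech and phoneme-sequence embeddings,
# the general idea behind CLAP-IPA; encoders are elided stand-ins here.
import torch
import torch.nn.functional as F

def contrastive_loss(speech_emb, phoneme_emb, temperature=0.07):
    s = F.normalize(speech_emb, dim=-1)   # (B, D)
    p = F.normalize(phoneme_emb, dim=-1)  # (B, D)
    logits = s @ p.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(s.size(0))     # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```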
arXiv Detail & Related papers (2023-11-14T17:09:07Z)
- Character-Level Bangla Text-to-IPA Transcription Using Transformer Architecture with Sequence Alignment [0.0]
The International Phonetic Alphabet (IPA) is indispensable in language learning and understanding.
Bangla, being the 7th most widely used language, gives rise to the need for IPA in its domain.
In this study, we have utilized a transformer-based sequence-to-sequence model at the letter and symbol level to get the IPA of each Bangla word.
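A bare-bones sketch of such a character-level encoder-decoder (vocabulary sizes and dimensions are arbitrary assumptions, not the paper's configuration):
```python
# Grapheme-to-IPA as sequence-to-sequence: input is a word split into
# characters, output an IPA symbol sequence.
import torch
import torch.nn as nn

SRC_CHARS, TGT_IPA, D = 80, 120, 128          # assumed vocabulary sizes
src_embed = nn.Embedding(SRC_CHARS, D)
tgt_embed = nn.Embedding(TGT_IPA, D)
transformer = nn.Transformer(d_model=D, nhead=4,
                             num_encoder_layers=2, num_decoder_layers=2,
                             batch_first=True)
out_proj = nn.Linear(D, TGT_IPA)

src = torch.randint(0, SRC_CHARS, (1, 10))    # one word as character ids
tgt = torch.randint(0, TGT_IPA, (1, 12))      # shifted IPA target ids
mask = transformer.generate_square_subsequent_mask(tgt.size(1))
logits = out_proj(transformer(src_embed(src), tgt_embed(tgt), tgt_mask=mask))
```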
arXiv Detail & Related papers (2023-11-07T08:20:06Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
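The fusion reduces to one decoder over a joint token space; a toy sketch with assumed vocabulary sizes (not AudioPaLM's actual configuration):
```python
# Discrete audio tokens (e.g. from a speech tokenizer) are appended to the
# text vocabulary so a single LM can consume and generate both modalities.
import torch
import torch.nn as nn

TEXT_VOCAB, AUDIO_VOCAB = 32000, 1024
joint = nn.Embedding(TEXT_VOCAB + AUDIO_VOCAB, 512)  # shared embedding table

def audio_token_to_id(tok):                # audio ids live after text ids
    return TEXT_VOCAB + tok

# A mixed sequence: text prompt followed by audio tokens.
seq = torch.tensor([[17, 99, audio_token_to_id(5), audio_token_to_id(812)]])
embeddings = joint(seq)                    # fed to a standard decoder LM
```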
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages [76.95115818308918]
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages.
This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages.
We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks.
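Random-projection quantization (as in BEST-RQ) derives pre-training targets from frozen random modules; a small sketch with illustrative dimensions:
```python
# A frozen random matrix projects each speech frame; the nearest entry in a
# frozen random codebook becomes the masked-prediction target.
import torch
import torch.nn.functional as F

D_IN, D_PROJ, CODEBOOK = 128, 16, 4096
torch.manual_seed(0)
proj = torch.randn(D_IN, D_PROJ)           # frozen, never trained
codebook = torch.randn(CODEBOOK, D_PROJ)   # frozen, never trained

def rpq_targets(frames):                   # frames: (T, D_IN)
    z = F.normalize(frames @ proj, dim=-1)
    c = F.normalize(codebook, dim=-1)
    return (z @ c.t()).argmax(dim=-1)      # nearest codeword id per frame

targets = rpq_targets(torch.randn(50, D_IN))  # labels for masked prediction
```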
arXiv Detail & Related papers (2023-03-02T07:47:18Z)
- Multilingual Zero Resource Speech Recognition Base on Self-Supervise Pre-Trained Acoustic Models [14.887781621924255]
This paper is the first attempt to extend the use of pre-trained models into word-level zero-resource speech recognition.
It is done by fine-tuning the pre-trained models on IPA phoneme transcriptions and decoding with a language model trained on extra texts.
Experiments on Wav2vec 2.0 and HuBERT models show that this method can achieve less than 20% word error rate on some languages.
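The decoding half of this recipe collapses frame-level CTC outputs into a phoneme string before language-model rescoring; a minimal greedy (no-LM) collapse for illustration:
```python
# CTC emits one symbol per frame, padded with blanks and repeats; decoding
# drops blanks and merges consecutive duplicates. Blank id assumed to be 0.
def ctc_greedy_collapse(frame_ids, blank=0):
    out, prev = [], None
    for t in frame_ids:
        if t != blank and t != prev:   # drop blanks and repeated symbols
            out.append(t)
        prev = t
    return out

# frames predicted as: _ _ k k _ a _ t t  ->  k a t
print(ctc_greedy_collapse([0, 0, 7, 7, 0, 2, 0, 6, 6]))   # [7, 2, 6]
```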
arXiv Detail & Related papers (2022-10-13T12:11:18Z)
- Phonological Features for 0-shot Multilingual Speech Synthesis [50.591267188664666]
We show that code-switching is possible for languages unseen during training, even within monolingual models.
We generate intelligible, code-switched speech in a new language at test time, including the approximation of sounds never seen in training.
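The enabling idea is that phonemes are bundles of articulatory features, so an unseen phoneme is just a new combination of seen features; a toy illustration with made-up feature values:
```python
# Hypothetical feature inventory: [voiced, nasal, labial, high].
# A TTS model conditioned on such vectors, rather than phoneme ids, can
# approximate a sound never seen in training by composing known features.
PHONEME_FEATURES = {
    "p": [0, 0, 1, 0],
    "b": [1, 0, 1, 0],
    "m": [1, 1, 1, 0],
    "i": [1, 0, 0, 1],
}
unseen = [1, 1, 0, 1]   # a voiced nasal high segment absent from training
```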
arXiv Detail & Related papers (2020-08-06T18:25:18Z)
- GIPFA: Generating IPA Pronunciation from Audio [0.0]
In this study, we examine the use of an Artificial Neural Network (ANN) model to automatically extract the IPA phonemic pronunciation of a word.
Based on the French Wikimedia dictionary, we trained our model, which then correctly predicted 75% of the IPA pronunciations tested.
arXiv Detail & Related papers (2020-06-13T06:14:11Z)
- That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages [72.9927937955371]
We use the resources existing in other languages to train a multilingual automatic speech recognition model.
We observe significant improvements across all languages in the multilingual setting, but stark degradation in the crosslingual setting.
Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages.
arXiv Detail & Related papers (2020-05-16T22:28:09Z)
- AlloVera: A Multilingual Allophone Database [137.3686036294502]
AlloVera provides mappings from 218 allophones to phonemes for 14 languages.
We show that a "universal" allophone model, Allosaurus, built with AlloVera, outperforms "universal" phonemic models and language-specific models on a speech-transcription task.
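In miniature, the approach predicts surface allophones universally and collapses them to phonemes with a language-specific table; the English mappings below are illustrative examples, not AlloVera entries:
```python
# Allophone-to-phoneme collapse: a universal acoustic model predicts surface
# sounds, and a per-language table maps them to that language's phonemes.
ALLOPHONE_TO_PHONEME = {
    "eng": {"tʰ": "t", "t": "t", "ɾ": "t",   # 'top', 'stop', 'butter'
            "pʰ": "p", "p": "p"},
}

def allophones_to_phonemes(lang, allophones):
    table = ALLOPHONE_TO_PHONEME[lang]
    return [table.get(a, a) for a in allophones]  # pass through unknown symbols

print(allophones_to_phonemes("eng", ["tʰ", "ɑ", "p"]))   # ['t', 'ɑ', 'p']
```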
arXiv Detail & Related papers (2020-04-17T02:02:18Z)
- Towards Zero-shot Learning for Automatic Phonemic Transcription [82.9910512414173]
A more challenging problem is to build phonemic transcribers for languages with zero training data.
Our model is able to recognize unseen phonemes in the target language without any training data.
It achieves 7.7% better phoneme error rate on average over a standard multilingual model.
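One way to realize such zero-shot recognition is to score universal phonetic features and snap to the nearest phoneme in the target language's inventory; a sketch with made-up feature vectors:
```python
# An unseen phoneme is recognized as the inventory member closest to the
# model's predicted feature vector; vectors here are purely illustrative.
import numpy as np

INVENTORY = {"ʈ": np.array([0, 1, 0, 1]),   # target-language phonemes as
             "ɖ": np.array([1, 1, 0, 1]),   # (made-up) feature vectors
             "a": np.array([1, 0, 1, 0])}

def nearest_phoneme(feature_vec):
    return min(INVENTORY, key=lambda p: np.linalg.norm(INVENTORY[p] - feature_vec))

print(nearest_phoneme(np.array([0.9, 1.0, 0.1, 0.8])))   # -> 'ɖ'
```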
arXiv Detail & Related papers (2020-02-26T20:38:42Z)