GIPFA: Generating IPA Pronunciation from Audio
- URL: http://arxiv.org/abs/2006.07573v2
- Date: Tue, 21 Sep 2021 19:53:39 GMT
- Title: GIPFA: Generating IPA Pronunciation from Audio
- Authors: Xavier Marjou
- Abstract summary: In this study, we examine the use of an Artificial Neural Network (ANN) model to automatically extract the IPA phonemic pronunciation of a word.
Based on the French Wikimedia dictionary, we trained our model, which then correctly predicted 75% of the IPA pronunciations tested.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Transcribing spoken audio samples into the International Phonetic Alphabet
(IPA) has long been reserved for experts. In this study, we examine the use of
an Artificial Neural Network (ANN) model to automatically extract the IPA
phonemic pronunciation of a word based on its audio pronunciation, hence its
name Generating IPA Pronunciation From Audio (GIPFA). Based on the French
Wikimedia dictionary, we trained our model, which then correctly predicted 75%
of the IPA pronunciations tested. Interestingly, studying the model's inference
errors made it possible to highlight possible errors in the dataset as well as
to identify the closest phonemes in French.
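As a rough illustration of the task, an acoustic encoder feeding a CTC-style phoneme classifier can map audio features to an IPA string. The sketch below is not the paper's published architecture; the MFCC front end, the layer sizes, and the toy phoneme inventory are all illustrative assumptions.

```python
# Minimal sketch of an audio-to-IPA model in the spirit of GIPFA.
# NOT the paper's architecture: the MFCC front end, layer sizes, and
# the toy French phoneme inventory below are illustrative assumptions.
import torch
import torch.nn as nn

IPA_PHONEMES = ["a", "ɛ", "ɔ", "u", "ʁ", "s", "t"]  # toy subset of French
BLANK = 0  # CTC blank index; real phonemes start at index 1

class AudioToIPA(nn.Module):
    def __init__(self, n_mfcc=13, hidden=128):
        super().__init__()
        self.encoder = nn.LSTM(n_mfcc, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, len(IPA_PHONEMES) + 1)

    def forward(self, mfcc):
        # mfcc: (batch, frames, n_mfcc) -> per-frame log-probabilities
        out, _ = self.encoder(mfcc)
        return self.classifier(out).log_softmax(dim=-1)

model = AudioToIPA()
features = torch.randn(1, 200, 13)   # ~2 s of audio as 13-dim MFCC frames
log_probs = model(features)          # shape (1, 200, 8)

# Greedy CTC decoding: collapse repeated symbols, then drop blanks.
ids = log_probs.argmax(dim=-1).squeeze(0).tolist()
decoded, prev = [], None
for i in ids:
    if i != prev and i != BLANK:
        decoded.append(IPA_PHONEMES[i - 1])
    prev = i
print("/" + "".join(decoded) + "/")
```

With random weights the decoded string is meaningless; training on audio/IPA pairs with `nn.CTCLoss` would be the corresponding learning step.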
Related papers
- IPA Transcription of Bengali Texts [0.2113150621171959]
The International Phonetic Alphabet (IPA) serves to systematize phonemes in language.
In Bengali phonology and phonetics, scholarly debate persists concerning the IPA standard and core Bengali phonemes.
This work examines prior research, identifies current and potential issues, and suggests a framework for a Bengali IPA standard.
arXiv Detail & Related papers (2024-03-29T09:33:34Z)
- The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language [7.0944623704102625]
We show that phoneme-based models for speech processing can achieve strong crosslinguistic generalizability to unseen languages.
We propose CLAP-IPA, a multi-lingual phoneme-speech contrastive embedding model capable of open-vocabulary matching between arbitrary speech signals and phonemic sequences.
arXiv Detail & Related papers (2023-11-14T17:09:07Z)
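The contrastive matching idea behind CLAP-IPA above can be pictured as scoring one speech embedding against embeddings of candidate phoneme sequences. The toy sketch below uses untrained stand-in encoders; the dimensions, vocabulary size, and candidate IDs are made up, and it does not reproduce the released model.

```python
# Toy sketch of contrastive speech-phoneme matching in the style of
# CLAP-IPA; both encoders are untrained stand-ins, not the real model.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
speech_encoder = torch.nn.Linear(80, 64)      # pooled mel features -> 64-d
phoneme_encoder = torch.nn.Embedding(50, 64)  # toy 50-symbol IPA vocabulary

speech = torch.randn(1, 80)                   # pooled features for one clip
candidates = torch.tensor([[3, 7], [3, 9]])   # two candidate IPA ID sequences

speech_emb = F.normalize(speech_encoder(speech), dim=-1)
# Mean-pool phoneme embeddings per sequence, then normalize.
phoneme_emb = F.normalize(phoneme_encoder(candidates).mean(dim=1), dim=-1)

scores = speech_emb @ phoneme_emb.T           # cosine similarities (1, 2)
print(scores.softmax(dim=-1))                 # ranking over candidates
```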
- Character-Level Bangla Text-to-IPA Transcription Using Transformer Architecture with Sequence Alignment [0.0]
The International Phonetic Alphabet (IPA) is indispensable in language learning and understanding.
Bangla, being the 7th most widely used language globally, gives rise to the need for IPA in its domain.
In this study, we have utilized a transformer-based sequence-to-sequence model at the letter and symbol level to get the IPA of each Bangla word.
arXiv Detail & Related papers (2023-11-07T08:20:06Z)
- Universal Automatic Phonetic Transcription into the International Phonetic Alphabet [21.000425416084706]
We present a state-of-the-art model for transcribing speech in any language into the International Phonetic Alphabet (IPA).
Our model is based on wav2vec 2.0 and is fine-tuned to predict IPA from audio input.
We show that the quality of our universal speech-to-IPA models is close to that of human annotators.
arXiv Detail & Related papers (2023-08-07T21:29:51Z)
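The fine-tuned wav2vec 2.0 setup described in the entry above can be approximated with off-the-shelf tooling. The snippet below leans on a public phoneme-recognition checkpoint (facebook/wav2vec2-lv-60-espeak-cv-ft) as a stand-in, not the paper's released weights, and the silent placeholder should be replaced with real 16 kHz mono audio.

```python
# Illustrative speech-to-phoneme transcription with a wav2vec 2.0 CTC
# model. The checkpoint is a public phoneme recognizer used here as a
# stand-in; this snippet does not reproduce the paper's model.
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

name = "facebook/wav2vec2-lv-60-espeak-cv-ft"
processor = Wav2Vec2Processor.from_pretrained(name)
model = Wav2Vec2ForCTC.from_pretrained(name)

speech = torch.zeros(16000)  # 1 s of silence; use real 16 kHz mono audio
inputs = processor(speech.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(ids))  # CTC-decoded phoneme symbols
```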
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling [92.55131711064935]
We propose a cross-lingual neural language model, VALL-E X, for cross-lingual speech synthesis.
VALL-E X inherits VALL-E's strong in-context learning capabilities and can be applied to zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks.
It can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment.
arXiv Detail & Related papers (2023-03-07T14:31:55Z)
- IPA-CLIP: Integrating Phonetic Priors into Vision and Language Pretraining [8.129944388402839]
This paper inserts a phonetic prior into Contrastive Language-Image Pretraining (CLIP).
IPA-CLIP comprises a pronunciation encoder alongside the original CLIP encoders (image and text).
arXiv Detail & Related papers (2023-03-06T13:59:37Z)
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
arXiv Detail & Related papers (2022-06-05T10:50:34Z)
- Phonological Features for 0-shot Multilingual Speech Synthesis [50.591267188664666]
We show that code-switching is possible for languages unseen during training, even within monolingual models.
We generate intelligible, code-switched speech in a new language at test time, including the approximation of sounds never seen in training.
arXiv Detail & Related papers (2020-08-06T18:25:18Z)
- AlloVera: A Multilingual Allophone Database [137.3686036294502]
AlloVera provides mappings from 218 allophones to phonemes for 14 languages.
We show that a "universal" allophone model, Allosaurus, built with AlloVera, outperforms "universal" phonemic models and language-specific models on a speech-transcription task.
arXiv Detail & Related papers (2020-04-17T02:02:18Z)
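AlloVera's mappings can be pictured as a per-language lookup that collapses a narrow (allophone-level) transcription to a phonemic one. The fragment below is a hypothetical, heavily simplified English example, not actual database entries.

```python
# Hypothetical, simplified allophone-to-phoneme mapping in the spirit of
# AlloVera; real entries are per-language, curated, and far larger (and a
# flap like [ɾ] can realize /d/ as well as /t/, which is ignored here).
ALLOPHONE_TO_PHONEME = {
    "tʰ": "t",  # aspirated stop, as in "top"
    "ɾ": "t",   # flap, as in "butter"
    "ʔ": "t",   # glottal stop, as in some dialects' "button"
    "t": "t",   # plain stop
}

def collapse(narrow):
    """Map a narrow (allophone-level) transcription to phonemes."""
    return [ALLOPHONE_TO_PHONEME.get(seg, seg) for seg in narrow]

print(collapse(["b", "ʌ", "ɾ", "ɚ"]))  # ['b', 'ʌ', 't', 'ɚ'] ("butter")
```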
- Towards Zero-shot Learning for Automatic Phonemic Transcription [82.9910512414173]
A more challenging problem is to build phonemic transcribers for languages with zero training data.
Our model is able to recognize unseen phonemes in the target language without any training data.
It achieves 7.7% better phoneme error rate on average over a standard multilingual model.
arXiv Detail & Related papers (2020-02-26T20:38:42Z)
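A common route to the zero-shot setting in the entry above is to predict articulatory features rather than phoneme identities, then match the predicted feature vector against the target language's inventory. Whether this mirrors the paper's exact method is not stated here; the feature set and vectors below are illustrative assumptions.

```python
# Hypothetical sketch of zero-shot phoneme recognition via articulatory
# features: a model trained on other languages predicts per-feature
# scores, and the unseen phoneme is recovered by nearest-feature match.
FEATURES = ("voiced", "nasal", "bilabial", "stop")  # toy feature set

PHONEME_FEATURES = {  # illustrative binary feature vectors
    "p": (0, 0, 1, 1),
    "b": (1, 0, 1, 1),
    "m": (1, 1, 1, 1),
    "t": (0, 0, 0, 1),
}

def nearest_phoneme(predicted, inventory):
    """Return the inventory phoneme whose features best match predictions."""
    def distance(p):
        return sum(a != b for a, b in zip(predicted, PHONEME_FEATURES[p]))
    return min(inventory, key=distance)

# Even if /b/ never appeared in training, thresholded feature predictions
# of (1, 0, 1, 1) select it from the target-language inventory.
print(nearest_phoneme((1, 0, 1, 1), ["p", "b", "m", "t"]))  # -> b
```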