Learning to pronounce as measuring cross lingual joint
orthography-phonology complexity
- URL: http://arxiv.org/abs/2202.00794v1
- Date: Sat, 29 Jan 2022 14:44:39 GMT
- Title: Learning to pronounce as measuring cross lingual joint
orthography-phonology complexity
- Authors: Domenic Rosati
- Abstract summary: We investigate what makes a language "hard to pronounce" by modelling the task of grapheme-to-phoneme (g2p) transliteration.
We show that certain characteristics emerge that separate easier and harder languages with respect to learning to pronounce.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work has demonstrated that machine learning models allow us to compare
languages by showing how hard each language might be to learn under specific
tasks. Following this line of work, we investigate what makes a
language "hard to pronounce" by modelling the task of grapheme-to-phoneme (g2p)
transliteration. By training a character-level transformer model on this task
across 22 languages and measuring the model's proficiency against its grapheme
and phoneme inventories, we show that certain characteristics emerge that
separate easier and harder languages with respect to learning to pronounce.
Namely, the complexity of a language's pronunciation from its orthography is
due to how expressive or simple its grapheme-to-phoneme mapping is. Further
discussion illustrates how future studies should consider relative data
sparsity per language in order to design fairer cross-lingual comparison
tasks.
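As a rough illustration of the quantities the abstract mentions, the sketch below computes grapheme and phoneme inventories from a pronunciation lexicon and scores predictions with a phoneme error rate (PER). This is not the paper's code: the toy lexicon, the ARPAbet-style phoneme symbols, the function names, and the choice of PER as the proficiency measure are all assumptions made for the example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of symbols."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(pairs):
    """pairs: list of (reference phonemes, predicted phonemes)."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total = sum(len(ref) for ref, _ in pairs)
    return errors / total


def inventories(lexicon):
    """Return the sets of graphemes and phonemes seen in a lexicon."""
    graphemes = {ch for word, _ in lexicon for ch in word}
    phonemes = {p for _, phones in lexicon for p in phones}
    return graphemes, phonemes


# Hypothetical toy lexicon: (spelling, phoneme sequence).
lexicon = [
    ("cat", ["K", "AE", "T"]),
    ("city", ["S", "IH", "T", "IY"]),
    ("kit", ["K", "IH", "T"]),
]

# Hypothetical model outputs: "city" is mispronounced with a hard "c".
predictions = [
    ["K", "AE", "T"],
    ["K", "IH", "T", "IY"],
    ["K", "IH", "T"],
]

graphemes, phonemes = inventories(lexicon)
pairs = [(ref, hyp) for (_, ref), hyp in zip(lexicon, predictions)]
per = phoneme_error_rate(pairs)
print(len(graphemes), len(phonemes), per)
```

In this toy case the grapheme "c" maps to two different phonemes, which is the kind of one-to-many mapping that can make a language harder for a model to learn to pronounce.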
Related papers
- Small Language Models Like Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas [7.585433383340306]
We show that small models based on the Llama architecture can achieve strong linguistic performance on standard syntactic and novel lexical/phonetic benchmarks.
Our findings suggest a promising direction for creating more linguistically plausible language models that are better suited for computational studies of language acquisition and processing.
arXiv Detail & Related papers (2024-10-02T12:36:08Z)
- Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement [1.4335183427838039]
We take the approach of developing curated synthetic data on a large scale, with specific properties.
We use a new multiple-choice task and datasets, Blackbird Language Matrices, to focus on a specific grammatical structural phenomenon.
We show that despite having been trained on multilingual texts in a consistent manner, multilingual pretrained language models have language-specific differences.
arXiv Detail & Related papers (2024-09-10T14:58:55Z)
- The Role of Language Imbalance in Cross-lingual Generalisation: Insights from Cloned Language Experiments [57.273662221547056]
In this study, we investigate an unintuitive novel driver of cross-lingual generalisation: language imbalance.
We observe that the existence of a predominant language during training boosts the performance of less frequent languages.
As we extend our analysis to real languages, we find that infrequent languages still benefit from frequent ones, yet it remains inconclusive whether language imbalance causes cross-lingual generalisation there.
arXiv Detail & Related papers (2024-04-11T17:58:05Z)
- Information-Theoretic Characterization of Vowel Harmony: A Cross-Linguistic Study on Word Lists [18.138642719651994]
We define an information-theoretic measure of harmonicity based on predictability of vowels in a natural language lexicon.
We estimate this harmonicity using phoneme-level language models (PLMs).
Our work demonstrates that word lists are a valuable resource for typological research.
arXiv Detail & Related papers (2023-08-09T11:32:16Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that: (1) multi-lingual models with more data outperform monolingual ones, but, when keeping the amount of data fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- Discovering Phonetic Inventories with Crosslingual Automatic Speech Recognition [71.49308685090324]
This paper investigates the influence of different factors (i.e., model architecture, phonotactic model, type of speech representation) on phone recognition in an unknown language.
We find that unique sounds, similar sounds, and tone languages remain a major challenge for phonetic inventory discovery.
arXiv Detail & Related papers (2022-01-26T22:12:55Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- LSTM Acoustic Models Learn to Align and Pronounce with Graphemes [22.453756228457017]
We propose a grapheme-based speech recognizer that can be trained in a purely data-driven fashion.
We show that the grapheme models are competitive in WER with their phoneme-output counterparts when trained on large datasets.
arXiv Detail & Related papers (2020-08-13T21:38:36Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.