Phonotactic Complexity and its Trade-offs
- URL: http://arxiv.org/abs/2005.03774v1
- Date: Thu, 7 May 2020 21:36:59 GMT
- Title: Phonotactic Complexity and its Trade-offs
- Authors: Tiago Pimentel, Brian Roark, Ryan Cotterell
- Abstract summary: This simple measure allows us to compare the entropy across languages.
We demonstrate a very strong negative correlation of -0.74 between bits per phoneme and the average length of words.
- Score: 73.10961848460613
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present methods for calculating a measure of phonotactic complexity, bits
per phoneme, that permits a straightforward cross-linguistic comparison. When
given a word, represented as a sequence of phonemic segments such as symbols in
the international phonetic alphabet, and a statistical model trained on a
sample of word types from the language, we can approximately measure bits per
phoneme using the negative log-probability of that word under the model. This
simple measure allows us to compare the entropy across languages, giving
insight into how complex a language's phonotactics are. Using a collection of
1016 basic concept words across 106 languages, we demonstrate a very strong
negative correlation of -0.74 between bits per phoneme and the average length
of words.
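The measure described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the statistical model, so the add-one-smoothed phoneme bigram model below is a stand-in chosen for brevity, not the authors' model. The toy lexicon, the function names `train_bigram` and `bits_per_phoneme`, and the choice to divide by the word length plus one (counting the end-of-word symbol as a predicted event) are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical boundary symbols; not taken from the paper.
BOS, EOS = "<s>", "</s>"


def train_bigram(words):
    """Count phoneme bigrams over a sample of word types.

    Each word is a sequence of phoneme symbols (e.g. IPA segments).
    """
    bigrams, contexts, vocab = Counter(), Counter(), {EOS}
    for word in words:
        seq = [BOS] + list(word) + [EOS]
        vocab.update(word)
        for prev, cur in zip(seq, seq[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab


def bits_per_phoneme(word, model):
    """Negative log2-probability of `word` under the model, per predicted symbol.

    Assumption: the end-of-word symbol counts as a predicted symbol,
    so the total is divided by len(word) + 1.
    """
    bigrams, contexts, vocab = model
    seq = [BOS] + list(word) + [EOS]
    total_bits = 0.0
    for prev, cur in zip(seq, seq[1:]):
        # Add-one smoothing keeps unseen transitions at non-zero probability.
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab))
        total_bits -= math.log2(p)
    return total_bits / (len(word) + 1)


# Toy lexicon of word types, invented for this example.
toy_words = [["k", "a", "t"], ["k", "a", "p"], ["t", "a", "k"]]
model = train_bigram(toy_words)
print(bits_per_phoneme(["k", "a", "t"], model))  # phonotactically typical word
print(bits_per_phoneme(["p", "t", "k"], model))  # atypical sequence, more bits
```

Averaging this quantity over held-out words gives an approximate per-language figure that could then be compared against average word length. With a stronger model the estimate would tighten, since any model's cross-entropy is an upper bound on the true entropy.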
Related papers
- Small Language Models Like Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas [7.585433383340306]
We show that small models based on the Llama architecture can achieve strong linguistic performance on standard syntactic and novel lexical/phonetic benchmarks.
Our findings suggest a promising direction for creating more linguistically plausible language models that are better suited for computational studies of language acquisition and processing.
arXiv Detail & Related papers (2024-10-02T12:36:08Z)
- The Development of a Comprehensive Spanish Dictionary for Phonetic and Lexical Tagging in Socio-phonetic Research (ESPADA) [0.0]
I present the creation of a comprehensive pronunciation dictionary in Spanish (ESPADA) that can be used in most of the dialect variants of Spanish data.
ESPADA is the most complete dictionary with more than 628,000 entries, representing words from 16 countries.
This aims to equip socio-phonetic researchers with a complete open-source tool that enhances dialectal research within socio-phonetic frameworks in the Spanish language.
arXiv Detail & Related papers (2024-07-22T04:51:33Z)
- On The Ingredients of an Effective Zero-shot Semantic Parser [95.01623036661468]
We analyze zero-shot learning by paraphrasing training examples of canonical utterances and programs from a grammar.
We propose bridging these gaps using improved grammars, stronger paraphrasers, and efficient learning methods.
Our model achieves strong performance on two semantic parsing benchmarks (Scholar, Geo) with zero labeled data.
arXiv Detail & Related papers (2021-10-15T21:41:16Z)
- Speakers Fill Lexical Semantic Gaps with Context [65.08205006886591]
We operationalise the lexical ambiguity of a word as the entropy of meanings it can take.
We find significant correlations between our estimate of ambiguity and the number of synonyms a word has in WordNet.
This suggests that, in the presence of ambiguity, speakers compensate by making contexts more informative.
arXiv Detail & Related papers (2020-10-05T17:19:10Z)
- A Corpus for Large-Scale Phonetic Typology [112.19288631037055]
We present VoxClamantis v1.0, the first large-scale corpus for phonetic typology.
It provides aligned segments and estimated phoneme-level labels in 690 readings spanning 635 languages, along with acoustic-phonetic measures of vowels and sibilants.
arXiv Detail & Related papers (2020-05-28T13:03:51Z)
- Detect Language of Transliterated Texts [0.0]
Informal transliteration from other languages to English is prevalent in social media threads, instant messaging, and discussion forums.
We propose a Language Identification (LID) system, with an approach for feature extraction.
We tokenize the words into phonetic syllables and use a simple Long Short-term Memory (LSTM) network architecture to detect the language of transliterated texts.
arXiv Detail & Related papers (2020-04-26T10:28:02Z)
- Universal Phone Recognition with a Multilingual Allophone System [135.2254086165086]
We propose a joint model of language-independent phone and language-dependent phoneme distributions.
In multilingual ASR experiments over 11 languages, we find that this model improves testing performance by 2% phoneme error rate absolute.
Our recognizer achieves phone accuracy improvements of more than 17%, moving a step closer to speech recognition for all languages in the world.
arXiv Detail & Related papers (2020-02-26T21:28:57Z)
- Towards Zero-shot Learning for Automatic Phonemic Transcription [82.9910512414173]
A more challenging problem is to build phonemic transcribers for languages with zero training data.
Our model is able to recognize unseen phonemes in the target language without any training data.
It achieves 7.7% better phoneme error rate on average over a standard multilingual model.
arXiv Detail & Related papers (2020-02-26T20:38:42Z)
- An efficient automated data analytics approach to large scale computational comparative linguistics [0.0]
This research project aimed to overcome the challenge of analysing human language relationships.
It developed automated comparison techniques based on the phonetic representation of certain key words and concepts.
It led to the development of a workflow which was later implemented by combining Unix shell scripts, a developed R package and SWI Prolog.
arXiv Detail & Related papers (2020-01-31T15:25:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.