Phonotactic Complexity across Dialects
- URL: http://arxiv.org/abs/2402.12998v1
- Date: Tue, 20 Feb 2024 13:25:39 GMT
- Title: Phonotactic Complexity across Dialects
- Authors: Ryan Soh-Eun Shim, Kalvin Chang, David R. Mortensen
- Abstract summary: Received wisdom in linguistic typology holds that if the structure of a language becomes more complex in one dimension, it will simplify in another.
We study this claim on a micro-level, using a tightly-controlled sample of Dutch dialects (across 366 collection sites) and Min dialects (across 60 sites).
We find empirical evidence for a tradeoff between word length and a computational measure of phonotactic complexity from an LSTM-based phone-level language model.
- Score: 9.169501109658675
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Received wisdom in linguistic typology holds that if the structure of a
language becomes more complex in one dimension, it will simplify in another,
building on the assumption that all languages are equally complex (Joseph and
Newmeyer, 2012). We study this claim on a micro-level, using a
tightly-controlled sample of Dutch dialects (across 366 collection sites) and
Min dialects (across 60 sites), which enables a fairer comparison across
varieties. Even at the dialect level, we find empirical evidence for a tradeoff
between word length and a computational measure of phonotactic complexity from
an LSTM-based phone-level language model, a result previously documented only at
the language level. A generalized additive model (GAM) shows that dialects with
low phonotactic complexity concentrate around the capital regions, a pattern we
hypothesize is consistent with prior hypotheses that language varieties spoken
by larger or more diverse populations show reduced phonotactic complexity. We
also experiment with incorporating the auxiliary task of predicting syllable
constituency, but do not find that it strengthens the observed negative correlation.
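The complexity measure is information-theoretic: a phone-level language model is trained per dialect, and each word is scored by its average per-phone surprisal in bits. Below is a minimal sketch of that idea, not the authors' released code; the phone inventory, toy lexicon, and hyperparameters are invented for illustration.

```python
import math
import torch
import torch.nn as nn

# Toy phone inventory and per-site lexicon -- invented stand-ins for the
# dialect atlas transcriptions used in the paper.
PHONES = ["<pad>", "<bos>", "<eos>", "a", "e", "i", "k", "n", "s", "t"]
IDX = {p: i for i, p in enumerate(PHONES)}
lexicon = ["kat", "sne", "tin", "ast", "nek", "is"]

class PhoneLM(nn.Module):
    """Phone-level LSTM language model (one would be trained per site)."""
    def __init__(self, vocab, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim, padding_idx=IDX["<pad>"])
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, x):
        h, _ = self.lstm(self.emb(x))
        return self.out(h)

def encode(word):
    return torch.tensor([[IDX["<bos>"], *(IDX[p] for p in word), IDX["<eos>"]]])

def bits_per_phone(model, word):
    """Average surprisal (in bits) of a word's phones under the LM."""
    ids = encode(word)
    with torch.no_grad():
        logp = torch.log_softmax(model(ids[:, :-1]), dim=-1)
    nll = -logp.gather(2, ids[:, 1:].unsqueeze(-1)).sum()
    return nll.item() / math.log(2) / (ids.size(1) - 1)

model = PhoneLM(len(PHONES))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss(ignore_index=IDX["<pad>"])
for _ in range(200):  # tiny next-phone training loop
    for w in lexicon:
        ids = encode(w)
        loss = loss_fn(model(ids[:, :-1]).squeeze(0), ids[0, 1:])
        opt.zero_grad()
        loss.backward()
        opt.step()

complexity = sum(bits_per_phone(model, w) for w in lexicon) / len(lexicon)
mean_len = sum(len(w) for w in lexicon) / len(lexicon)
print(f"complexity: {complexity:.2f} bits/phone, mean length: {mean_len:.2f}")
```

Averaging bits-per-phone over a site's lexicon gives one complexity value per site; correlating those values with mean word length across sites (e.g. Pearson's r) is what yields the reported tradeoff. The geographic analysis can likewise be sketched as a GAM with a smooth surface over site coordinates; here the pygam library and randomly placed sites are assumed stand-ins, not the paper's actual data or tooling.

```python
import numpy as np
from pygam import LinearGAM, te

# Fabricated site coordinates (roughly spanning the Netherlands) and
# placeholder per-site complexity scores -- not the paper's data.
rng = np.random.default_rng(0)
lon = rng.uniform(3.3, 7.2, 366)
lat = rng.uniform(50.7, 53.5, 366)
complexity = rng.normal(3.0, 0.3, 366)  # bits/phone per collection site

# Fit complexity as a smooth tensor-product surface over (lon, lat);
# predicting on a lon/lat grid would reveal regional clusters of low complexity.
X = np.column_stack([lon, lat])
gam = LinearGAM(te(0, 1)).fit(X, complexity)
gam.summary()
```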
Related papers
- Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts.
We find that Llama Instruct and Mistral models exhibit high degrees of language confusion.
We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
- Modeling Orthographic Variation in Occitan's Dialects [3.038642416291856]
Our findings suggest that large multilingual models minimize the need for spelling normalization during pre-processing.
arXiv Detail & Related papers (2024-04-30T07:33:51Z)
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work lays a foundation for furthering the field of dialectal NLP by documenting clear performance disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models [57.225289079198454]
We propose mPLM-Sim, a language similarity measure that induces similarities across languages from mPLMs using multi-parallel corpora.
Our study shows that mPLM-Sim exhibits moderately high correlations with linguistic similarity measures, such as lexicostatistics, genealogical language family, and geographical sprachbund.
We further investigate whether mPLM-Sim is effective for zero-shot cross-lingual transfer by conducting experiments on both low-level syntactic tasks and high-level semantic tasks.
arXiv Detail & Related papers (2023-05-23T04:44:26Z)
- Cross-Lingual Transfer of Cognitive Processing Complexity [11.939409227407769]
We use sentence-level eye-tracking patterns as a cognitive indicator for structural complexity.
We show that the multilingual model XLM-RoBERTa can successfully predict varied patterns for 13 typologically diverse languages.
arXiv Detail & Related papers (2023-02-24T15:48:23Z)
- Investigating the Impact of Cross-lingual Acoustic-Phonetic Similarities on Multilingual Speech Recognition [31.575930914290762]
A novel data-driven approach is proposed to investigate the cross-lingual acoustic-phonetic similarities.
Deep neural networks are trained as mapping networks to transform the distributions from different acoustic models into a directly comparable form.
A relative improvement of 8% over the monolingual counterpart is achieved.
arXiv Detail & Related papers (2022-07-07T15:55:41Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and call each group a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- Neural Polysynthetic Language Modelling [15.257624461339867]
In high-resource languages, a common approach is to treat morphologically-distinct variants of a common root as completely independent word types.
This assumes that there are limited inflections per root and that the majority will appear in a large enough corpus.
We examine the current state-of-the-art in language modelling, machine translation, and text prediction for four polysynthetic languages.
arXiv Detail & Related papers (2020-05-11T22:57:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.