Unsupervised Classification of English Words Based on Phonological Information: Discovery of Germanic and Latinate Clusters
- URL: http://arxiv.org/abs/2504.11770v1
- Date: Wed, 16 Apr 2025 05:20:08 GMT
- Title: Unsupervised Classification of English Words Based on Phonological Information: Discovery of Germanic and Latinate Clusters
- Authors: Takashi Morita, Timothy J. O'Donnell
- Abstract summary: Cross-linguistically, native words and loanwords follow different phonological rules. The Germanic-Latinate distinction in the English lexicon is learnable from the phonotactic information of individual words.
- Score: 9.220284665192663
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-linguistically, native words and loanwords follow different phonological rules. In English, for example, words of Germanic and Latinate origin exhibit different stress patterns, and a certain syntactic structure is exclusive to Germanic verbs. Viewed as a cognitive model, however, such etymology-based generalizations face challenges in terms of learnability, since the historical origins of words are presumably inaccessible to general language learners. In this study, we present computational evidence indicating that the Germanic-Latinate distinction in the English lexicon is learnable from the phonotactic information of individual words. Specifically, we performed unsupervised clustering on corpus-extracted words, and the resulting word clusters largely aligned with the etymological distinction. The model-discovered clusters also recovered various linguistic generalizations documented in the previous literature regarding the corresponding etymological classes. Moreover, our findings uncovered previously unrecognized features of the quasi-etymological clusters, offering novel hypotheses for future experimental studies.
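The clustering step described in the abstract can be illustrated with a minimal, hypothetical sketch: words are mapped to bigram count vectors (a crude stand-in for the phonotactic features the paper extracts from phonological transcriptions) and partitioned with a simple two-means loop. The word list, the feature choice, and the seeding here are all illustrative assumptions, not the paper's actual model.

```python
# Hypothetical sketch of phonotactic clustering, NOT the paper's model:
# represent each word as a character-bigram count vector and split the
# vocabulary into two clusters with a hand-rolled 2-means loop.
from collections import Counter

def bigram_features(word, vocab):
    """Count vector over a fixed bigram vocabulary."""
    counts = Counter(zip(word, word[1:]))
    return [counts.get(b, 0) for b in vocab]

# Toy word list (orthographic, for illustration only).
words = ["begin", "forget", "understand",   # Germanic-origin examples
         "describe", "permit", "contain"]   # Latinate-origin examples
vocab = sorted({b for w in words for b in zip(w, w[1:])})
vecs = [bigram_features(w, vocab) for w in words]

def dist(a, b):
    """Squared Euclidean distance between two count vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# 2-means: seed with the first and last word, then iterate assignments.
centers = [vecs[0][:], vecs[-1][:]]
for _ in range(10):
    groups = [[], []]
    for v in vecs:
        groups[0 if dist(v, centers[0]) <= dist(v, centers[1]) else 1].append(v)
    for k, g in enumerate(groups):
        if g:  # recompute each non-empty cluster's centroid
            centers[k] = [sum(col) / len(g) for col in zip(*g)]

labels = [0 if dist(v, centers[0]) <= dist(v, centers[1]) else 1 for v in vecs]
print(dict(zip(words, labels)))
```

In the paper the input is phonemic rather than orthographic and the clustering model is more sophisticated, so this toy separation should not be read as reproducing the reported Germanic/Latinate split.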
Related papers
- Extracting Lexical Features from Dialects via Interpretable Dialect Classifiers [43.756851270091516]
We present a novel approach to extract distinguishing lexical features of dialects by utilizing interpretable dialect classifiers.
We experimentally demonstrate that our method successfully identifies key language-specific lexical features that contribute to dialectal variations.
arXiv Detail & Related papers (2024-02-27T22:06:55Z)
- Patterns of Persistence and Diffusibility across the World's Languages [3.7055269158186874]
Colexification is a type of similarity where a single lexical form is used to convey multiple meanings.
We shed light on the linguistic causes of cross-lingual similarity in colexification and phonology.
We construct large-scale graphs incorporating semantic, genealogical, phonological and geographical data for 1,966 languages.
arXiv Detail & Related papers (2024-01-03T12:05:38Z)
- Patterns of Closeness and Abstractness in Colexifications: The Case of Indigenous Languages in the Americas [3.7055269158186874]
Colexification refers to linguistic phenomena where multiple concepts (meanings) are expressed by the same lexical form.
In this paper, we hypothesize that concepts that are closer in concreteness/abstractness are more likely to colexify, and we test the hypothesis across indigenous languages in the Americas.
arXiv Detail & Related papers (2023-12-18T10:06:50Z)
- Analogy in Contact: Modeling Maltese Plural Inflection [4.83828446399992]
We quantify the extent to which the phonology and etymology of a Maltese singular noun may predict the morphological process.
The results indicate that phonological pressures shape the organization of the Maltese lexicon with predictive power.
arXiv Detail & Related papers (2023-05-20T20:16:57Z)
- Decomposing lexical and compositional syntax and semantics with deep language models [82.81964713263483]
The activations of language transformers like GPT2 have been shown to linearly map onto brain activity during speech comprehension.
Here, we propose a taxonomy to factorize the high-dimensional activations of language models into four classes: lexical, compositional, syntactic, and semantic representations.
The results highlight two findings. First, compositional representations recruit a more widespread cortical network than lexical ones, and encompass the bilateral temporal, parietal and prefrontal cortices.
arXiv Detail & Related papers (2021-03-02T10:24:05Z)
- Disambiguatory Signals are Stronger in Word-initial Positions [48.18148856974974]
We point out the confounds in existing methods for comparing the informativeness of segments early in the word versus later in the word.
We find evidence across hundreds of languages that indeed there is a cross-linguistic tendency to front-load information in words.
arXiv Detail & Related papers (2021-02-03T18:19:16Z)
- Lexical semantic change for Ancient Greek and Latin [61.69697586178796]
Associating a word's correct meaning in its historical context is a central challenge in diachronic research.
We build on a recent computational approach to semantic change based on a dynamic Bayesian mixture model.
We provide a systematic comparison of dynamic Bayesian mixture models for semantic change with state-of-the-art embedding-based models.
arXiv Detail & Related papers (2021-01-22T12:04:08Z)
- Rediscovering the Slavic Continuum in Representations Emerging from Neural Models of Spoken Language Identification [16.369477141866405]
We present a neural model for Slavic language identification in speech signals.
We analyze its emergent representations to investigate whether they reflect objective measures of language relatedness.
arXiv Detail & Related papers (2020-10-22T18:18:19Z)
- The Typology of Polysemy: A Multilingual Distributional Framework [6.753781783859273]
We present a novel framework that quantifies semantic affinity, the cross-linguistic similarity of lexical semantics for a concept.
Our results reveal an intricate interaction between semantic domains and extra-linguistic factors, beyond language phylogeny.
arXiv Detail & Related papers (2020-06-02T22:31:40Z)
- In search of isoglosses: continuous and discrete language embeddings in Slavic historical phonology [0.0]
We employ three different types of language embeddings (dense, sigmoid, and straight-through).
We find that the Straight-Through model outperforms the other two in terms of accuracy, but the Sigmoid model's language embeddings show the strongest agreement with the traditional subgrouping of the Slavic languages.
arXiv Detail & Related papers (2020-05-27T18:10:46Z)
- Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers.
We show that some features from various linguistic types can be predicted reliably.
arXiv Detail & Related papers (2020-04-30T21:00:53Z)
- Evaluating Transformer-Based Multilingual Text Classification [55.53547556060537]
We argue that NLP tools perform unequally across languages with different syntactic and morphological structures.
We calculate word order and morphological similarity indices to aid our empirical study.
arXiv Detail & Related papers (2020-04-29T03:34:53Z)
- Where New Words Are Born: Distributional Semantic Analysis of Neologisms and Their Semantic Neighborhoods [51.34667808471513]
We investigate the importance of two factors, semantic sparsity and frequency growth rates of semantic neighbors, formalized in the distributional semantics paradigm.
We show that both factors are predictive of word emergence, although we find more support for the latter hypothesis.
arXiv Detail & Related papers (2020-01-21T19:09:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed (including all summaries) and is not responsible for any consequences of its use.