Patterns of Persistence and Diffusibility across the World's Languages
- URL: http://arxiv.org/abs/2401.01698v2
- Date: Fri, 5 Jan 2024 15:33:40 GMT
- Title: Patterns of Persistence and Diffusibility across the World's Languages
- Authors: Yiyi Chen, Johannes Bjerva
- Abstract summary: Colexification is a type of similarity where a single lexical form is used to convey multiple meanings.
We shed light on the linguistic causes of cross-lingual similarity in colexification and phonology.
We construct large-scale graphs incorporating semantic, genealogical, phonological and geographical data for 1,966 languages.
- Score: 3.7055269158186874
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Language similarities can be caused by genetic relatedness, areal contact,
universality, or chance. Colexification, i.e. a type of similarity where a
single lexical form is used to convey multiple meanings, is underexplored. In
our work, we shed light on the linguistic causes of cross-lingual similarity in
colexification and phonology, by exploring genealogical stability (persistence)
and contact-induced change (diffusibility). We construct large-scale graphs
incorporating semantic, genealogical, phonological and geographical data for
1,966 languages. We then show the potential of this resource, by investigating
several established hypotheses from previous work in linguistics, while
proposing new ones. Our results strongly support a previously established
hypothesis in the linguistic literature, while offering contradicting evidence
to another. Our large-scale resource opens up further research across
disciplines, e.g. in multilingual NLP and comparative linguistics.
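To make the definition of colexification in the abstract concrete, here is a minimal sketch that finds concept pairs sharing one word form within a language. The lexicon, the concept labels, and the function name `colexifications` are illustrative assumptions and are not taken from the paper's data or code.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy lexicon (not the paper's data): (language, concept) -> word form.
TOY_LEXICON = {
    ("spanish", "FINGER"): "dedo",
    ("spanish", "TOE"): "dedo",
    ("english", "FINGER"): "finger",
    ("english", "TOE"): "toe",
    ("russian", "HAND"): "ruka",
    ("russian", "ARM"): "ruka",
}


def colexifications(lexicon):
    """Map each language to the set of concept pairs expressed by a single form."""
    # Group concepts by (language, form) so shared forms end up together.
    by_form = defaultdict(set)
    for (language, concept), form in lexicon.items():
        by_form[(language, form)].add(concept)

    # Every pair of concepts under one form is a colexification in that language.
    result = defaultdict(set)
    for (language, _form), concepts in by_form.items():
        for pair in combinations(sorted(concepts), 2):
            result[language].add(pair)
    return dict(result)


if __name__ == "__main__":
    for language, pairs in sorted(colexifications(TOY_LEXICON).items()):
        print(language, sorted(pairs))
```

In this toy data, Spanish colexifies FINGER and TOE and Russian colexifies HAND and ARM, while English has no shared forms and therefore does not appear in the result.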
Related papers
- Exploring language relations through syntactic distances and geographic proximity [0.4369550829556578]
We explore linguistic distances using series of parts of speech (POS) extracted from the Universal Dependencies dataset.
We find definite clusters that correspond to well-known language families and groups, with exceptions explained by distinct morphological typologies.
arXiv Detail & Related papers (2024-03-27T10:36:17Z)
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
- The Geometry of Multilingual Language Models: An Equality Lens [2.6746119935689214]
We analyze the geometry of three multilingual language models in Euclidean space.
Using a geometric separability index, we find that although languages tend to be closer according to their linguistic family, they are almost separable from languages of other families.
arXiv Detail & Related papers (2023-05-13T05:19:15Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers.
We show that some features from various linguistic types can be predicted reliably.
arXiv Detail & Related papers (2020-04-30T21:00:53Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
- Analyzing the Surprising Variability in Word Embedding Stability Across Languages [46.84861591608146]
We discuss linguistic properties that are related to stability, drawing out insights about correlations with affixing, language gender systems, and other features.
This has implications for embedding use, particularly in research that uses them to study language trends.
arXiv Detail & Related papers (2020-04-30T15:24:43Z)
- On the coexistence of competing languages [0.0]
We revisit the question of language competition, with an emphasis on uncovering the ways in which coexistence might emerge.
We find that this emergence is related to symmetry breaking, and explore two particular scenarios.
For each of these, the investigation of paradigmatic situations leads us to a quantitative understanding of the conditions leading to language coexistence.
arXiv Detail & Related papers (2020-03-10T14:06:55Z)
- Where New Words Are Born: Distributional Semantic Analysis of Neologisms and Their Semantic Neighborhoods [51.34667808471513]
We investigate the importance of two factors, semantic sparsity and frequency growth rates of semantic neighbors, formalized in the distributional semantics paradigm.
We show that both factors are predictive of word emergence, although we find more support for the latter hypothesis.
arXiv Detail & Related papers (2020-01-21T19:09:49Z)