Study of scaling laws in language families
- URL: http://arxiv.org/abs/2504.01681v1
- Date: Wed, 02 Apr 2025 12:28:59 GMT
- Title: Study of scaling laws in language families
- Authors: Maelyson R. F. Santos, Marcelo A. F. Gomes
- Abstract summary: This article investigates scaling laws within language families using data from over six thousand languages. It analyzes emergent patterns observed in Zipf-like classification graphs.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This article investigates scaling laws within language families using data from over six thousand languages, analyzing emergent patterns observed in Zipf-like classification graphs. Both macroscopic (based on the number of languages per family) and microscopic (based on the number of speakers per language within a family) aspects of these classifications are examined. Particularly noteworthy is the discovery of a distinct division among the fourteen largest contemporary language families, excluding the Afro-Asiatic and Nilo-Saharan languages. These families are distributed across three language-family quadruplets, each characterized by significantly different exponents in the Zipf graphs. This finding sheds light on the underlying structure and organization of major language families, offering insights into the nature of linguistic diversity and distribution.
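The Zipf-graph exponents the abstract refers to can, in the simplest case, be estimated by a least-squares fit in log-log space over rank-ordered sizes (languages per family, or speakers per language within a family). The sketch below is illustrative only: the `zipf_exponent` helper and the synthetic rank-size data are assumptions for demonstration, not the authors' method or data.

```python
import math

def zipf_exponent(sizes):
    """Least-squares slope of log(size) vs. log(rank) for rank-ordered data.

    Returns the positive exponent alpha in size(r) ~ r^(-alpha).
    """
    sizes = sorted(sizes, reverse=True)          # rank 1 = largest item
    xs = [math.log(r) for r in range(1, len(sizes) + 1)]
    ys = [math.log(s) for s in sizes]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope                                # flip sign: decay exponent

# Synthetic rank-size data generated with a known exponent of 1.5.
alpha = 1.5
synthetic = [1_000_000 * r ** (-alpha) for r in range(1, 101)]
print(round(zipf_exponent(synthetic), 3))  # → 1.5
```

On real family-size data the fit is noisier, and the paper's grouping of families into quadruplets rests on comparing such fitted exponents across families.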
Related papers
- Exploring language relations through syntactic distances and geographic proximity [0.4369550829556578]
We explore linguistic distances using part-of-speech (POS) series extracted from the Universal Dependencies dataset.
We find definite clusters that correspond to well-known language families and groups, with exceptions explained by distinct morphological typologies.
arXiv Detail & Related papers (2024-03-27T10:36:17Z)
- Patterns of Persistence and Diffusibility across the World's Languages [3.7055269158186874]
Colexification is a type of similarity where a single lexical form is used to convey multiple meanings.
We shed light on the linguistic causes of cross-lingual similarity in colexification and phonology.
We construct large-scale graphs incorporating semantic, genealogical, phonological and geographical data for 1,966 languages.
arXiv Detail & Related papers (2024-01-03T12:05:38Z)
- Clustering Pseudo Language Family in Multilingual Translation Models with Fisher Information Matrix [22.891944602891428]
Clustering languages based solely on their ancestral families can yield suboptimal results.
We propose an innovative method that leverages the Fisher information matrix (FIM) to cluster language families.
We provide an in-depth discussion regarding the inception and application of these pseudo language families.
arXiv Detail & Related papers (2023-12-05T15:03:27Z)
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work lays the foundation for furthering the field of dialectal NLP by documenting evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
- Colexifications for Bootstrapping Cross-lingual Datasets: The Case of Phonology, Concreteness, and Affectiveness [6.790979602996742]
Colexification refers to the linguistic phenomenon where a single lexical form is used to convey multiple meanings.
We showcase curation procedures which result in a dataset covering 142 languages from 21 language families across the world.
The dataset includes ratings of concreteness and affectiveness, mapped with phonemes and phonological features.
arXiv Detail & Related papers (2023-06-05T07:32:21Z)
- The Geometry of Multilingual Language Models: An Equality Lens [2.6746119935689214]
We analyze the geometry of three multilingual language models in Euclidean space.
Using a geometric separability index, we find that although languages tend to lie closer to members of their own linguistic family, they remain almost separable from languages of other families.
arXiv Detail & Related papers (2023-05-13T05:19:15Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- Automatically Identifying Language Family from Acoustic Examples in Low Resource Scenarios [48.57072884674938]
We propose a method to analyze language similarity using deep learning.
Namely, we train a model on the Wilderness dataset and investigate how its latent space compares with classical language family findings.
arXiv Detail & Related papers (2020-12-01T22:44:42Z)
- Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers.
We show that some features from various linguistic types can be predicted reliably.
arXiv Detail & Related papers (2020-04-30T21:00:53Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.