From Isolates to Families: Using Neural Networks for Automated Language Affiliation
- URL: http://arxiv.org/abs/2502.11688v1
- Date: Mon, 17 Feb 2025 11:25:32 GMT
- Authors: Frederic Blum, Steffen Herbold, Johann-Mattis List
- Abstract summary: In historical linguistics, the affiliation of languages to a common language family is traditionally carried out using a complex workflow.
Large-scale standardized collections of multilingual wordlists and grammatical language structures might help to improve this process and open new avenues for developing automated language affiliation workflows.
We present neural network models that use lexical and grammatical data from a worldwide sample of more than 1,000 languages with known affiliations to classify individual languages into families.
- Abstract: In historical linguistics, the affiliation of languages to a common language family is traditionally carried out using a complex workflow that relies on manually comparing individual languages. Large-scale standardized collections of multilingual wordlists and grammatical language structures might help to improve this and open new avenues for developing automated language affiliation workflows. Here, we present neural network models that use lexical and grammatical data from a worldwide sample of more than 1,000 languages with known affiliations to classify individual languages into families. In line with the traditional assumption of most linguists, our results show that models trained on lexical data alone outperform models solely based on grammatical data, whereas combining both types of data yields even better performance. In additional experiments, we show how our models can identify long-ranging relations between entire subgroups, how they can be employed to investigate potential relatives of linguistic isolates, and how they can help us to obtain first hints on the affiliation of so far unaffiliated languages. We conclude that models for automated language affiliation trained on lexical and grammatical data provide comparative linguists with a valuable tool for evaluating hypotheses about deep and unknown language relations.
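To make the setup concrete, the sketch below shows one plausible shape for such a model: a feed-forward network with one encoder per data type (a lexical and a grammatical feature vector per language), fused by concatenation before a softmax over family labels. This is a minimal illustration only; the class name, feature encodings, dimensions, and architecture details are assumptions for the example, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): classify languages into families
# from two feature vectors per language. All dimensions are hypothetical.
import torch
import torch.nn as nn

class LanguageAffiliationNet(nn.Module):
    def __init__(self, n_lex, n_gram, n_families, hidden=256):
        super().__init__()
        # one encoder per data type, fused by concatenation
        self.lex = nn.Sequential(nn.Linear(n_lex, hidden), nn.ReLU(), nn.Dropout(0.3))
        self.gram = nn.Sequential(nn.Linear(n_gram, hidden), nn.ReLU(), nn.Dropout(0.3))
        self.head = nn.Linear(2 * hidden, n_families)

    def forward(self, x_lex, x_gram):
        h = torch.cat([self.lex(x_lex), self.gram(x_gram)], dim=-1)
        return self.head(h)  # logits over language families

# Toy usage with made-up sizes: a batch of 32 languages, ~100 families.
model = LanguageAffiliationNet(n_lex=4096, n_gram=195, n_families=100)
x_lex = torch.rand(32, 4096)   # e.g., bag of sound-class n-grams from wordlists
x_gram = torch.rand(32, 195)   # e.g., binary grammatical/typological features
loss = nn.CrossEntropyLoss()(model(x_lex, x_gram), torch.randint(0, 100, (32,)))
loss.backward()
```

Training the same network with only one of the two encoders yields the lexical-only and grammatical-only baselines that the abstract compares against the combined model.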
Related papers
- Small Language Models Also Work With Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas
We show that tokenization-free, phoneme- and grapheme-based language models can achieve strong linguistic performance.
Our findings suggest a promising direction for creating more linguistically plausible language models.
arXiv Detail & Related papers (2024-10-02T12:36:08Z)
- Learning Phonotactics from Linguistic Informants
Our model iteratively selects or synthesizes a data-point according to one of a range of information-theoretic policies.
We find that the information-theoretic policies that our model uses to select items to query the informant achieve sample efficiency comparable to, or greater than, fully supervised approaches.
arXiv Detail & Related papers (2024-05-08T00:18:56Z)
- Language Embeddings Sometimes Contain Typological Generalizations
We train neural models for a range of natural language processing tasks on a massively multilingual dataset of Bible translations in 1,295 languages.
The learned language representations are then compared to existing typological databases as well as to a novel set of quantitative syntactic and morphological features.
We conclude that some generalizations are surprisingly close to traditional features from linguistic typology, but that most models, as well as those of previous work, do not appear to have made linguistically meaningful generalizations.
arXiv Detail & Related papers (2023-01-19T15:09:59Z)
- Cross-Lingual Fine-Grained Entity Typing
We present a unified cross-lingual fine-grained entity typing model capable of handling over 100 languages.
We analyze this model's ability to generalize to languages and entities unseen during training.
arXiv Detail & Related papers (2021-10-15T03:22:30Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers.
We show that some features across various linguistic categories can be predicted reliably.
arXiv Detail & Related papers (2020-04-30T21:00:53Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
- An Empirical Study of Factors Affecting Language-Independent Models
We show that language-independent models can be comparable to, or even outperform, models trained on monolingual data.
We experiment with language-independent models across many different languages and show that they are more suitable for typologically similar languages.
arXiv Detail & Related papers (2019-12-30T22:41:57Z)