Subword-Based Comparative Linguistics across 242 Languages Using Wikipedia Glottosets
- URL: http://arxiv.org/abs/2601.18791v1
- Date: Mon, 26 Jan 2026 18:55:28 GMT
- Title: Subword-Based Comparative Linguistics across 242 Languages Using Wikipedia Glottosets
- Authors: Iaroslav Chelombitko, Mika Hämäläinen, Aleksey Komissarov
- Abstract summary: We present a large-scale comparative study of 242 Latin and Cyrillic-script languages using subword-based methodologies. Our approach utilizes Wikipedia rank-based subword vectors to analyze vocabulary, lexical divergence, and language similarity at scale.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a large-scale comparative study of 242 Latin and Cyrillic-script languages using subword-based methodologies. By constructing 'glottosets' from Wikipedia lexicons, we introduce a framework for simultaneous cross-linguistic comparison via Byte-Pair Encoding (BPE). Our approach utilizes rank-based subword vectors to analyze vocabulary overlap, lexical divergence, and language similarity at scale. Evaluations demonstrate that BPE segmentation aligns with morpheme boundaries 95% better than a random baseline across 15 languages (F1 = 0.34 vs 0.15). BPE vocabulary similarity correlates significantly with genetic language relatedness (Mantel r = 0.329, p < 0.001), with Romance languages forming the tightest cluster (mean distance 0.51) and cross-family pairs showing clear separation (0.82). Analysis of 26,939 cross-linguistic homographs reveals that 48.7% receive different segmentations across related languages, with variation correlating to phylogenetic distance. Our results provide quantitative macro-linguistic insights into lexical patterns across typologically diverse languages within a unified analytical framework.
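As a rough illustration of the vocabulary-similarity idea described above (not the authors' released code), the sketch below learns a minimal BPE subword vocabulary per language from toy word-frequency tables and compares the two vocabularies with Jaccard distance. The corpora, language labels, and merge count are invented for the example.

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Learn BPE merges from a word->frequency dict; return the subword vocabulary."""
    corpus = {tuple(w): f for w, f in word_freqs.items()}  # words as symbol tuples
    vocab = set()
    for syms in corpus:
        vocab.update(syms)
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for syms, freq in corpus.items():
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is a single symbol; nothing left to merge
        best = max(pairs, key=pairs.get)
        merged = best[0] + best[1]
        vocab.add(merged)
        # Apply the merge everywhere it occurs.
        new_corpus = {}
        for syms, freq in corpus.items():
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            key = tuple(out)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return vocab

def jaccard(v1, v2):
    """Jaccard similarity between two subword vocabularies."""
    return len(v1 & v2) / len(v1 | v2)

# Invented toy word frequencies standing in for two related languages.
es = {"nacional": 3, "nacionales": 2, "internacional": 2}
pt = {"nacional": 3, "nacionais": 2, "internacional": 2}
distance = 1 - jaccard(train_bpe(es, 20), train_bpe(pt, 20))
print(distance)  # in [0, 1]; lower means more shared subwords
```

Related languages share many stems and affixes, so their BPE vocabularies overlap heavily and the distance stays small, which is the intuition behind the cluster distances reported in the abstract.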
Related papers
- Universal Conceptual Structure in Neural Translation: Probing NLLB-200's Multilingual Geometry
We investigate the representation geometry of Meta's NLLB-200, a 200-language encoder-decoder Transformer.
We find that the model's embedding distances significantly correlate with phylogenetic distances from the Automated Similarity Judgment Program.
We release InterpretCognates, an open-source interactive toolkit for exploring these phenomena.
arXiv Detail & Related papers (2026-02-27T22:51:01Z)
- Neighbors and relatives: How do speech embeddings reflect linguistic connections across the world?
This study employs embeddings from the fine-tuned XLS-R self-supervised language identification model vox107-xls-r-300m-wav2vec to analyze relationships between 106 world languages.
Using linear discriminant analysis (LDA), language embeddings are clustered and compared with genealogical, lexical, and geographical distances.
The results demonstrate that embedding-based distances align closely with traditional measures, effectively capturing both global and local typological patterns.
arXiv Detail & Related papers (2025-06-10T08:33:34Z)
- Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models
This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in Large Language Models (LLMs).
We construct a new dataset of over 6,000 bilingual pairs across 16 languages using this methodology, demonstrating its effectiveness in revealing weaknesses even in state-of-the-art models.
Further experiments investigate the relationship between linguistic similarity and cross-lingual weaknesses, revealing that linguistically related languages share similar performance patterns.
arXiv Detail & Related papers (2025-05-24T12:31:27Z)
- Exploring language relations through syntactic distances and geographic proximity
We explore linguistic distances using sequences of part-of-speech (POS) tags extracted from the Universal Dependencies dataset.
We find definite clusters that correspond to well-known language families and groups, with exceptions explained by distinct morphological typologies.
arXiv Detail & Related papers (2024-03-27T10:36:17Z)
- Sentiment Classification of Code-Switched Text using Pre-trained Multilingual Embeddings and Segmentation
We propose a multi-step natural language processing algorithm for code-switched sentiment analysis.
The proposed algorithm can be expanded for sentiment analysis of multiple languages with limited human expertise.
arXiv Detail & Related papers (2022-10-29T01:52:25Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- Phonotactic Complexity and its Trade-offs
This simple measure, bits per phoneme, allows us to compare phonotactic entropy across languages.
We demonstrate a very strong negative correlation of -0.74 between bits per phoneme and the average length of words.
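A negative correlation of this kind can be computed in miniature: the sketch below evaluates a plain Pearson correlation between bits per phoneme and mean word length for a handful of hypothetical languages. The numbers are illustrative only, not the paper's data.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative values only, NOT the paper's data: more bits per phoneme
# tends to go with shorter words under the trade-off hypothesis.
bits_per_phoneme = [3.1, 3.4, 3.8, 4.0, 4.3]
mean_word_length = [7.9, 7.1, 6.2, 5.8, 5.1]
print(pearson(bits_per_phoneme, mean_word_length))  # strongly negative
```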
arXiv Detail & Related papers (2020-05-07T21:36:59Z)
- Evaluating Transformer-Based Multilingual Text Classification
We argue that NLP tools perform unequally across languages with different syntactic and morphological structures.
We calculate word order and morphological similarity indices to aid our empirical study.
arXiv Detail & Related papers (2020-04-29T03:34:53Z)
- Robust Cross-lingual Embeddings from Parallel Sentences
We propose a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word representations.
Our approach significantly improves cross-lingual sentence retrieval performance over all other approaches.
It also achieves parity with a deep RNN method on a zero-shot cross-lingual document classification task.
arXiv Detail & Related papers (2019-12-28T16:18:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.