Entropy2Vec: Crosslingual Language Modeling Entropy as End-to-End Learnable Language Representations
- URL: http://arxiv.org/abs/2509.05060v1
- Date: Fri, 05 Sep 2025 12:40:31 GMT
- Title: Entropy2Vec: Crosslingual Language Modeling Entropy as End-to-End Learnable Language Representations
- Authors: Patrick Amadeus Irawan, Ryandito Diandaru, Belati Jagad Bintang Syuhada, Randy Zakya Suchrady, Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya
- Abstract summary: We introduce Entropy2Vec, a framework for deriving cross-lingual language representations by leveraging the entropy of monolingual language models. By training a language model on a single language, we hypothesize that the entropy of its predictions reflects its structural similarity to other languages. This approach yields dense, non-sparse language embeddings that are adaptable to different timeframes and free from missing values.
- Score: 33.52308723119687
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Entropy2Vec, a novel framework for deriving cross-lingual language representations by leveraging the entropy of monolingual language models. Unlike traditional typological inventories that suffer from feature sparsity and static snapshots, Entropy2Vec uses the inherent uncertainty in language models to capture typological relationships between languages. By training a language model on a single language, we hypothesize that the entropy of its predictions reflects its structural similarity to other languages: low entropy indicates high similarity, while high entropy suggests greater divergence. This approach yields dense, non-sparse language embeddings that are adaptable to different timeframes and free from missing values. Empirical evaluations demonstrate that Entropy2Vec embeddings align with established typological categories and achieve competitive performance in downstream multilingual NLP tasks, such as those addressed by the LinguAlchemy framework.
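The core idea can be illustrated with a minimal sketch: train a language model on one language, then measure the average prediction entropy it assigns to text in other languages, using that entropy as a (dis)similarity feature. The toy character-bigram "language model" and all names below are illustrative assumptions, not the paper's actual implementation, which uses neural language models.

```python
# Minimal sketch of the Entropy2Vec hypothesis: a model trained on one
# language should assign LOWER average prediction entropy to structurally
# similar text and HIGHER entropy to divergent text.
import math
from collections import Counter, defaultdict

def train_bigram_lm(text):
    """Train a character-bigram model on one language's corpus."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def avg_entropy(model, text, vocab):
    """Mean per-character entropy (bits) of the model's next-char
    predictions on `text`, with add-one smoothing over `vocab`."""
    total, n = 0.0, 0
    V = len(vocab)
    for a in text[:-1]:
        follow = model.get(a, Counter())
        denom = sum(follow.values()) + V  # add-one smoothing
        ent = 0.0
        for c in vocab:
            p = (follow[c] + 1) / denom
            ent -= p * math.log2(p)
        total += ent
        n += 1
    return total / max(n, 1)

# Toy corpora standing in for monolingual data in three "languages".
lang_a  = "the cat sat on the mat and the dog sat on the log " * 20
lang_a2 = "the dog ran to the cat and sat on the mat " * 20  # similar
lang_b  = "zqv kpw xjz qvk wpx jzq vkp wxj zqv kpw " * 20    # divergent

vocab = sorted(set(lang_a + lang_a2 + lang_b))
lm = train_bigram_lm(lang_a)

# Similar text yields lower entropy than divergent text.
print(avg_entropy(lm, lang_a2, vocab) < avg_entropy(lm, lang_b, vocab))
```

Repeating this with one trained model per language yields, for each language, a dense vector of entropies over all other languages, with no missing entries, which is the shape of the embedding the abstract describes.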
Related papers
- Benchmarking Concept-Spilling Across Languages in LLMs [7.577675422356702]
Large Language Models (LLMs) exhibit remarkable cross-lingual abilities, yet often show a systematic bias toward representations from other languages. This paper presents a novel comparative framework for evaluating multilingual semantic robustness by measuring how models handle polysemous words across languages.
arXiv Detail & Related papers (2026-01-18T19:28:26Z) - Parallel Universes, Parallel Languages: A Comprehensive Study on LLM-based Multilingual Counterfactual Example Generation [49.2073409243885]
Large language models (LLMs) excel at generating English counterfactuals and demonstrate multilingual proficiency. We conduct automatic evaluations on both directly generated counterfactuals in the target languages and those derived via English translation across six languages. We identify and categorize four main types of errors that consistently appear in the generated counterfactuals across languages.
arXiv Detail & Related papers (2026-01-01T08:53:49Z) - Analyzing The Language of Visual Tokens [48.62180485759458]
We take a natural-language-centric approach to analyzing discrete visual languages.
We show that higher token innovation drives greater entropy and lower compression, with tokens predominantly representing object parts.
We also show that visual languages lack cohesive grammatical structures, leading to higher perplexity and weaker hierarchical organization compared to natural languages.
arXiv Detail & Related papers (2024-11-07T18:59:28Z) - Linguistically Grounded Analysis of Language Models using Shapley Head Values [2.914115079173979]
We investigate the processing of morphosyntactic phenomena by leveraging a recently proposed method for probing language models via Shapley Head Values (SHVs). Using the English language BLiMP dataset, we test our approach on two widely used models, BERT and RoBERTa, and compare how linguistic constructions are handled. Our results show that SHV-based attributions reveal distinct patterns across both models, providing insights into how language models organize and process linguistic information.
arXiv Detail & Related papers (2024-10-17T09:48:08Z) - Robustness of the Random Language Model [0.0]
The model suggests a simple picture of first language learning as a type of annealing in the vast space of potential languages.
It implies a single continuous transition to grammatical syntax, at which the symmetry among potential words and categories is spontaneously broken.
Results are discussed in light of theory of first-language acquisition in linguistics, and recent successes in machine learning.
arXiv Detail & Related papers (2023-09-26T13:14:35Z) - Language Models are Few-shot Multilingual Learners [66.11011385895195]
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages.
We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
arXiv Detail & Related papers (2021-09-16T03:08:22Z) - Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z) - Constructing a Family Tree of Ten Indo-European Languages with Delexicalized Cross-linguistic Transfer Patterns [57.86480614673034]
We formalize the delexicalized transfer as interpretable tree-to-string and tree-to-tree patterns.
This allows us to quantitatively probe cross-linguistic transfer and extend inquiries of Second Language Acquisition.
arXiv Detail & Related papers (2020-07-17T15:56:54Z) - Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers.
We show that some features from various linguistic types can be predicted reliably.
arXiv Detail & Related papers (2020-04-30T21:00:53Z) - Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z) - Do Neural Language Models Show Preferences for Syntactic Formalisms? [14.388237635684737]
We study the extent to which the semblance of syntactic structure captured by language models adheres to a surface-syntactic or deep syntactic style of analysis.
We apply a probe for extracting directed dependency trees to BERT and ELMo models trained on 13 different languages.
We find that both models exhibit a preference for UD over SUD, with interesting variations across languages and layers.
arXiv Detail & Related papers (2020-04-29T11:37:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.