Modality Matching Matters: Calibrating Language Distances for Cross-Lingual Transfer in URIEL+
- URL: http://arxiv.org/abs/2510.19217v1
- Date: Wed, 22 Oct 2025 03:59:19 GMT
- Title: Modality Matching Matters: Calibrating Language Distances for Cross-Lingual Transfer in URIEL+
- Authors: York Hay Ng, Aditya Khan, Xiang Lu, Matteo Salloum, Michael Zhou, Phuong H. Hoang, A. Seza Doğruöz, En-Shiun Annie Lee
- Abstract summary: We introduce a framework for type-matched language distances. We propose novel, structure-aware representations for each distance type. We unify these signals into a robust, task-agnostic composite distance.
- Score: 4.262015666551064
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Existing linguistic knowledge bases such as URIEL+ provide valuable geographic, genetic and typological distances for cross-lingual transfer but suffer from two key limitations. One, their one-size-fits-all vector representations are ill-suited to the diverse structures of linguistic data, and two, they lack a principled method for aggregating these signals into a single, comprehensive score. In this paper, we address these gaps by introducing a framework for type-matched language distances. We propose novel, structure-aware representations for each distance type: speaker-weighted distributions for geography, hyperbolic embeddings for genealogy, and a latent variables model for typology. We unify these signals into a robust, task-agnostic composite distance. In selecting transfer languages, our representations and composite distances consistently improve performance across a wide range of NLP tasks, providing a more principled and effective toolkit for multilingual research.
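The abstract names three type-matched representations and a composite score. As a minimal, hypothetical sketch of two of these ideas — hyperbolic (Poincaré) distance for genealogy and a simple weighted aggregation — the following uses NumPy; the paper's actual embeddings and calibration procedure are not specified in the abstract, so the function names and the equal-weight default here are illustrative assumptions:

```python
import numpy as np

def poincare_distance(u, v):
    """Hyperbolic distance in the Poincare ball; tree-like genealogical
    hierarchies embed with low distortion in hyperbolic space."""
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq / denom))

def composite_distance(per_type, weights=None):
    """Aggregate per-type distances (e.g. geographic, genetic,
    typological) into one score; equal weights by default."""
    d = np.array(list(per_type.values()), dtype=float)
    w = np.full(len(d), 1.0 / len(d)) if weights is None else np.asarray(weights)
    return float(w @ d)
```

A task-agnostic composite like this can be used to rank candidate transfer languages by ascending distance from the target language.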
Related papers
- Multilingual Text-to-Image Person Retrieval via Bidirectional Relation Reasoning and Aligning [81.43257201833154]
We propose Bi-IRRA: a Bidirectional Implicit Relation Reasoning and Aligning framework to learn alignment across languages and modalities. Within Bi-IRRA, a bidirectional implicit relation reasoning module enables bidirectional prediction of masked image and text. The proposed method achieves new state-of-the-art results on all multilingual TIPR datasets.
arXiv Detail & Related papers (2025-10-20T16:01:11Z)
- Untangling the Influence of Typology, Data and Model Architecture on Ranking Transfer Languages for Cross-Lingual POS Tagging [1.743739675178893]
Cross-lingual transfer learning is an invaluable tool for overcoming data scarcity. The precise roles of linguistic typology, training data, and model architecture in transfer language choice are not fully understood. We take a holistic approach, examining how both dataset-specific and fine-grained typological features influence transfer language selection.
arXiv Detail & Related papers (2025-03-25T18:05:40Z)
- High-Dimensional Interlingual Representations of Large Language Models [65.77317753001954]
Large language models (LLMs) trained on massive multilingual datasets hint at the formation of interlingual constructs. We explore 31 diverse languages varying in their resource levels, typologies, and geographical regions. We find that multilingual LLMs exhibit inconsistent cross-lingual alignments.
arXiv Detail & Related papers (2025-03-14T10:39:27Z)
- Zero-shot Cross-lingual Transfer Learning with Multiple Source and Target Languages for Information Extraction: Language Selection and Adversarial Training [38.19963761398705]
This paper provides a detailed analysis of Cross-Lingual Multi-Transferability (many-to-many transfer learning) for recent IE corpora.
We first determine the correlation between single-language performance and a wide range of linguistic-based distances.
Next, we investigate the more general zero-shot multi-lingual transfer settings where multiple languages are involved in the training and evaluation processes.
arXiv Detail & Related papers (2024-11-13T17:13:25Z)
- Cross-Lingual Transfer Robustness to Lower-Resource Languages on Adversarial Datasets [4.653113033432781]
We investigate the cross-lingual transfer capabilities of Multilingual Language Models (MLLMs).
Our research provides valuable insights into cross-lingual transfer and its implications for NLP applications.
arXiv Detail & Related papers (2024-03-29T08:47:15Z)
- Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking [66.76141128555099]
We propose a novel cross-lingual biomedical entity linking task (XL-BEL).
We first investigate the ability of standard knowledge-agnostic as well as knowledge-enhanced monolingual and multilingual LMs beyond the standard monolingual English BEL task.
We then address the challenge of transferring domain-specific knowledge in resource-rich languages to resource-poor ones.
arXiv Detail & Related papers (2021-05-30T00:50:00Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- Probing Multilingual BERT for Genetic and Typological Signals [28.360662552057324]
We probe the layers in multilingual BERT (mBERT) for phylogenetic and geographic language signals across 100 languages.
We employ the language distances to infer and evaluate language trees, finding that they are close to the reference family tree in terms of quartet tree distance.
arXiv Detail & Related papers (2020-11-04T00:03:04Z)
- Inducing Language-Agnostic Multilingual Representations [61.97381112847459]
Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world.
We examine three approaches for this: (i) re-aligning the vector spaces of target languages to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering.
arXiv Detail & Related papers (2020-08-20T17:58:56Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
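The last entry above combines typological and learned language vectors via singular vector canonical correlation analysis (SVCCA). A minimal NumPy sketch of the core idea — reduce each view with SVD, then compute canonical correlations between the reduced views — follows; the data shapes and the choice of `k` are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

def svcca(X, Y, k=5):
    """Singular-vector CCA: project each view onto its top-k singular
    directions, then return the canonical correlations between them."""
    def reduce(M, k):
        M = M - M.mean(axis=0)                       # center the view
        U, S, _ = np.linalg.svd(M, full_matrices=False)
        return U[:, :k] * S[:k]                      # top-k SVD projection
    Xr, Yr = reduce(X, k), reduce(Y, k)
    Qx, _ = np.linalg.qr(Xr)                         # orthonormal bases
    Qy, _ = np.linalg.qr(Yr)
    # singular values of Qx^T Qy are the canonical correlations
    return np.linalg.svd(Qx.T @ Qy, compute_uv=False)
```

High canonical correlations indicate that the two views (e.g. typological feature vectors and learned language embeddings) encode overlapping information about language relationships.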
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.