Concept Space Alignment in Multilingual LLMs
- URL: http://arxiv.org/abs/2410.01079v1
- Date: Tue, 1 Oct 2024 21:21:00 GMT
- Title: Concept Space Alignment in Multilingual LLMs
- Authors: Qiwei Peng, Anders Søgaard
- Abstract summary: We show that multilingual models suffer from two familiar weaknesses: generalization works best for languages with similar typology, and for abstract concepts.
For some models, prompt-based embeddings align better than word embeddings, but the projections are less linear.
- Score: 47.633314194898134
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multilingual large language models (LLMs) seem to generalize somewhat across languages. We hypothesize this is a result of implicit vector space alignment. Evaluating such alignment, we see that larger models exhibit very high-quality linear alignments between corresponding concepts in different languages. Our experiments show that multilingual LLMs suffer from two familiar weaknesses: generalization works best for languages with similar typology, and for abstract concepts. For some models, e.g., the Llama-2 family of models, prompt-based embeddings align better than word embeddings, but the projections are less linear -- an observation that holds across almost all model families, indicating that some of the implicitly learned alignments are broken somewhat by prompt-based methods.
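The alignment evaluation lends itself to a compact illustration. Below is a minimal, self-contained sketch of fitting a linear (orthogonal Procrustes) map between row-aligned concept embeddings from two languages and scoring nearest-neighbour retrieval; the data is synthetic and the dimensions are toy values, not the paper's actual setup.

```python
# Toy check of the linear-alignment hypothesis: given row-aligned concept
# embeddings X (language A) and Y (language B), fit an orthogonal map and
# score retrieval. With a real LLM, X and Y would be hidden states for
# translated concept pairs.
import numpy as np

def procrustes_align(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Orthogonal W minimizing ||XW - Y||_F (closed form via SVD)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def precision_at_1(X: np.ndarray, Y: np.ndarray, W: np.ndarray) -> float:
    """Fraction of concepts whose mapped vector is nearest its own target."""
    XW = X @ W
    XW /= np.linalg.norm(XW, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return float(((XW @ Yn.T).argmax(axis=1) == np.arange(len(X))).mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 256))               # 1000 concepts, 256-dim (toy)
Q, _ = np.linalg.qr(rng.normal(size=(256, 256)))
Y = X @ Q + 0.1 * rng.normal(size=X.shape)     # rotated + noisy "other language"
W = procrustes_align(X, Y)
print(f"P@1 after linear alignment: {precision_at_1(X, Y, W):.3f}")
```

The same retrieval score computed without fitting W (identity map) gives the baseline against which "high-quality linear alignment" can be judged.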
Related papers
- Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts.
We find that Llama Instruct and Mistral models exhibit high degrees of language confusion.
We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
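A hedged sketch of the kind of metric "language confusion" implies: generate in a requested language and check the reply's language with an off-the-shelf identifier. The `generate` callable is a hypothetical stand-in for whatever model is under evaluation.

```python
# Minimal language-confusion metric: the fraction of responses not written
# in the requested language.
from langdetect import detect  # pip install langdetect

def language_confusion_rate(prompts: list[str], expected_lang: str, generate) -> float:
    """Share of generations whose detected language differs from `expected_lang`."""
    confused = 0
    for prompt in prompts:
        reply = generate(prompt)
        try:
            if detect(reply) != expected_lang:
                confused += 1
        except Exception:  # langdetect can fail on very short or empty strings
            confused += 1
    return confused / len(prompts)

# Few-shot mitigation, as the paper suggests, could simply prepend
# in-language demonstrations to each prompt before calling `generate`.
```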
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, effectively being crosslingual?
This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- Hyperpolyglot LLMs: Cross-Lingual Interpretability in Token Embeddings [4.2243058640527575]
Cross-lingual transfer learning is an important property of multilingual large language models (LLMs).
Our research opens the door for investigations into 1) the effect of pre-training and model architectures on representations of languages, and 2) the applications of cross-lingual representations embedded in language models.
arXiv Detail & Related papers (2023-11-29T19:20:14Z)
- Counterfactually Probing Language Identity in Multilingual Models [15.260518230218414]
We use AlterRep, a method of counterfactual probing, to explore the internal structure of multilingual models.
We find that, given a template in Language X, pushing towards Language Y systematically increases the probability of Language Y words.
arXiv Detail & Related papers (2023-10-29T01:21:36Z)
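The probing result above can be sketched generically: learn a linear direction separating Language X from Language Y in the hidden states, then push a representation toward Y and re-score words. The actual AlterRep procedure intervenes on a probe-identified subspace and is more involved; this uses a single mean-difference direction for simplicity.

```python
# Illustrative counterfactual push along a language-identity direction.
import numpy as np

def language_direction(h_x: np.ndarray, h_y: np.ndarray) -> np.ndarray:
    """Unit vector from the mean Language-X state to the mean Language-Y state."""
    d = h_y.mean(axis=0) - h_x.mean(axis=0)
    return d / np.linalg.norm(d)

def push_towards(h: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Remove the current component along `direction`, then set it to +alpha."""
    h_neutral = h - (h @ direction) * direction
    return h_neutral + alpha * direction

# With a real model, one would re-run the remaining layers / LM head on the
# pushed state and compare P(word in Language Y) before and after the push.
```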
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
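The clustering step described above admits a minimal sketch: mean-pool per-language sentence representations, then cluster languages into groups. Both `corpora` and `embed` are hypothetical placeholders for a per-language sentence sample and an encoder returning one vector per sentence.

```python
# Cluster languages by their average representation into "sprachbund" groups.
import numpy as np
from sklearn.cluster import KMeans

def language_vectors(corpora: dict[str, list[str]], embed) -> tuple[list[str], np.ndarray]:
    """One mean-pooled vector per language."""
    langs = sorted(corpora)
    vecs = np.stack([np.mean([embed(s) for s in corpora[l]], axis=0) for l in langs])
    return langs, vecs

def representation_sprachbunds(corpora, embed, n_groups: int = 4) -> dict[str, int]:
    langs, vecs = language_vectors(corpora, embed)
    labels = KMeans(n_clusters=n_groups, n_init=10, random_state=0).fit_predict(vecs)
    return dict(zip(langs, labels.tolist()))
```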
- ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora [21.78571365050787]
ERNIE-M is a new training method that encourages the model to align the representation of multiple languages with monolingual corpora.
We generate pseudo-parallel sentence pairs on a monolingual corpus to enable the learning of semantic alignment between different languages.
Experimental results show that ERNIE-M outperforms existing cross-lingual models and delivers new state-of-the-art results on various cross-lingual downstream tasks.
arXiv Detail & Related papers (2020-12-31T15:52:27Z)
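ERNIE-M's actual objective combines several losses; as a hedged stand-in for the pseudo-parallel idea, here is a generic contrastive alignment loss over pairs produced by an existing translator. `encode` and `translate` are hypothetical names, not ERNIE-M's API.

```python
# Pull pseudo-parallel sentence embeddings together with an InfoNCE-style
# loss; row i of src_emb is assumed aligned with row i of tgt_emb.
import torch
import torch.nn.functional as F

def alignment_loss(src_emb: torch.Tensor, tgt_emb: torch.Tensor, tau: float = 0.05):
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.T / tau                      # batch-wise similarity matrix
    targets = torch.arange(len(src), device=src.device)
    return F.cross_entropy(logits, targets)

# Hypothetical loop: `translate` makes the pseudo-parallel side from
# monolingual text; `encode` is the multilingual encoder being trained.
# for batch in monolingual_loader:
#     loss = alignment_loss(encode(batch), encode([translate(s) for s in batch]))
```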
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
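A minimal PyTorch sketch of the translate-and-reconstruct idea above: one LSTM encoder whose final state initializes two decoders, one reconstructing the input and one translating it. Real systems add attention and subword vocabularies; all sizes here are illustrative.

```python
import torch
import torch.nn as nn

class TranslateReconstruct(nn.Module):
    def __init__(self, src_vocab: int, tgt_vocab: int, dim: int = 256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.rec_decoder = nn.LSTM(dim, dim, batch_first=True)    # reconstructs source
        self.trans_decoder = nn.LSTM(dim, dim, batch_first=True)  # translates to target
        self.rec_out = nn.Linear(dim, src_vocab)
        self.trans_out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, src_in_ids, tgt_in_ids):
        # Both decoders start from the shared encoder state and are
        # teacher-forced with shifted inputs.
        _, state = self.encoder(self.src_embed(src_ids))
        rec_h, _ = self.rec_decoder(self.src_embed(src_in_ids), state)
        trans_h, _ = self.trans_decoder(self.tgt_embed(tgt_in_ids), state)
        return self.rec_out(rec_h), self.trans_out(trans_h)

# Training would sum a reconstruction and a translation cross-entropy;
# the shared encoder states then serve as contextualised embeddings.
```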
- LNMap: Departures from Isomorphic Assumption in Bilingual Lexicon Induction Through Non-Linear Mapping in Latent Space [17.49073364781107]
We propose a novel semi-supervised method to learn cross-lingual word embeddings for bilingual lexicon induction.
Our model does not rely on the isomorphic assumption and uses non-linear mapping in the latent space of two independently trained auto-encoders.
arXiv Detail & Related papers (2020-04-28T23:28:26Z)
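A sketch of the LNMap idea under stated assumptions: one autoencoder per language's embeddings, trained independently, plus a small non-linear mapper between their latent spaces fit on a seed lexicon. Architecture and sizes are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

def autoencoder(dim: int = 300, latent: int = 128) -> tuple[nn.Module, nn.Module]:
    """Tiny encoder/decoder pair for one language's word embeddings."""
    enc = nn.Sequential(nn.Linear(dim, latent), nn.Tanh())
    dec = nn.Linear(latent, dim)
    return enc, dec

class LatentMapper(nn.Module):
    """Non-linear map between the two latent spaces; no isomorphism assumed."""
    def __init__(self, latent: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent, latent), nn.ReLU(), nn.Linear(latent, latent))

    def forward(self, z):
        return self.net(z)

# Training outline: (1) fit each autoencoder on its own embeddings with a
# reconstruction loss; (2) fit the mapper so that for seed pairs (x, y),
# mapper(enc_src(x)) is close to enc_tgt(y); (3) induce a lexicon by
# nearest-neighbour search in the target latent space.
```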
- On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)
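One simple correction explored in this line of work is per-language mean centering of contextual embeddings, which makes representations markedly more language-neutral. A minimal sketch, assuming embeddings are already extracted per language:

```python
import numpy as np

def center_by_language(emb: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    """Subtract each language's centroid from its embeddings.

    `emb` maps a language code to an (n_sentences, dim) matrix."""
    return {lang: m - m.mean(axis=0, keepdims=True) for lang, m in emb.items()}
```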
- Refinement of Unsupervised Cross-Lingual Word Embeddings [2.4366811507669124]
Cross-lingual word embeddings aim to bridge the gap between high-resource and low-resource languages.
We propose a self-supervised method to refine the alignment of unsupervised bilingual word embeddings.
arXiv Detail & Related papers (2020-02-21T10:39:53Z)
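Refinement loops of this kind typically alternate dictionary induction and re-fitting of the mapping. A generic sketch of that loop, not necessarily this paper's exact self-supervised procedure:

```python
import numpy as np

def refine(X: np.ndarray, Y: np.ndarray, W: np.ndarray, iters: int = 5) -> np.ndarray:
    """Iteratively re-fit an orthogonal map from an induced dictionary.

    X, Y are source/target embedding matrices; W is an initial map from an
    unsupervised method."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    for _ in range(iters):
        pairs = (Xn @ W @ Yn.T).argmax(axis=1)       # induce a dictionary
        U, _, Vt = np.linalg.svd(Xn.T @ Yn[pairs])   # Procrustes re-fit
        W = U @ Vt
    return W
```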