Exploring Anisotropy and Outliers in Multilingual Language Models for Cross-Lingual Semantic Sentence Similarity
- URL: http://arxiv.org/abs/2306.00458v2
- Date: Wed, 7 Jun 2023 12:55:49 GMT
- Title: Exploring Anisotropy and Outliers in Multilingual Language Models for Cross-Lingual Semantic Sentence Similarity
- Authors: Katharina Hämmerl, Alina Fastowski, Jindřich Libovický, Alexander Fraser
- Score: 64.18762301574954
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Previous work has shown that the representations output by contextual
language models are more anisotropic than static type embeddings, and typically
display outlier dimensions. This seems to be true for both monolingual and
multilingual models, although much less work has been done on the multilingual
context. Why these outliers occur and how they affect the representations is
still an active area of research. We investigate outlier dimensions and their
relationship to anisotropy in multiple pre-trained multilingual language
models. We focus on cross-lingual semantic similarity tasks, as these are
natural tasks for evaluating multilingual representations. Specifically, we
examine sentence representations. Sentence transformers which are fine-tuned on
parallel resources (that are not always available) perform better on this task,
and we show that their representations are more isotropic. However, we aim to
improve multilingual representations in general. We investigate how much of the
performance difference can be made up by only transforming the embedding space
without fine-tuning, and visualise the resulting spaces. We test different
operations: removing individual outlier dimensions, cluster-based isotropy
enhancement, and ZCA whitening. We publish our code for reproducibility.
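The three operations lend themselves to a compact sketch. Below is a minimal numpy/scikit-learn illustration, assuming sentence embeddings stacked in a matrix X of shape (n_sentences, dim); the outlier count, cluster count, and number of discarded principal components are illustrative defaults, not the authors' settings, so their released code remains authoritative.

```python
# Hedged sketch of the three embedding-space operations named in the abstract.
import numpy as np
from sklearn.cluster import KMeans

def remove_outlier_dims(X, k=3):
    """Zero out the k dimensions with the largest mean absolute value
    (one common heuristic for identifying outlier dimensions)."""
    outliers = np.argsort(np.abs(X.mean(axis=0)))[-k:]
    X = X.copy()
    X[:, outliers] = 0.0
    return X

def cluster_based_isotropy_enhancement(X, n_clusters=7, n_pcs=12):
    """Per cluster: subtract the cluster mean, then project out the dominant
    principal components (cf. Rajaee & Pilehvar, 2021)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    out = np.empty_like(X)
    for c in range(n_clusters):
        Xc = X[labels == c] - X[labels == c].mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        D = Vt[:n_pcs]                       # dominant directions
        out[labels == c] = Xc - Xc @ D.T @ D
    return out

def zca_whiten(X, eps=1e-5):
    """ZCA whitening: decorrelate the dimensions and equalise their variances."""
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return Xc @ W
```

Of the three, ZCA whitening is the most aggressive intervention, since it equalises variance across every dimension, while outlier removal touches only a handful of them.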
Related papers
- Exploring Representational Disparities Between Multilingual and Bilingual Translation Models [16.746335565636976]
For some language pairs, multilingual models perform worse than their bilingual counterparts, especially in the one-to-many translation setting.
We show that for a given language pair, its multilingual model decoder representations are consistently less isotropic and occupy fewer dimensions than comparable bilingual model decoder representations.
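Both quantities in that claim can be estimated directly from a matrix of decoder states. A hedged numpy sketch, where anisotropy is approximated as the expected cosine similarity of random representation pairs and "dimensions occupied" as the number of principal components needed to reach a variance threshold (the 99% cutoff and the pair count are arbitrary illustrative choices):

```python
import numpy as np

def avg_cosine_similarity(X, n_pairs=10_000, seed=0):
    """Higher values indicate stronger anisotropy; ~0 for isotropic spaces."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X), n_pairs)
    j = rng.integers(0, len(X), n_pairs)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return float(np.mean(np.sum(Xn[i] * Xn[j], axis=1)))

def dims_for_variance(X, threshold=0.99):
    """Number of principal components needed to explain `threshold` variance."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values, descending
    var = s**2 / np.sum(s**2)
    return int(np.searchsorted(np.cumsum(var), threshold) + 1)
```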
arXiv Detail & Related papers (2023-05-23T16:46:18Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
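The generative model itself cannot be reconstructed from this summary, but the bitext-mining evaluation it mentions reduces to nearest-neighbour retrieval over sentence embeddings. A minimal cosine-similarity version (real pipelines often use a margin-based score instead):

```python
import numpy as np

def mine_bitext(src_emb, tgt_emb):
    """For each source sentence, return the index of the most similar
    target sentence under cosine similarity."""
    s = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    t = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    return np.argmax(s @ t.T, axis=1)
```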
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Learning an Artificial Language for Knowledge-Sharing in Multilingual Translation [15.32063273544696]
We discretize the latent space of multilingual models by assigning encoder states to entries in a codebook.
We validate our approach on large-scale experiments with realistic data volumes and domains.
We also use the learned artificial language to analyze model behavior, and discover that using a similar bridge language increases knowledge-sharing among the remaining languages.
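The discretisation step is a standard vector-quantisation assignment, sketched below. The codebook here is assumed to be already learned, and the straight-through gradient trick used during training is omitted:

```python
import numpy as np

def quantize(states, codebook):
    """states: (seq, dim); codebook: (K, dim).
    Replace each encoder state with its nearest codebook entry."""
    # Squared L2 distance from every state to every codebook entry.
    d = ((states[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = d.argmin(axis=1)        # discrete "artificial language" tokens
    return codes, codebook[codes]   # token ids and quantized states
```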
arXiv Detail & Related papers (2022-11-02T17:14:42Z)
- Does Transliteration Help Multilingual Language Modeling? [0.0]
We empirically measure the effect of transliteration on multilingual language models.
We focus on the Indic languages, which have the highest script diversity in the world.
We find that transliteration benefits the low-resource languages without negatively affecting the comparatively high-resource languages.
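As a toy illustration of why this helps: transliterating related languages written in different scripts into one common romanisation lets them share subword vocabulary. The character table below is a deliberately tiny stand-in for a real scheme such as ISO 15919:

```python
# Purely illustrative character tables; a real setup would use a proper
# transliteration library and scheme.
DEVANAGARI_TO_LATIN = {"क": "ka", "म": "ma", "ल": "la"}
BENGALI_TO_LATIN = {"ক": "ka", "ম": "ma", "ল": "la"}

def transliterate(text, table):
    return "".join(table.get(ch, ch) for ch in text)

# "कमल" (Hindi) and "কমল" (Bengali) both mean "lotus"; after transliteration
# they map to the same string, so a shared subword vocabulary can exploit
# the overlap between the two languages.
assert transliterate("कमल", DEVANAGARI_TO_LATIN) == \
       transliterate("কমল", BENGALI_TO_LATIN) == "kamala"
```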
arXiv Detail & Related papers (2022-01-29T05:48:42Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
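A minimal sketch of the grouping step, assuming one matrix of sentence embeddings per language from some multilingual encoder; the cluster count is an arbitrary example value:

```python
import numpy as np
from sklearn.cluster import KMeans

def representation_sprachbunds(lang_sentence_embs, n_groups=4):
    """lang_sentence_embs: dict lang -> (n_sents, dim) array of embeddings.
    Average each language's sentences into one vector, then cluster."""
    langs = sorted(lang_sentence_embs)
    lang_vecs = np.stack([lang_sentence_embs[l].mean(axis=0) for l in langs])
    labels = KMeans(n_clusters=n_groups, n_init=10).fit_predict(lang_vecs)
    groups = {}
    for lang, g in zip(langs, labels):
        groups.setdefault(int(g), []).append(lang)
    return groups  # each value is one candidate "representation sprachbund"
```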
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models [96.32118305166412]
We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks.
We find that languages which are adequately represented in the multilingual model's vocabulary exhibit negligible performance decreases over their monolingual counterparts.
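A common proxy for how well a vocabulary represents a language is tokenizer fertility, the average number of subwords produced per word (higher means more fragmentation). A sketch using Hugging Face tokenizers; the model names are only examples:

```python
from transformers import AutoTokenizer

def fertility(tokenizer_name, sentences):
    """Average subwords per whitespace-separated word (a crude word split)."""
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    words = sum(len(s.split()) for s in sentences)
    subwords = sum(len(tok.tokenize(s)) for s in sentences)
    return subwords / words

# One sentence is far too little for a real estimate; use a large corpus.
sents = ["Ein Beispiel reicht für eine grobe Schätzung nicht aus."]
print(fertility("bert-base-multilingual-cased", sents))  # multilingual
print(fertility("bert-base-german-cased", sents))        # monolingual baseline
```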
arXiv Detail & Related papers (2020-12-31T14:11:00Z)
- What makes multilingual BERT multilingual? [60.9051207862378]
In this work, we provide an in-depth experimental study to supplement the existing literature on cross-lingual ability.
We compare the cross-lingual ability of non-contextualized and contextualized representation models trained on the same data.
We find that data size and context window size are crucial factors for transferability.
arXiv Detail & Related papers (2020-10-20T05:41:56Z)
- On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
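The summary does not spell out the mechanics, but a simple operation in this line of work is to subtract a per-language centroid so that the language-identity component is removed from each embedding. A sketch under that assumption:

```python
import numpy as np

def center_by_language(embs_by_lang):
    """embs_by_lang: dict lang -> (n, dim) array. Subtracting each language's
    mean removes the language-identity direction, leaving a more
    language-neutral part of the representation."""
    return {lang: X - X.mean(axis=0) for lang, X in embs_by_lang.items()}
```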
arXiv Detail & Related papers (2020-04-09T19:50:32Z)