Understanding Cross-Lingual Alignment -- A Survey
- URL: http://arxiv.org/abs/2404.06228v2
- Date: Tue, 11 Jun 2024 17:33:52 GMT
- Title: Understanding Cross-Lingual Alignment -- A Survey
- Authors: Katharina Hämmerl, Jindřich Libovický, Alexander Fraser
- Abstract summary: Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
- Score: 52.572071017877704
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-lingual alignment, the meaningful similarity of representations across languages in multilingual language models, has been an active field of research in recent years. We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field. We present different understandings of cross-lingual alignment and their limitations. We provide a qualitative summary of results from a large number of surveyed papers. Finally, we discuss how these insights may be applied not only to encoder models, where this topic has been heavily studied, but also to encoder-decoder or even decoder-only models, and argue that an effective trade-off between language-neutral and language-specific information is key.
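The survey's core notion, alignment as meaningful similarity of representations across languages, can be illustrated with a minimal sketch. The helper names below are invented for illustration, and the sketch assumes sentence embeddings (e.g. mean-pooled encoder states) for translation pairs are already available:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def alignment_score(src_embs, tgt_embs):
    """Mean cosine similarity over pairs of embeddings, where src_embs[i]
    and tgt_embs[i] are assumed to represent mutual translations."""
    return float(np.mean([cosine(s, t) for s, t in zip(src_embs, tgt_embs)]))

# Toy check with random "embeddings": identical representations align perfectly.
rng = np.random.default_rng(0)
embs = rng.normal(size=(4, 8))
print(alignment_score(embs, embs))  # ~1.0 for identical representations
```

In practice, papers in the survey use measures along these lines (cosine similarity, retrieval accuracy) to quantify how close translations sit in representation space.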
Related papers
- Locally Measuring Cross-lingual Lexical Alignment: A Domain and Word Level Perspective [15.221506468189345]
We present a methodology for analyzing both synthetic validations and a novel naturalistic validation using lexical gaps in the kinship domain.
Our analysis spans 16 diverse languages, demonstrating that there is substantial room for improvement with the use of newer language models.
arXiv Detail & Related papers (2024-10-07T16:37:32Z) - Exploring Alignment in Shared Cross-lingual Spaces [15.98134426166435]
We employ clustering to uncover latent concepts within multilingual models.
Our analysis focuses on quantifying the alignment and overlap of these concepts across various languages.
Our study encompasses three multilingual models (mT5, mBERT, and XLM-R) and three downstream tasks (Machine Translation, Named Entity Recognition, and Sentiment Analysis).
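A rough sketch of this kind of clustering-based analysis (not the paper's actual method; the function names and toy data are invented): cluster pooled embeddings from two languages, then ask what fraction of clusters mix both languages.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means: returns a cluster label for each row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return labels

def concept_overlap(X, langs, k):
    """Fraction of clusters whose members come from more than one language --
    a crude proxy for cross-lingual 'overlap' of latent concepts."""
    labels = kmeans(X, k)
    shared = sum(len(set(langs[labels == j])) > 1 for j in range(k))
    return shared / k

# Toy data: embeddings from two "languages" drawn around the same concept region.
rng = np.random.default_rng(1)
en = rng.normal(0, 0.1, size=(20, 4)) + np.array([1.0, 0, 0, 0])
de = rng.normal(0, 0.1, size=(20, 4)) + np.array([1.0, 0, 0, 0])
X = np.vstack([en, de])
langs = np.array(["en"] * 20 + ["de"] * 20)
print(concept_overlap(X, langs, k=2))
```

Well-aligned models should yield mostly mixed clusters; language-specific representations would yield clusters dominated by a single language.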
arXiv Detail & Related papers (2024-05-23T13:20:24Z) - InfoCTM: A Mutual Information Maximization Perspective of Cross-Lingual Topic Modeling [40.54497836775837]
Cross-lingual topic models have been prevalent for cross-lingual text analysis by revealing aligned latent topics.
Most existing methods suffer from producing repetitive topics, which hinder further analysis, and from performance decline caused by low-coverage dictionaries.
We propose Cross-lingual Topic Modeling with Mutual Information (InfoCTM) to produce more coherent, diverse, and well-aligned topics.
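The mutual-information idea can be illustrated for discrete topic assignments. This is a generic MI estimator over aligned document pairs, not InfoCTM's actual training objective:

```python
import math
import numpy as np

def mutual_information(x, y):
    """MI (in nats) between two discrete label sequences, e.g. the topic
    assignments of translation-aligned documents in two languages."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            pxy = np.mean((x == a) & (y == b))  # joint probability
            if pxy > 0:
                mi += pxy * math.log(pxy / (np.mean(x == a) * np.mean(y == b)))
    return mi

# Perfectly aligned topic assignments give MI = log 2; independent ones give ~0.
print(round(mutual_information([0, 1, 0, 1], [0, 1, 0, 1]), 4))  # 0.6931
```

High MI between the topics assigned to a document and its translation indicates well-aligned cross-lingual topics.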
arXiv Detail & Related papers (2023-04-07T08:49:43Z) - Cross-lingual Lifelong Learning [53.06904052325966]
We present a principled Cross-lingual Continual Learning (CCL) evaluation paradigm.
We provide insights into what makes multilingual sequential learning particularly challenging.
The implications of this analysis include a recipe for how to measure and balance different cross-lingual continual learning desiderata.
arXiv Detail & Related papers (2022-05-23T09:25:43Z) - Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z) - AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization [41.578594261746055]
We introduce WikiLingua, a large-scale, multilingual dataset for the evaluation of crosslingual abstractive summarization systems.
We extract article and summary pairs in 18 languages from WikiHow, a high quality, collaborative resource of how-to guides on a diverse set of topics written by human authors.
We create gold-standard article-summary alignments across languages by aligning the images that are used to describe each how-to step in an article.
arXiv Detail & Related papers (2020-10-07T00:28:05Z) - Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
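Singular vector CCA (SVCCA), the analysis tool named above, can be sketched in plain NumPy. This is a simplified version (assuming rows are examples and columns are representation dimensions), not the paper's exact pipeline:

```python
import numpy as np

def svcca_similarity(X, Y, keep=0.99):
    """Simplified SVCCA: keep the top singular directions of each
    representation matrix, then average the canonical correlations."""
    def top_dirs(M):
        M = M - M.mean(0)
        U, S, _ = np.linalg.svd(M, full_matrices=False)
        # keep enough directions to explain `keep` of the variance
        k = int(np.searchsorted(np.cumsum(S**2) / np.sum(S**2), keep)) + 1
        return U[:, :k] * S[:k]
    Qx, _ = np.linalg.qr(top_dirs(X))
    Qy, _ = np.linalg.qr(top_dirs(Y))
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)  # canonical correlations
    return float(np.mean(np.clip(rho, 0.0, 1.0)))

# SVCCA is invariant to rotations of the representation space.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
R, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal rotation
print(round(svcca_similarity(X, X @ R), 4))  # 1.0
```

The rotation invariance is what makes CCA-style measures useful for comparing representations learned by different sources or models.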
arXiv Detail & Related papers (2020-04-30T16:25:39Z) - On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
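One way such embedding-based word alignment can work, sketched here as a greedy argmax over cosine similarities (an illustration of the general idea, not the paper's exact method):

```python
import numpy as np

def align_words(src_embs, tgt_embs):
    """Link each source token to the target token with the highest
    cosine similarity between their contextual embeddings."""
    S = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    T = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    sim = S @ T.T
    return [(i, int(j)) for i, j in enumerate(sim.argmax(axis=1))]

# Toy embeddings: the target tokens are a permutation of the source tokens.
src = np.eye(3)
tgt = src[[2, 0, 1]]
print(align_words(src, tgt))  # [(0, 1), (1, 2), (2, 0)]
```

If contextual embeddings are well aligned across languages, this similarity search recovers the correct word links without any supervision.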
arXiv Detail & Related papers (2020-04-09T19:50:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.