A Massively Multilingual Analysis of Cross-linguality in Shared
Embedding Space
- URL: http://arxiv.org/abs/2109.06324v1
- Date: Mon, 13 Sep 2021 21:05:37 GMT
- Title: A Massively Multilingual Analysis of Cross-linguality in Shared
Embedding Space
- Authors: Alex Jones and William Yang Wang and Kyle Mahowald
- Abstract summary: In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
- Score: 61.18554842370824
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In cross-lingual language models, representations for many different
languages live in the same space. Here, we investigate the linguistic and
non-linguistic factors affecting sentence-level alignment in cross-lingual
pretrained language models for 101 languages and 5,050 language pairs. Using
BERT-based LaBSE and BiLSTM-based LASER as our models, and the Bible as our
corpus, we compute a task-based measure of cross-lingual alignment in the form
of bitext retrieval performance, as well as four intrinsic measures of vector
space alignment and isomorphism. We then examine a range of linguistic,
quasi-linguistic, and training-related features as potential predictors of
these alignment metrics. The results of our analyses show that word order
agreement and agreement in morphological complexity are two of the strongest
linguistic predictors of cross-linguality. We also note in-family training data
as a stronger predictor than language-specific training data across the board.
We verify some of our linguistic findings by looking at the effect of
morphological segmentation on English-Inuktitut alignment, in addition to
examining the effect of word order agreement on isomorphism for 66 zero-shot
language pairs from a different corpus. We make the data and code for our
experiments publicly available.
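The task-based alignment measure described above (bitext retrieval) can be sketched in a few lines. This is a minimal illustration, not the authors' released code: it assumes pre-computed sentence embeddings (e.g. from LaBSE or LASER) for a parallel corpus aligned by row index, and scores precision-at-1 under cosine similarity; the function name and toy data are invented for the sketch.

```python
import numpy as np

def bitext_retrieval_accuracy(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """For each source-sentence embedding, retrieve the nearest target
    embedding by cosine similarity; a retrieval counts as correct when
    the nearest neighbour is the gold-aligned sentence (same row index)."""
    # L2-normalise rows so the dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T                # (n_src, n_tgt) similarity matrix
    predicted = sims.argmax(axis=1)   # nearest target for each source sentence
    gold = np.arange(len(src))        # parallel corpora align by row index
    return float((predicted == gold).mean())

# Toy sanity check: a perfectly aligned (identical) embedding space scores 1.0.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 32))
print(bitext_retrieval_accuracy(emb, emb))  # 1.0
```

In the paper's setting this score would be computed per language pair over Bible verse embeddings; higher precision-at-1 indicates better sentence-level cross-lingual alignment.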
Related papers
- The Role of Language Imbalance in Cross-lingual Generalisation: Insights from Cloned Language Experiments [57.273662221547056]
In this study, we investigate a counterintuitive, novel driver of cross-lingual generalisation: language imbalance.
We observe that the existence of a predominant language during training boosts the performance of less frequent languages.
As we extend our analysis to real languages, we find that infrequent languages still benefit from frequent ones, yet whether language imbalance causes cross-lingual generalisation in that setting remains inconclusive.
arXiv Detail & Related papers (2024-04-11T17:58:05Z)
- Event Extraction in Basque: Typologically motivated Cross-Lingual Transfer-Learning Analysis [18.25948580496853]
Cross-lingual transfer-learning is widely used in Event Extraction for low-resource languages.
This paper studies whether the typological similarity between source and target languages impacts the performance of cross-lingual transfer.
arXiv Detail & Related papers (2024-04-09T15:35:41Z)
- GradSim: Gradient-Based Language Grouping for Effective Multilingual Training [13.730907708289331]
We propose GradSim, a language grouping method based on gradient similarity.
Our experiments on three diverse multilingual benchmark datasets show that GradSim leads to the largest performance gains.
Besides linguistic features, the topics of the datasets play an important role for language grouping.
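The gradient-similarity grouping that the GradSim summary names can be sketched roughly as follows. This is a hedged illustration of the general idea, not the paper's algorithm: the function name, the greedy threshold-based grouping rule, and the toy gradient vectors are all assumptions made for the sketch.

```python
import numpy as np

def group_by_gradient_similarity(grads: dict, threshold: float = 0.8) -> list:
    """Greedily group languages whose (flattened) task gradients point in
    similar directions, measured by cosine similarity to each group's seed."""
    # L2-normalise each language's gradient vector once up front.
    unit = {lang: g / np.linalg.norm(g) for lang, g in grads.items()}
    groups: list = []
    for lang in grads:
        for grp in groups:
            # Compare against the group's first member (its seed language).
            if float(unit[lang] @ unit[grp[0]]) >= threshold:
                grp.append(lang)
                break
        else:  # no sufficiently similar group found: start a new one
            groups.append([lang])
    return groups

# Toy gradients: "en" and "de" pull the model the same way, "ja" does not.
toy = {"en": np.array([1.0, 0.0]),
       "de": np.array([0.9, 0.1]),
       "ja": np.array([0.0, 1.0])}
print(group_by_gradient_similarity(toy))  # [['en', 'de'], ['ja']]
```

In practice the gradients would come from backpropagating a task loss per language through a shared multilingual model, and the resulting groups define which languages are trained together.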
arXiv Detail & Related papers (2023-10-23T18:13:37Z)
- The Geometry of Multilingual Language Model Representations [25.880639246639323]
We assess how multilingual language models maintain a shared multilingual representation space while still encoding language-sensitive information in each language.
The subspace means differ along language-sensitive axes that are relatively stable throughout middle layers, and these axes encode information such as token vocabularies.
We visualize representations projected onto language-sensitive and language-neutral axes, identifying language family and part-of-speech clusters, along with spirals, toruses, and curves representing token position information.
arXiv Detail & Related papers (2022-05-22T23:58:24Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and article bodies from language-aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- Cross-lingual Spoken Language Understanding with Regularized Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
arXiv Detail & Related papers (2020-09-30T08:56:53Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
- Investigating Language Impact in Bilingual Approaches for Computational Language Documentation [28.838960956506018]
This paper investigates how the choice of translation language affects subsequent documentation work.
We create 56 bilingual pairs that we apply to the task of low-resource unsupervised word segmentation and alignment.
Our results suggest that incorporating clues into the neural models' input representation increases their translation and alignment quality.
arXiv Detail & Related papers (2020-03-30T10:30:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.