The Secret is in the Spectra: Predicting Cross-lingual Task Performance
with Spectral Similarity Measures
- URL: http://arxiv.org/abs/2001.11136v2
- Date: Mon, 12 Oct 2020 15:13:13 GMT
- Title: The Secret is in the Spectra: Predicting Cross-lingual Task Performance
with Spectral Similarity Measures
- Authors: Haim Dubossarsky, Ivan Vulić, Roi Reichart, Anna Korhonen
- Abstract summary: We present a large-scale study focused on the correlations between monolingual embedding space similarity and task performance.
We introduce several isomorphism measures between two embedding spaces, based on the relevant statistics of their individual spectra.
We empirically show that 1) language similarity scores derived from such spectral isomorphism measures are strongly associated with performance observed in different cross-lingual tasks, and 2) our spectral-based measures consistently outperform previous standard isomorphism measures while being computationally more tractable and easier to interpret.
- Score: 83.53361353172261
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Performance in cross-lingual NLP tasks is impacted by the (dis)similarity of
languages at hand: e.g., previous work has suggested there is a connection
between the expected success of bilingual lexicon induction (BLI) and the
assumption of (approximate) isomorphism between monolingual embedding spaces.
In this work we present a large-scale study focused on the correlations between
monolingual embedding space similarity and task performance, covering thousands
of language pairs and four different tasks: BLI, parsing, POS tagging and MT.
We hypothesize that statistics of the spectrum of each monolingual embedding
space indicate how well they can be aligned. We then introduce several
isomorphism measures between two embedding spaces, based on the relevant
statistics of their individual spectra. We empirically show that 1) language
similarity scores derived from such spectral isomorphism measures are strongly
associated with performance observed in different cross-lingual tasks, and 2)
our spectral-based measures consistently outperform previous standard
isomorphism measures, while being computationally more tractable and easier to
interpret. Finally, our measures capture complementary information to
typologically driven language distance measures, and the combination of
measures from the two families yields even higher task performance
correlations.
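As a rough illustration of the abstract's core idea, the hedged sketch below summarises each monolingual embedding space by statistics of its singular value spectrum and compares the two; the mean-centering, top-k truncation, log-spectrum gap, and effective-condition-number difference are illustrative choices, not the paper's exact measure definitions.

```python
import numpy as np

def spectral_stats(X, k=40):
    """Top-k singular values of a mean-centered embedding matrix X (n_words x dim)."""
    Xc = X - X.mean(axis=0, keepdims=True)
    return np.linalg.svd(Xc, compute_uv=False)[:k]

def singular_value_gap(X, Y, k=40):
    """Squared gap between the sorted log-spectra of two spaces (smaller = more similar)."""
    sx, sy = spectral_stats(X, k), spectral_stats(Y, k)
    return float(np.sum((np.log(sx) - np.log(sy)) ** 2))

def effective_condition_number(X, k=40):
    """Ratio of the largest to the k-th singular value; a rough proxy for how stretched the space is."""
    s = spectral_stats(X, k)
    return float(s[0] / s[-1])

# Illustrative usage with random stand-ins for two monolingual embedding matrices.
rng = np.random.default_rng(0)
emb_en = rng.standard_normal((5000, 300))
emb_fi = rng.standard_normal((5000, 300))
print(singular_value_gap(emb_en, emb_fi))
print(abs(effective_condition_number(emb_en) - effective_condition_number(emb_fi)))
```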
Related papers
- Probing the Emergence of Cross-lingual Alignment during LLM Training [10.053333786023089]
Multilingual Large Language Models (LLMs) achieve remarkable levels of zero-shot cross-lingual transfer performance.
We study how such cross-lingual alignment emerges during pre-training of LLMs.
We observe a high correlation between neuron overlap and downstream performance.
arXiv Detail & Related papers (2024-06-19T05:31:59Z)
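The neuron-overlap measure in the entry above is only described at a high level; one plausible (assumed) instantiation is the Jaccard overlap of each language's most active neurons, sketched below with random activations and an arbitrary top-k cut-off standing in for real hidden states.

```python
import numpy as np

def top_neurons(acts, k=500):
    """Indices of the k neurons with highest mean absolute activation over a language's
    corpus. `acts` is an (n_tokens x n_neurons) matrix; both it and k are assumptions."""
    return set(np.argsort(np.abs(acts).mean(axis=0))[-k:])

def neuron_overlap(acts_l1, acts_l2, k=500):
    """Jaccard overlap of the top-k neurons for two languages."""
    a, b = top_neurons(acts_l1, k), top_neurons(acts_l2, k)
    return len(a & b) / len(a | b)

# Toy usage: random activations stand in for hidden states of two languages.
rng = np.random.default_rng(0)
print(neuron_overlap(rng.standard_normal((1000, 3072)), rng.standard_normal((1000, 3072))))
```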
- Improving Multi-lingual Alignment Through Soft Contrastive Learning [9.454626745893798]
We propose a novel method to align multi-lingual embeddings based on the similarity of sentences measured by a pre-trained mono-lingual embedding model.
Given translation sentence pairs, we train a multi-lingual model such that the similarity between cross-lingual embeddings follows the similarity of sentences measured by the mono-lingual teacher model.
arXiv Detail & Related papers (2024-05-25T09:46:07Z)
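A minimal sketch of the idea in the entry above, under assumed details: the distribution induced by the student's cross-lingual similarities is pushed toward the one induced by a monolingual teacher's sentence similarities. The temperature, in-batch construction, and KL-divergence loss form are assumptions rather than the authors' exact recipe.

```python
import torch
import torch.nn.functional as F

def soft_alignment_loss(student_src, student_tgt, teacher_src, temperature=0.05):
    """Soft contrastive loss: the student's distribution over in-batch targets (from
    cross-lingual similarities) is matched to the distribution induced by a monolingual
    teacher's similarities among the source sentences. All inputs are (batch x dim)."""
    s = F.normalize(student_src, dim=-1) @ F.normalize(student_tgt, dim=-1).T / temperature
    with torch.no_grad():
        t = F.normalize(teacher_src, dim=-1) @ F.normalize(teacher_src, dim=-1).T / temperature
        target = t.softmax(dim=-1)
    return F.kl_div(s.log_softmax(dim=-1), target, reduction="batchmean")

# Toy usage: random vectors stand in for encoder outputs of translation pairs.
b, d = 8, 256
loss = soft_alignment_loss(torch.randn(b, d), torch.randn(b, d), torch.randn(b, d))
loss_value = loss.item()
```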
- VECO 2.0: Cross-lingual Language Model Pre-training with Multi-granularity Contrastive Learning [56.47303426167584]
We propose a cross-lingual pre-trained model, VECO 2.0, based on contrastive learning with multi-granularity alignments.
Specifically, sequence-to-sequence alignment is induced to maximize the similarity of parallel pairs and minimize that of non-parallel pairs.
Token-to-token alignment is integrated to bridge the gap between synonymous tokens, mined via a thesaurus dictionary, and the other unpaired tokens in a bilingual instance.
arXiv Detail & Related papers (2023-04-17T12:23:41Z)
- Bridging Cross-Lingual Gaps During Leveraging the Multilingual Sequence-to-Sequence Pretraining for Text Generation [80.16548523140025]
We extend the vanilla pretrain-finetune pipeline with an extra code-switching restoration task to bridge the gap between the pretraining and fine-tuning stages.
Our approach can narrow the cross-lingual sentence representation distance and improve low-frequency word translation at trivial computational cost.
arXiv Detail & Related papers (2022-04-16T16:08:38Z)
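A hedged sketch of a code-switching corruption step of the kind mentioned in the entry above: tokens are randomly swapped for bilingual-dictionary translations, and a model would be trained to restore the original sequence. The lexicon, swap rate, and whitespace tokenization are illustrative assumptions.

```python
import random

def code_switch(tokens, lexicon, swap_prob=0.3, seed=0):
    """Replace some tokens with translations from a bilingual lexicon; the training
    objective would be to restore the original sequence from the corrupted one."""
    rng = random.Random(seed)
    return [lexicon[tok] if tok in lexicon and rng.random() < swap_prob else tok
            for tok in tokens]

# Toy usage with a tiny English-German lexicon as a stand-in for a real dictionary.
lexicon = {"house": "Haus", "cat": "Katze", "green": "grün"}
corrupted = code_switch("the green cat sat near the house".split(), lexicon, swap_prob=0.5)
# Training pair: (corrupted, original tokens).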
- Bridging the Gap between Language Models and Cross-Lingual Sequence Labeling [101.74165219364264]
Large-scale cross-lingual pre-trained language models (xPLMs) have shown effectiveness in cross-lingual sequence labeling (xSL) tasks.
Despite this success, we observe empirically that there is a training-objective gap between the pre-training and fine-tuning stages.
In this paper, we first design a pre-training task tailored for xSL named Cross-lingual Language Informative Span Masking (CLISM) to eliminate the objective gap.
Second, we present ContrAstive-Consistency Regularization (CACR), which utilizes contrastive learning to encourage consistency between the representations of input parallel sequences.
arXiv Detail & Related papers (2022-04-11T15:55:20Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
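The task-based alignment measure in the entry above can be sketched as nearest-neighbour bitext retrieval over sentence embeddings; the cosine-similarity search and top-1 accuracy below are a simplified stand-in for the paper's actual retrieval setup.

```python
import numpy as np

def bitext_retrieval_accuracy(src_emb, tgt_emb):
    """Fraction of source sentences whose nearest target (by cosine similarity) is the
    gold-aligned one, assuming row i of src_emb is parallel to row i of tgt_emb."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    nearest = (src @ tgt.T).argmax(axis=1)
    return float((nearest == np.arange(len(src))).mean())

# Toy usage: noisy copies of the same vectors stand in for embeddings of a bitext.
rng = np.random.default_rng(0)
src = rng.standard_normal((100, 64))
print(bitext_retrieval_accuracy(src, src + 0.1 * rng.standard_normal(src.shape)))
```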
- On Learning Universal Representations Across Languages [37.555675157198145]
We extend existing approaches to learn sentence-level representations and show their effectiveness on cross-lingual understanding and generation.
Specifically, we propose a Hierarchical Contrastive Learning (HiCTL) method to learn universal representations for parallel sentences distributed in one or multiple languages.
We conduct evaluations on two challenging cross-lingual settings: the XTREME benchmark and machine translation.
arXiv Detail & Related papers (2020-07-31T10:58:39Z)
- Robust Cross-lingual Embeddings from Parallel Sentences [65.85468628136927]
We propose a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word representations.
Our approach significantly improves cross-lingual sentence retrieval performance over all other approaches.
It also achieves parity with a deep RNN method on a zero-shot cross-lingual document classification task.
arXiv Detail & Related papers (2019-12-28T16:18:33Z)