Examining Cross-lingual Contextual Embeddings with Orthogonal Structural
Probes
- URL: http://arxiv.org/abs/2109.04921v1
- Date: Fri, 10 Sep 2021 15:03:11 GMT
- Title: Examining Cross-lingual Contextual Embeddings with Orthogonal Structural
Probes
- Authors: Tomasz Limisiewicz and David Mare\v{c}ek
- Abstract summary: A novel Orthogonal Structural Probe (Limisiewicz and Mare\v{c}ek, 2021) lets us test, for specific linguistic features, whether multilingual embeddings can be aligned in a shared cross-lingual space.
We evaluate syntactic (UD) and lexical (WordNet) structural information encoded in mBERT's contextual representations for nine diverse languages.
We successfully apply our findings to zero-shot and few-shot cross-lingual parsing.
- Score: 0.2538209532048867
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State-of-the-art contextual embeddings are obtained from large language
models available only for a few languages. For others, we need to learn
representations using a multilingual model. There is an ongoing debate on
whether multilingual embeddings can be aligned in a space shared across many
languages. The novel Orthogonal Structural Probe (Limisiewicz and Mare\v{c}ek,
2021) allows us to answer this question for specific linguistic features and
learn a projection based only on mono-lingual annotated datasets. We evaluate
syntactic (UD) and lexical (WordNet) structural information encoded in mBERT's
contextual representations for nine diverse languages. We observe that for
languages closely related to English, no transformation is needed. The
evaluated information is encoded in a shared cross-lingual embedding space. For
other languages, it is beneficial to apply orthogonal transformation learned
separately for each language. We successfully apply our findings to zero-shot
and few-shot cross-lingual parsing.
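
To make the setup concrete, below is a minimal PyTorch sketch of a distance-style structural probe whose transformation is constrained to be orthogonal. The class name, dimensions, diagonal scaling, and toy data are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class OrthogonalStructuralProbe(nn.Module):
    """Sketch: predict syntactic tree distances from contextual embeddings.

    An orthogonal matrix rotates a language's embeddings into the shared
    space; a learned diagonal scaling then weights the probed dimensions.
    """
    def __init__(self, dim: int = 768):
        super().__init__()
        linear = nn.Linear(dim, dim, bias=False)
        # Keep the weight matrix orthogonal throughout training.
        self.rotation = nn.utils.parametrizations.orthogonal(linear)
        self.scaling = nn.Parameter(torch.ones(dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (seq_len, dim) contextual embeddings for one sentence.
        z = self.rotation(h) * self.scaling
        # Pairwise squared L2 distances approximate tree distances.
        diff = z.unsqueeze(0) - z.unsqueeze(1)
        return (diff ** 2).sum(-1)

# Toy usage: fit predicted distances to gold tree distances with L1 loss.
probe = OrthogonalStructuralProbe(dim=768)
h = torch.randn(12, 768)                      # stand-in for mBERT outputs
gold = torch.randint(1, 6, (12, 12)).float()  # stand-in tree distances
gold.fill_diagonal_(0)
loss = (probe(h) - gold).abs().mean()
loss.backward()
```

Because the map is orthogonal it can only rotate the space rather than reshape it, which is what makes "no transformation needed" versus "per-language rotation" a meaningful comparison.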
Related papers
- The Geometry of Multilingual Language Model Representations [25.880639246639323]
We assess how multilingual language models maintain a shared multilingual representation space while still encoding language-sensitive information in each language.
The subspace means differ along language-sensitive axes that are relatively stable throughout middle layers, and these axes encode information such as token vocabularies.
We visualize representations projected onto language-sensitive and language-neutral axes, identifying language family and part-of-speech clusters, along with spirals, toruses, and curves representing token position information.
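
A rough numpy sketch of the mean-based decomposition described above: language-sensitive axes come from the spread of per-language means, and the language-neutral component is what remains after projecting the language-sensitive part out. The toy data and names are assumptions.

```python
import numpy as np

# Toy stand-ins: token representations grouped by language.
rng = np.random.default_rng(0)
reps = {lang: rng.normal(size=(200, 768)) for lang in ["en", "de", "ar"]}

# Language-sensitive axes: directions along which per-language means differ.
means = np.stack([r.mean(axis=0) for r in reps.values()])
centered = means - means.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
lang_axes = vt  # orthonormal rows spanning the language-sensitive subspace

def project(x, axes):
    """Project representations onto the span of the given axes."""
    return (x @ axes.T) @ axes

# Language-neutral part = representation minus its language-sensitive part.
neutral = {l: r - project(r, lang_axes) for l, r in reps.items()}
```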
arXiv Detail & Related papers (2022-05-22T23:58:24Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
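
A hypothetical sketch of that derivation step: for title-aligned articles, the target-language lead paragraph serves as the multi-sentence summary of the source-language body. The `articles` structure and its field names are invented for illustration.

```python
# Hypothetical title-aligned Wikipedia articles; fields are assumptions.
articles = {
    "Prague": {
        "en": {"lead": "Prague is the capital ...", "body": "History ..."},
        "de": {"lead": "Prag ist die Hauptstadt ...", "body": "Geschichte ..."},
    },
}

def make_pairs(articles, src="de", tgt="en"):
    """Pair a source-language article body with the target-language lead,
    treating the lead paragraph as the summary."""
    pairs = []
    for title, langs in articles.items():
        if src in langs and tgt in langs:
            pairs.append({"document": langs[src]["body"],
                          "summary": langs[tgt]["lead"],
                          "title": title})
    return pairs

print(make_pairs(articles))
```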
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- Establishing Interlingua in Multilingual Language Models [0.0]
We show that different languages do converge to a shared space in large multilingual language models.
We extend our analysis to 28 diverse languages and find that the interlingual space exhibits a particular structure similar to the linguistic relatedness of languages.
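
One way to make "structure similar to linguistic relatedness" measurable is to correlate pairwise distances between per-language representation centroids with a genealogical distance coding. The sketch below uses random centroids and an invented relatedness matrix purely for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

# Stand-in centroids of each language's representations (assumed inputs).
rng = np.random.default_rng(1)
langs = ["en", "de", "nl", "fi", "et"]
X = np.stack([rng.normal(size=768) for _ in langs])

# Pairwise cosine distances in representation space.
rep_dist = pdist(X, metric="cosine")
# Toy genealogical distances in pdist's pair order (1 = same family branch).
gen_dist = np.array([1, 1, 3, 3, 1, 3, 3, 3, 3, 1])

rho, _ = spearmanr(rep_dist, gen_dist)
print(f"representation/relatedness correlation: {rho:.2f}")
```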
arXiv Detail & Related papers (2021-09-02T20:53:14Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
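
The grouping step can be pictured as clustering per-language representation vectors, as in this toy scikit-learn sketch; the vectors, cluster count, and language list are placeholders rather than the paper's setup.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in language representations, e.g. mean-pooled encoder states.
rng = np.random.default_rng(2)
langs = ["en", "de", "fr", "zh", "ja", "ko", "ru", "pl"]
lang_vecs = rng.normal(size=(len(langs), 128))

# Group languages into "representation sprachbunds" by clustering.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(lang_vecs)
for cluster in range(3):
    members = [l for l, c in zip(langs, km.labels_) if c == cluster]
    print(f"sprachbund {cluster}: {members}")
```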
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- Language Embeddings for Typology and Cross-lingual Transfer Learning [11.647285195114256]
We generate dense embeddings for 29 languages using a denoising autoencoder.
We evaluate the embeddings using the World Atlas of Language Structures.
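
A compact PyTorch sketch of a denoising autoencoder that maps one feature vector per language to a dense embedding; the input features, dimensions, and noise level are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Sketch: learn a dense language vector by reconstructing a corrupted
    per-language feature vector."""
    def __init__(self, in_dim: int = 512, emb_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, emb_dim))
        self.decoder = nn.Sequential(nn.Linear(emb_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        noisy = x + 0.1 * torch.randn_like(x)  # corrupt the input
        z = self.encoder(noisy)                # dense language embedding
        return self.decoder(z), z

model = DenoisingAutoencoder()
x = torch.randn(29, 512)  # one toy feature vector per language
recon, emb = model(x)
loss = nn.functional.mse_loss(recon, x)
loss.backward()
```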
arXiv Detail & Related papers (2021-06-03T19:00:02Z)
- Revisiting Language Encoding in Learning Multilingual Representations [70.01772581545103]
We propose a new approach called Cross-lingual Language Projection (XLP) to replace language embedding.
XLP projects the word embeddings into language-specific semantic space, and then the projected embeddings will be fed into the Transformer model.
Experiments show that XLP can freely and significantly boost the model performance on extensive multilingual benchmark datasets.
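
A minimal sketch of the XLP idea as summarised above: instead of adding a language embedding, apply a language-specific linear projection to the word embeddings before the Transformer. Shapes and the identity initialisation are assumptions.

```python
import torch
import torch.nn as nn

class XLPEmbedding(nn.Module):
    """Sketch: project word embeddings into a language-specific space."""
    def __init__(self, vocab: int = 1000, dim: int = 256, n_langs: int = 4):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, dim)
        # One projection matrix per language, initialised to identity.
        self.proj = nn.Parameter(torch.stack(
            [torch.eye(dim) for _ in range(n_langs)]))

    def forward(self, token_ids, lang_id):
        e = self.word_emb(token_ids)   # (batch, seq, dim)
        return e @ self.proj[lang_id]  # language-specific semantic space

emb = XLPEmbedding()
x = emb(torch.randint(0, 1000, (2, 8)), lang_id=1)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True), 2)
out = encoder(x)  # projected embeddings are fed into the Transformer
```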
arXiv Detail & Related papers (2021-02-16T18:47:10Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
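
A toy PyTorch sketch of that simultaneous translate-and-reconstruct setup: a single LSTM encoder feeds two decoders, and the encoder outputs serve as contextualised cross-lingual embeddings. The teacher-forced decoding and shared vocabulary are simplifications.

```python
import torch
import torch.nn as nn

class TranslateReconstruct(nn.Module):
    """Sketch: one encoder, two decoders (translation + reconstruction)."""
    def __init__(self, vocab: int = 1000, dim: int = 128):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.dec_trans = nn.LSTM(dim, dim, batch_first=True)
        self.dec_recon = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, src, tgt):
        ctx, state = self.encoder(self.emb(src))  # ctx: cross-lingual embs
        trans, _ = self.dec_trans(self.emb(tgt), state)
        recon, _ = self.dec_recon(self.emb(src), state)
        return self.out(trans), self.out(recon), ctx

model = TranslateReconstruct()
src = torch.randint(0, 1000, (2, 10))
tgt = torch.randint(0, 1000, (2, 9))
trans_logits, recon_logits, ctx = model(src, tgt)
ce = nn.CrossEntropyLoss()
loss = (ce(trans_logits.reshape(-1, 1000), tgt.reshape(-1)) +
        ce(recon_logits.reshape(-1, 1000), src.reshape(-1)))
loss.backward()
```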
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Finding Universal Grammatical Relations in Multilingual BERT [47.74015366712623]
We show that subspaces of mBERT representations recover syntactic tree distances in languages other than English.
We present an unsupervised analysis method that provides evidence mBERT learns representations of syntactic dependency labels.
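
One hedged way to approximate such an analysis: cluster head-minus-dependent difference vectors in probe space and inspect how the clusters align with gold dependency labels. Everything below is stand-in data, not the paper's pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-ins: probe-space vectors for dependents and their heads, plus gold
# dependency labels for each pair (assumed inputs).
rng = np.random.default_rng(3)
dep_vecs = rng.normal(size=(500, 64))
head_vecs = rng.normal(size=(500, 64))
gold_labels = rng.choice(["nsubj", "obj", "amod", "case"], size=500)

# Cluster difference vectors and check cluster/label correspondence.
diff = head_vecs - dep_vecs
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(diff)
for c in range(4):
    labels, counts = np.unique(gold_labels[clusters == c], return_counts=True)
    print(f"cluster {c}: {dict(zip(labels, counts))}")
```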
arXiv Detail & Related papers (2020-05-09T20:46:02Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
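
A small sketch of singular vector CCA: reduce each view with an SVD, run CCA on the reduced views, and average the canonical correlations. The reduction size and the random toy views are assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def svcca(x, y, keep: int = 10):
    """Reduce each view with SVD, then correlate aligned directions."""
    def top_svd(m, k):
        m = m - m.mean(axis=0)
        u, s, _ = np.linalg.svd(m, full_matrices=False)
        return u[:, :k] * s[:k]
    xr, yr = top_svd(x, keep), top_svd(y, keep)
    xc, yc = CCA(n_components=keep).fit(xr, yr).transform(xr, yr)
    corrs = [np.corrcoef(xc[:, i], yc[:, i])[0, 1] for i in range(keep)]
    return float(np.mean(corrs))

# Toy views, e.g. typology-derived vs. NMT-derived language vectors.
rng = np.random.default_rng(4)
x, y = rng.normal(size=(100, 50)), rng.normal(size=(100, 50))
print(f"mean SVCCA correlation: {svcca(x, y):.2f}")
```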
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
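
Evaluation against such a resource typically means correlating model similarity scores with the human ratings; a hedged sketch with invented concept pairs and random vectors.

```python
import numpy as np
from scipy.stats import spearmanr

# Toy concept pairs with invented human similarity ratings.
pairs = [("cup", "mug", 5.7), ("car", "bicycle", 3.1), ("sun", "idea", 0.4)]

rng = np.random.default_rng(5)
vecs = {w: rng.normal(size=300) for a, b, _ in pairs for w in (a, b)}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

model_scores = [cosine(vecs[a], vecs[b]) for a, b, _ in pairs]
human_scores = [r for _, _, r in pairs]
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman correlation with human ratings: {rho:.2f}")
```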
arXiv Detail & Related papers (2020-03-10T17:17:01Z)