Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual
Lexical Semantic Similarity
- URL: http://arxiv.org/abs/2003.04866v1
- Date: Tue, 10 Mar 2020 17:17:01 GMT
- Title: Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual
Lexical Semantic Similarity
- Authors: Ivan Vuli\'c, Simon Baker, Edoardo Maria Ponti, Ulla Petti, Ira
Leviant, Kelly Wing, Olga Majewska, Eden Bar, Matt Malone, Thierry Poibeau,
Roi Reichart, Anna Korhonen
- Abstract summary: Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
- Score: 67.36239720463657
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Multi-SimLex, a large-scale lexical resource and evaluation
benchmark covering datasets for 12 typologically diverse languages, including
major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as
less-resourced ones (e.g., Welsh, Kiswahili). Each language dataset is
annotated for the lexical relation of semantic similarity and contains 1,888
semantically aligned concept pairs, providing a representative coverage of word
classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity
intervals, lexical fields, and concreteness levels. Additionally, owing to the
alignment of concepts across languages, we provide a suite of 66 cross-lingual
semantic similarity datasets, one for each pair of the 12 languages. Due to its
extensive size and language coverage,
Multi-SimLex provides entirely novel opportunities for experimental evaluation
and analysis. On its monolingual and cross-lingual benchmarks, we evaluate and
analyze a wide array of recent state-of-the-art monolingual and cross-lingual
representation models, including static and contextualized word embeddings
(such as fastText, M-BERT and XLM), externally informed lexical
representations, as well as fully unsupervised and (weakly) supervised
cross-lingual word embeddings. We also present a step-by-step protocol for
creating consistent, Multi-SimLex-style resources for additional languages. We
make these contributions available via a website: the public release of the
Multi-SimLex datasets, their creation protocol, strong baseline results, and
in-depth analyses that can help guide future developments in multilingual
lexical semantics and representation learning. The website is also intended to
encourage a community effort to expand Multi-SimLex to many more languages.
Such a large-scale semantic resource could inspire significant further advances
in NLP across languages.
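As a concrete illustration of how a Multi-SimLex-style benchmark is used, the
sketch below scores word pairs with static fastText vectors and compares the
model scores to human ratings via Spearman's rho, the standard metric for this
family of intrinsic evaluations. This is a minimal sketch, not the authors'
released evaluation code; the file names and the assumed three-column TSV
layout (word1, word2, human rating) are illustrative placeholders.

```python
# Minimal sketch of a SimLex-style intrinsic evaluation: cosine similarity of
# static word vectors vs. human similarity ratings, compared with Spearman's
# rho. File names and the three-column TSV layout are assumptions.
import csv

import numpy as np
from scipy.stats import spearmanr


def load_fasttext_vectors(path):
    """Load word vectors from a fastText .vec text file into a dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the "num_words dimension" header line
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors


def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def evaluate(pairs_path, vectors):
    """Spearman correlation between model and human similarity scores.

    Pairs containing an out-of-vocabulary word are skipped, a common
    convention in SimLex-style evaluations.
    """
    gold, predicted = [], []
    with open(pairs_path, encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)  # skip the header row
        for word1, word2, rating in reader:
            if word1 in vectors and word2 in vectors:
                gold.append(float(rating))
                predicted.append(cosine(vectors[word1], vectors[word2]))
    rho, _ = spearmanr(gold, predicted)
    return rho, len(gold)


if __name__ == "__main__":
    # Hypothetical file names, e.g. Welsh fastText vectors and a Welsh dataset.
    vectors = load_fasttext_vectors("cc.cy.300.vec")
    rho, n_scored = evaluate("multi_simlex_cym.tsv", vectors)
    print(f"Spearman rho = {rho:.3f} over {n_scored} scored pairs")
```

The 66 cross-lingual datasets can be scored the same way, except that the two
words of each pair come from different languages, so the model scores must be
computed in a shared cross-lingual embedding space.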
Related papers
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z) - Tokenization Impacts Multilingual Language Modeling: Assessing
Vocabulary Allocation and Overlap Across Languages [3.716965622352967]
We propose new criteria to evaluate the quality of lexical representation and vocabulary overlap observed in sub-word tokenizers.
Our findings show that vocabulary overlap across languages can actually be detrimental to certain downstream tasks.
arXiv Detail & Related papers (2023-05-26T18:06:49Z) - Massively Multilingual Lexical Specialization of Multilingual
Transformers [18.766379322798837]
We show that massively multilingual lexical specialization brings substantial gains in two standard cross-lingual lexical tasks.
We observe gains for languages unseen in specialization, indicating that multilingual lexical specialization enables generalization to languages with no lexical constraints.
arXiv Detail & Related papers (2022-08-01T17:47:03Z) - The Geometry of Multilingual Language Model Representations [25.880639246639323]
We assess how multilingual language models maintain a shared multilingual representation space while still encoding language-sensitive information in each language.
The subspace means differ along language-sensitive axes that are relatively stable throughout middle layers, and these axes encode information such as token vocabularies.
We visualize representations projected onto language-sensitive and language-neutral axes, identifying language family and part-of-speech clusters, along with spirals, toruses, and curves representing token position information.
arXiv Detail & Related papers (2022-05-22T23:58:24Z) - Multi-level Contrastive Learning for Cross-lingual Spoken Language
Understanding [90.87454350016121]
We develop novel code-switching schemes to generate hard negative examples for contrastive learning at all levels.
We develop a label-aware joint model to leverage label semantics for cross-lingual knowledge transfer.
arXiv Detail & Related papers (2022-05-07T13:44:28Z) - Examining Cross-lingual Contextual Embeddings with Orthogonal Structural
Probes [0.2538209532048867]
A novel Orthogonal Structural Probe (Limisiewicz and Mareček, 2021) allows us to probe for specific linguistic features.
We evaluate syntactic (UD) and lexical (WordNet) structural information encoded in mBERT's contextual representations for nine diverse languages.
We successfully apply our findings to zero-shot and few-shot cross-lingual parsing.
arXiv Detail & Related papers (2021-09-10T15:03:11Z) - Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z) - Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z) - Bridging Linguistic Typology and Multilingual Machine Translation with
Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.