Massively Multilingual Lexical Specialization of Multilingual
Transformers
- URL: http://arxiv.org/abs/2208.01018v3
- Date: Mon, 29 May 2023 13:59:07 GMT
- Title: Massively Multilingual Lexical Specialization of Multilingual
Transformers
- Authors: Tommaso Green and Simone Paolo Ponzetto and Goran Glavaš
- Abstract summary: We show that massively multilingual lexical specialization brings substantial gains in two standard cross-lingual lexical tasks.
We observe gains for languages unseen in specialization, indicating that multilingual lexical specialization enables generalization to languages with no lexical constraints.
- Score: 18.766379322798837
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: While pretrained language models (PLMs) primarily serve as general-purpose
text encoders that can be fine-tuned for a wide variety of downstream tasks,
recent work has shown that they can also be rewired to produce high-quality
word representations (i.e., static word embeddings) and yield good performance
in type-level lexical tasks. While existing work primarily focused on the
lexical specialization of monolingual PLMs with immense quantities of
monolingual constraints, in this work we expose massively multilingual
transformers (MMTs, e.g., mBERT or XLM-R) to multilingual lexical knowledge at
scale, leveraging BabelNet as the readily available rich source of multilingual
and cross-lingual type-level lexical knowledge. Concretely, we use BabelNet's
multilingual synsets to create synonym pairs (or synonym-gloss pairs) across 50
languages and then subject the MMTs (mBERT and XLM-R) to a lexical
specialization procedure guided by a contrastive objective. We show that such
massively multilingual lexical specialization brings substantial gains in two
standard cross-lingual lexical tasks, bilingual lexicon induction and
cross-lingual word similarity, as well as in cross-lingual sentence retrieval.
Crucially, we observe gains for languages unseen in specialization, indicating
that multilingual lexical specialization enables generalization to languages
with no lexical constraints. In a series of subsequent controlled experiments,
we show that the number of specialization constraints plays a much greater role
than the set of languages from which they originate.
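To make the described procedure concrete, below is a minimal sketch of contrastive lexical specialization on cross-lingual synonym pairs. The model name, mean-pooling over subwords (including special tokens), the temperature, the optimizer settings, and the toy word pairs are all illustrative assumptions; the paper's actual setup draws synonym (or synonym-gloss) pairs from BabelNet synsets across 50 languages and may differ in these details.

```python
# Hedged sketch: contrastive lexical specialization of an MMT on synonym pairs.
# All hyperparameters and the toy pairs are assumptions, not the paper's config.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "xlm-roberta-base"   # or "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
temperature = 0.05                # assumed value

def encode(words):
    """Encode words in isolation and mean-pool their subword representations.

    Pooling here includes the special tokens for simplicity; the paper's exact
    pooling strategy may differ.
    """
    batch = tokenizer(words, padding=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state              # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)            # (B, H)

def contrastive_step(pairs):
    """One InfoNCE step: paired synonyms are positives, the rest of the batch negatives."""
    left, right = zip(*pairs)
    a = F.normalize(encode(list(left)), dim=-1)
    b = F.normalize(encode(list(right)), dim=-1)
    logits = a @ b.T / temperature                         # (B, B) cosine similarities
    labels = torch.arange(len(pairs))
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy cross-lingual synonym pairs in the spirit of BabelNet synsets.
loss = contrastive_step([("dog", "perro"), ("house", "Haus"), ("water", "acqua")])
```

After specialization, static word embeddings can be read off by encoding vocabulary items in isolation with the same pooling, and bilingual lexicon induction then reduces to cosine nearest-neighbor retrieval between the source- and target-language word vectors.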
Related papers
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, effectively being crosslingual?
This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- How Vocabulary Sharing Facilitates Multilingualism in LLaMA? [19.136382859468693]
Large Language Models (LLMs) often show strong performance on English tasks, while exhibiting limitations on other languages.
This study endeavors to examine the multilingual capability of LLMs from the vocabulary sharing perspective.
arXiv Detail & Related papers (2023-11-15T16:13:14Z)
- Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages [3.716965622352967]
We propose new criteria to evaluate the quality of lexical representation and vocabulary overlap observed in sub-word tokenizers.
Our findings show that vocabulary overlap across languages can actually be detrimental to certain downstream tasks.
arXiv Detail & Related papers (2023-05-26T18:06:49Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual sentence encoders for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z)
- Cross-Lingual Ability of Multilingual Masked Language Models: A Study of Language Structure [54.01613740115601]
We study three language properties: constituent order, composition and word co-occurrence.
Our main conclusion is that the contribution of constituent order and word co-occurrence is limited, while composition is more crucial to the success of cross-lingual transfer.
arXiv Detail & Related papers (2022-03-16T07:09:35Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and call each group a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
This effectively avoids the degeneration of predicting masked words conditioned only on the context of their own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.