XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked
Language Models
- URL: http://arxiv.org/abs/2301.10472v2
- Date: Fri, 13 Oct 2023 18:36:06 GMT
- Title: XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked
Language Models
- Authors: Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan
Ghazvininejad, Luke Zettlemoyer, Madian Khabsa
- Abstract summary: We introduce a new approach for scaling to very large multilingual vocabularies by de-emphasizing token sharing between languages with little lexical overlap.
We train XLM-V, a multilingual language model with a one million token vocabulary.
XLM-V is particularly effective on low-resource language tasks and outperforms XLM-R by 11.2% and 5.8% absolute on MasakhaNER and Americas NLI, respectively.
- Score: 100.29953199404905
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large multilingual language models typically rely on a single vocabulary
shared across 100+ languages. As these models have increased in parameter count
and depth, vocabulary size has remained largely unchanged. This
"vocabulary bottleneck" limits the representational capabilities of
multilingual models like XLM-R. In this paper, we introduce a new approach for
scaling to very large multilingual vocabularies by de-emphasizing token sharing
between languages with little lexical overlap and assigning vocabulary capacity
to achieve sufficient coverage for each individual language. Tokenizations
using our vocabulary are typically more semantically meaningful and shorter
compared to XLM-R. Leveraging this improved vocabulary, we train XLM-V, a
multilingual language model with a one million token vocabulary. XLM-V
outperforms XLM-R on every task we tested, ranging from natural language
inference (XNLI) and question answering (MLQA, XQuAD, TyDiQA) to named entity
recognition (WikiAnn). XLM-V is particularly effective on low-resource language
tasks and outperforms XLM-R by 11.2% and 5.8% absolute on MasakhaNER and
Americas NLI, respectively.
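As a rough, concrete check of the claim that XLM-V's vocabulary yields shorter tokenizations, the sketch below compares tokenized sequence lengths under the two tokenizers. It assumes the public Hugging Face checkpoints `xlm-roberta-base` and `facebook/xlm-v-base` are available on the Hub; the sample sentences are arbitrary and not taken from the paper.

```python
# Minimal sketch: compare tokenized sequence lengths under the XLM-R (~250k-token)
# and XLM-V (~1M-token) vocabularies. Assumes the public Hugging Face checkpoints
# named below are available; the sample sentences are arbitrary placeholders.
from transformers import AutoTokenizer

xlmr = AutoTokenizer.from_pretrained("xlm-roberta-base")
xlmv = AutoTokenizer.from_pretrained("facebook/xlm-v-base")

samples = {
    "en": "Large multilingual models usually share one vocabulary across 100+ languages.",
    "sw": "Habari ya asubuhi, rafiki yangu.",
    "am": "ሰላም፣ እንዴት ነህ?",
}

for lang, text in samples.items():
    n_r = len(xlmr.tokenize(text))
    n_v = len(xlmv.tokenize(text))
    print(f"{lang}: XLM-R tokens = {n_r}, XLM-V tokens = {n_v}")
```

Fewer tokens per sentence, especially for low-resource languages, is what the abstract means by shorter and more semantically meaningful tokenizations.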
Related papers
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, effectively being crosslingual?
This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics [74.99898531299148]
This research examines vocabulary trimming (VT), inspired by restricting embedding entries to the language of interest, as a way to improve time and memory efficiency.
We apply two language heuristics to trim the full vocabulary - Unicode-based script filtering and corpus-based selection - to different language families and sizes.
It is found that VT reduces the memory usage of small models by nearly 50% and has an upper bound of 25% improvement in generation speed.
arXiv Detail & Related papers (2023-11-16T09:35:50Z)
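To make the Unicode-based script filtering mentioned in the entry above concrete, here is a toy sketch (an illustration, not the paper's implementation) that keeps only vocabulary entries whose letters belong to an allowed set of scripts, inferring each character's script from its Unicode name via the standard library.

```python
# Toy sketch of Unicode-based script filtering for vocabulary trimming.
# Simplification, not the paper's code: a token is kept if every alphabetic
# character belongs to an allowed script, where the script label is taken
# from the first word of the character's Unicode name.
import unicodedata

def char_script(ch: str) -> str:
    """Return a coarse script label such as 'LATIN', 'CYRILLIC', or 'CJK'."""
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # unnamed characters (some control characters, etc.)
        return "UNKNOWN"

def trim_vocab(vocab, allowed_scripts):
    """Keep tokens whose letters all come from the allowed scripts."""
    kept = []
    for token in vocab:
        letters = [ch for ch in token if ch.isalpha()]
        if all(char_script(ch) in allowed_scripts for ch in letters):
            kept.append(token)
    return kept

vocab = ["hello", "▁world", "привет", "世界", "123", "مرحبا"]
print(trim_vocab(vocab, allowed_scripts={"LATIN"}))
# Keeps the Latin-only tokens (and purely non-alphabetic ones like "123").
```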
- How Vocabulary Sharing Facilitates Multilingualism in LLaMA? [19.136382859468693]
Large Language Models (LLMs) often show strong performance on English tasks, while exhibiting limitations on other languages.
This study endeavors to examine the multilingual capability of LLMs from the vocabulary sharing perspective.
arXiv Detail & Related papers (2023-11-15T16:13:14Z)
- Translate to Disambiguate: Zero-shot Multilingual Word Sense Disambiguation with Pretrained Language Models [67.19567060894563]
Pretrained Language Models (PLMs) learn rich cross-lingual knowledge and can be finetuned to perform well on diverse tasks.
We present a new study investigating how well PLMs capture cross-lingual word sense with Contextual Word-Level Translation (C-WLT).
We find that as the model size increases, PLMs encode more cross-lingual word sense knowledge and better use context to improve WLT performance.
arXiv Detail & Related papers (2023-04-26T19:55:52Z)
- XLM-K: Improving Cross-Lingual Language Model Pre-Training with Multilingual Knowledge [31.765178013933134]
Cross-lingual pre-training has achieved great successes using monolingual and bilingual plain text corpora.
We propose XLM-K, a cross-lingual language model incorporating multilingual knowledge in pre-training.
arXiv Detail & Related papers (2021-09-26T11:46:20Z)
- Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training [59.571632468137075]
We find that many languages are under-represented in recent cross-lingual language models due to the limited vocabulary capacity.
We propose an algorithm, VoCap, to determine the desired vocabulary capacity for each language.
Because enlarging the vocabulary slows down pre-training, we also propose k-NN-based target sampling to accelerate the expensive softmax.
arXiv Detail & Related papers (2021-09-15T14:04:16Z)
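The idea in the entry above, giving each language its own share of a large vocabulary budget, can be illustrated with a deliberately simplified stand-in: split a fixed total budget across languages in proportion to temperature-smoothed corpus sizes. This is a hypothetical sketch, not the paper's VoCap algorithm, and the corpus sizes below are invented.

```python
# Hypothetical sketch of per-language vocabulary capacity allocation.
# Not the paper's VoCap algorithm: the budget is simply split in proportion
# to corpus sizes raised to the power alpha, which upweights low-resource
# languages whenever alpha < 1.
def allocate_capacity(corpus_sizes, total_budget=1_000_000, alpha=0.5):
    weights = {lang: size ** alpha for lang, size in corpus_sizes.items()}
    total = sum(weights.values())
    return {lang: int(total_budget * w / total) for lang, w in weights.items()}

# Invented sentence counts, for illustration only.
corpus_sizes = {"en": 300_000_000, "sw": 2_000_000, "am": 500_000}
print(allocate_capacity(corpus_sizes))
```

With alpha = 1 the split would mirror the raw data imbalance; lowering alpha shifts capacity toward low-resource languages, which is the effect a capacity-allocation scheme is after.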
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Improving Multilingual Models with Language-Clustered Vocabularies [8.587129426070979]
We introduce a novel procedure for multilingual vocabulary generation that combines the separately trained vocabularies of several automatically derived language clusters.
Our experiments show improvements across languages on key multilingual benchmark tasks.
arXiv Detail & Related papers (2020-10-24T04:49:15Z)
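Both the entry above and XLM-V itself build vocabularies from clusters of lexically similar languages rather than one undifferentiated pool. The toy sketch below (an illustration, not either paper's actual procedure) clusters languages greedily by the Jaccard overlap of per-language token sets; in a real pipeline those sets would come from individually trained monolingual vocabularies.

```python
# Toy sketch: group languages by lexical overlap before building per-cluster
# vocabularies. Illustrative only; the tiny token sets and the greedy
# threshold clustering are stand-ins for the real pipelines.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0

def cluster_languages(vocabs: dict, threshold: float = 0.3):
    """Greedily assign each language to the first cluster it overlaps with."""
    clusters = []  # each item: (list of member languages, union of their tokens)
    for lang, vocab in vocabs.items():
        for members, union in clusters:
            if jaccard(vocab, union) >= threshold:
                members.append(lang)
                union |= vocab
                break
        else:
            clusters.append(([lang], set(vocab)))
    return [members for members, _ in clusters]

# Tiny invented token sets; real ones would hold tens of thousands of entries.
vocabs = {
    "en": {"▁the", "▁and", "ing", "▁of"},
    "de": {"▁und", "▁der", "ing", "▁the"},
    "sw": {"▁na", "▁ya", "▁kwa", "wa"},
    "zu": {"▁na", "▁ku", "▁kwa", "wa"},
}
print(cluster_languages(vocabs))  # e.g. [['en', 'de'], ['sw', 'zu']]
```

Each cluster would then get its own sub-vocabulary (sized, for example, by a capacity rule like the one sketched earlier), and the final multilingual vocabulary is the union of the cluster vocabularies.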
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information above and is not responsible for any consequences of its use.