An Efficient Multilingual Language Model Compression through Vocabulary
Trimming
- URL: http://arxiv.org/abs/2305.15020v3
- Date: Thu, 19 Oct 2023 10:30:02 GMT
- Title: An Efficient Multilingual Language Model Compression through Vocabulary
Trimming
- Authors: Asahi Ushio and Yi Zhou and Jose Camacho-Collados
- Abstract summary: Vocabulary-trimming (VT) is a method to reduce a multilingual LM vocabulary to a target language by deleting irrelevant tokens from its vocabulary.
In our experiments, we show that VT can retain the original performance of the multilingual LM, while being smaller in size.
This methodology can keep the best of both the monolingual and multilingual worlds by keeping a size as small as monolingual models.
- Score: 16.568276582466833
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multilingual language models (LMs) have become a powerful tool in NLP,
especially for non-English languages. Nevertheless, the parameters of
multilingual LMs remain large due to the large embedding matrix of the
vocabulary covering tokens in many different languages. In contrast, monolingual
LMs can be trained in a target language with the language-specific vocabulary
only, but this requires a large budget and availability of reliable corpora to
achieve a high-quality LM from scratch. In this paper, we propose
vocabulary-trimming (VT), a method to reduce a multilingual LM vocabulary to a
target language by deleting irrelevant tokens from its vocabulary. In theory,
VT can compress any existing multilingual LM to build monolingual LMs in any
language covered by the multilingual LM. In our experiments, we show that VT
can retain the original performance of the multilingual LM while being smaller
in size than the original model (in general, around 50% of the original
vocabulary size is enough). The evaluation is performed over four NLP tasks
(two generative and two classification tasks) among four widely used
multilingual LMs in seven languages. Finally, we show that this methodology can
keep the best of both the monolingual and multilingual worlds by keeping a
size as small as that of monolingual models, without the need to retrain them,
while even limiting potentially harmful social biases.
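For illustration, below is a minimal sketch of the vocabulary-trimming idea using Hugging Face Transformers: subword IDs are counted over a target-language corpus, roughly half of the vocabulary is kept (following the 50% figure above), and the embedding matrix and MLM head are sliced down to the kept rows. The model choice, corpus file and budget are illustrative assumptions, not the paper's exact recipe or released implementation.

```python
# A minimal sketch of vocabulary trimming (VT), assuming an XLM-R style masked
# LM from Hugging Face Transformers; the model, corpus file and 50% budget are
# illustrative, not the paper's exact recipe or released implementation.
from collections import Counter

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# 1) Count which subword IDs actually occur in a target-language corpus.
counts = Counter()
with open("target_language_corpus.txt", encoding="utf-8") as f:  # hypothetical file
    for line in f:
        counts.update(tokenizer(line.strip(), add_special_tokens=False)["input_ids"])

# 2) Keep the special tokens plus the most frequent IDs (here: half the vocabulary).
budget = len(tokenizer) // 2
keep = set(tokenizer.all_special_ids)
keep.update(i for i, _ in counts.most_common(budget - len(keep)))
keep_ids = sorted(keep)

# 3) Slice the input embeddings and the MLM head down to the kept rows.
old_in = model.get_input_embeddings().weight.data
hidden = old_in.size(1)
new_in = torch.nn.Embedding(len(keep_ids), hidden)
new_in.weight.data = old_in[keep_ids].clone()
model.set_input_embeddings(new_in)

old_out = model.get_output_embeddings()  # Linear(hidden -> old vocab), tied weights
new_out = torch.nn.Linear(hidden, len(keep_ids), bias=old_out.bias is not None)
new_out.weight.data = old_out.weight.data[keep_ids].clone()
if old_out.bias is not None:
    new_out.bias.data = old_out.bias.data[keep_ids].clone()
model.set_output_embeddings(new_out)
model.config.vocab_size = len(keep_ids)

# NOTE: the tokenizer must be rebuilt so its vocabulary maps onto the new row
# order (old ID i -> new position keep_ids.index(i)); that step is omitted here.
print(f"vocabulary: {old_in.size(0)} -> {len(keep_ids)} tokens")
```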
Related papers
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, effectively being crosslingual?
This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- Turning English-centric LLMs Into Polyglots: How Much Multilinguality Is Needed? [40.13166574854085]
We investigate the minimal amount of multilinguality required to elicit cross-lingual generalisation in English-centric large language models.
We find that multilingual instruction tuning with as few as two to three languages is both necessary and sufficient to elicit effective cross-lingual generalisation.
arXiv Detail & Related papers (2023-12-20T00:49:52Z)
- Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages [54.832599498774464]
We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach.
We build MWEs one language at a time by starting from the resource-rich source and sequentially adding each language in the chain until we reach the target.
We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (5M tokens) and 4 moderately low-resource (50M tokens) target languages.
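As a rough sketch of the chaining idea, the function below maps each monolingual space into the source language's space one link at a time, using anchor word pairs between neighbouring languages and an orthogonal Procrustes fit; the data structures and the Procrustes step are assumptions for illustration, not the authors' exact algorithm.

```python
# A rough sketch of chaining languages into one multilingual embedding space,
# assuming pre-trained monolingual word vectors and anchor (translation) pairs
# between neighbouring languages in the chain.
import numpy as np

def procrustes(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    """Orthogonal map W minimising ||src @ W - tgt||_F (Schoenemann's solution)."""
    u, _, vt = np.linalg.svd(tgt.T @ src)
    return (u @ vt).T

def chain_align(embeddings: dict, links: list) -> dict:
    """
    embeddings: {lang: {word: np.ndarray}} monolingual spaces.
    links:      [(prev_lang, next_lang, [(word_prev, word_next), ...]), ...]
                ordered along the chain, starting at the resource-rich source.
    Returns every language mapped into the source language's space.
    """
    source = links[0][0]
    shared = {source: embeddings[source]}          # the source defines the space
    for prev, nxt, pairs in links:                 # walk the chain towards the target
        src_mat = np.stack([embeddings[nxt][w_n] for _, w_n in pairs])
        tgt_mat = np.stack([shared[prev][w_p] for w_p, _ in pairs])
        w = procrustes(src_mat, tgt_mat)
        shared[nxt] = {word: vec @ w for word, vec in embeddings[nxt].items()}
    return shared
```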
arXiv Detail & Related papers (2023-11-21T09:59:29Z)
- How Vocabulary Sharing Facilitates Multilingualism in LLaMA? [19.136382859468693]
Large Language Models (LLMs) often show strong performance on English tasks, while exhibiting limitations on other languages.
This study endeavors to examine the multilingual capability of LLMs from the vocabulary sharing perspective.
arXiv Detail & Related papers (2023-11-15T16:13:14Z)
- PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
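The curriculum component can be pictured with a small sketch: the share of non-English examples in each batch is annealed from 30% towards 60% as training progresses. The linear schedule and the toy batch sampler below are illustrative assumptions, not PolyLM's actual implementation.

```python
# A small sketch of the curriculum: the non-English share of each batch grows
# from 30% to 60% over pre-training. The linear schedule and the toy batch
# sampler are illustrative assumptions, not PolyLM's actual implementation.
import random

def non_english_ratio(step: int, total_steps: int,
                      start: float = 0.30, end: float = 0.60) -> float:
    """Anneal the non-English share as training progresses."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac

def sample_batch(english_pool, non_english_pool, batch_size, step, total_steps):
    """Draw a mixed batch whose language composition follows the curriculum."""
    n_non_en = round(batch_size * non_english_ratio(step, total_steps))
    batch = random.sample(non_english_pool, n_non_en)
    batch += random.sample(english_pool, batch_size - n_non_en)
    random.shuffle(batch)
    return batch
```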
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
- Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing a few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
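A hedged sketch of the prompting idea: few-shot exemplars are assembled from several high-resource languages, each paired with an English translation, and the low-resource sentence is appended at the end. The exemplars, the template and the query_llm call are placeholders, not the authors' exact prompt or pipeline.

```python
# A hedged sketch of assembling a linguistically-diverse few-shot prompt for
# translation into English; exemplars, template and query_llm are placeholders.

# Synthetic exemplars drawn from several high-resource languages.
exemplars = [
    ("French",  "Le chat dort sur le canapé.",   "The cat is sleeping on the sofa."),
    ("Spanish", "Mañana iremos al mercado.",     "Tomorrow we will go to the market."),
    ("German",  "Das Buch liegt auf dem Tisch.", "The book is lying on the table."),
]

def build_prompt(source_sentence: str) -> str:
    """Few-shot prompt: diverse source languages, always translated into English."""
    lines = [f"{lang}: {src}\nEnglish: {en}\n" for lang, src, en in exemplars]
    # The test sentence's language does not need to be named or even identified.
    lines.append(f"Sentence: {source_sentence}\nEnglish:")
    return "\n".join(lines)

prompt = build_prompt("<sentence in a low-resource language>")
# response = query_llm(prompt)  # hypothetical call to an LLM of your choice
```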
arXiv Detail & Related papers (2023-06-20T08:27:47Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
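For context, the romanization step could look roughly like the sketch below, which runs ISI's uroman.pl over a raw corpus before the mPLM is adapted on the romanized text; the script path and file names are assumptions, and the adaptation step itself is only indicated in a comment.

```python
# A hedged sketch of the romanization step: run ISI's uroman over a raw
# low-resource corpus before adapting the mPLM on the romanized text. The path
# to uroman.pl and the file names are assumptions; the adaptation itself
# (continued pretraining or adapters) is only indicated below.
import subprocess

UROMAN = "/path/to/uroman/bin/uroman.pl"  # hypothetical local checkout

with open("corpus.txt", "rb") as src, open("corpus.rom.txt", "wb") as dst:
    # uroman.pl reads UTF-8 text on stdin and writes romanized text to stdout.
    subprocess.run(["perl", UROMAN], stdin=src, stdout=dst, check=True)

# corpus.rom.txt can then be used for standard continued pretraining
# (e.g. masked language modelling) on top of the multilingual PLM.
```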
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Multilingual Language Model Adaptive Fine-Tuning: A Study on African Languages [19.067718464786463]
We perform multilingual adaptive fine-tuning (MAFT) on the 17 most-resourced African languages and three other high-resource languages widely spoken on the African continent.
To further specialize the multilingual PLM, we removed vocabulary tokens from the embedding layer that correspond to non-African writing scripts before MAFT.
Our approach is competitive with applying language adaptive fine-tuning (LAFT) on individual languages while requiring significantly less disk space.
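The script-based vocabulary filter can be sketched as follows: a subword token is kept only if all of its alphabetic characters belong to an assumed keep-list of scripts; the keep-list below (Latin, Ethiopic, Arabic) and the base model are assumptions for illustration, not the authors' exact choices.

```python
# An illustrative sketch of the script-based filter; the keep-list of scripts
# and the base model are assumptions, not the authors' exact choices.
import unicodedata

from transformers import AutoTokenizer

KEEP_SCRIPTS = ("LATIN", "ETHIOPIC", "ARABIC")

def token_is_kept(token: str) -> bool:
    for ch in token:
        if ch.isalpha() and not unicodedata.name(ch, "").startswith(KEEP_SCRIPTS):
            return False
    return True  # digits, punctuation and subword markers are always kept

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
keep_ids = sorted(
    idx for tok, idx in tokenizer.get_vocab().items() if token_is_kept(tok)
)
print(f"kept {len(keep_ids)} of {len(tokenizer)} vocabulary entries")
# keep_ids can then be used to slice the embedding matrix, as in the
# vocabulary-trimming sketch near the top of this page.
```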
arXiv Detail & Related papers (2022-04-13T16:13:49Z)
- Exploiting Language Relatedness for Low Web-Resource Language Model Adaptation: An Indic Languages Study [14.34516262614775]
We argue that relatedness among languages in a language family may be exploited to overcome some of the corpora limitations of low-resource languages (LRLs).
We focus on Indian languages, and exploit relatedness along two dimensions: (1) script (since many Indic scripts originated from the Brahmic script) and (2) sentence structure.
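The script dimension can be illustrated with a toy transliteration: most Brahmic-derived Unicode blocks are laid out in parallel, so a fixed code-point offset converts much of one script into a related one. This is a rough approximation for illustration only, not the paper's pipeline.

```python
# A toy illustration of script relatedness: parallel Brahmic Unicode blocks can
# be mapped onto each other by a fixed code-point offset. This is a rough
# approximation; script-specific letters and conjuncts need extra care.

BLOCK_START = {"devanagari": 0x0900, "bengali": 0x0980, "gujarati": 0x0A80,
               "telugu": 0x0C00, "kannada": 0x0C80}
BLOCK_SIZE = 0x80  # each Brahmic block spans 128 code points

def transliterate(text: str, src: str, tgt: str) -> str:
    """Shift characters from the source Brahmic block into the target block."""
    src0, tgt0 = BLOCK_START[src], BLOCK_START[tgt]
    out = []
    for ch in text:
        cp = ord(ch)
        if src0 <= cp < src0 + BLOCK_SIZE:
            out.append(chr(cp - src0 + tgt0))
        else:
            out.append(ch)  # spaces, digits, Latin text, etc. pass through
    return "".join(out)

print(transliterate("नमस्ते", "devanagari", "gujarati"))  # -> નમસ્તે
```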
arXiv Detail & Related papers (2021-06-07T20:43:02Z)
- Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT [129.99918589405675]
We present an effective approach that reuses an LM that is pretrained only on the high-resource language.
The monolingual LM is fine-tuned on both languages and is then used to initialize a UNMT model.
Our approach, RE-LM, outperforms a competitive cross-lingual pretraining model (XLM) on English-Macedonian (En-Mk) and English-Albanian (En-Sq).
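A condensed, assumption-laden sketch of the first step as summarised above: continue masked-LM training of an English-pretrained model on the concatenation of high-resource and low-resource text. The model choice, file names and hyper-parameters are placeholders, and initialising the UNMT encoder/decoder from the resulting checkpoint is not shown.

```python
# A condensed sketch: continue masked-LM training of an English-pretrained
# model on concatenated En + Mk text. Model, files and hyper-parameters are
# placeholders; initialising the UNMT model from the checkpoint is not shown.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Both languages go through the same English-pretrained tokenizer and model.
data = load_dataset("text", data_files={"train": ["en.txt", "mk.txt"]})
tokenized = data["train"].map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="relm-en-mk", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()  # the fine-tuned weights then initialise the UNMT model
```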
arXiv Detail & Related papers (2020-09-16T11:37:10Z)