FOCUS: Effective Embedding Initialization for Monolingual Specialization
of Multilingual Models
- URL: http://arxiv.org/abs/2305.14481v2
- Date: Mon, 6 Nov 2023 17:47:47 GMT
- Title: FOCUS: Effective Embedding Initialization for Monolingual Specialization
of Multilingual Models
- Authors: Konstantin Dobler and Gerard de Melo
- Abstract summary: FOCUS represents newly added tokens as combinations of tokens in the overlap of the source and target vocabularies.
We focus our study on using the multilingual XLM-R as a source model.
- Score: 26.598115320351496
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Using model weights pretrained on a high-resource language as a warm start
can reduce the need for data and compute to obtain high-quality language models
for other, especially low-resource, languages. However, if we want to use a new
tokenizer specialized for the target language, we cannot transfer the source
model's embedding matrix. In this paper, we propose FOCUS - Fast Overlapping
Token Combinations Using Sparsemax, a novel embedding initialization method
that initializes the embedding matrix effectively for a new tokenizer based on
information in the source model's embedding matrix. FOCUS represents newly
added tokens as combinations of tokens in the overlap of the source and target
vocabularies. The overlapping tokens are selected based on semantic similarity
in an auxiliary static token embedding space. We focus our study on using the
multilingual XLM-R as a source model and empirically show that FOCUS
outperforms random initialization and previous work in language modeling and on
a range of downstream tasks (NLI, QA, and NER).
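As a rough illustration of the idea described in the abstract, the sketch below copies source embeddings for tokens shared by both vocabularies and initializes each newly added token as a sparsemax-weighted combination of overlapping tokens, with weights taken from cosine similarity in an auxiliary static embedding space. This is a minimal sketch based only on the abstract: names such as `focus_init` and `aux_emb` are illustrative, and details like how the auxiliary fastText space is trained or how missing auxiliary vectors are handled follow the paper and its released code, not this snippet.

```python
# Minimal sketch of a FOCUS-style initialization, based only on the abstract above.
# Overlapping tokens copy their source embeddings; every newly added token is
# initialized as a sparsemax-weighted combination of overlapping tokens, with
# weights derived from cosine similarity in an auxiliary static embedding space.
# Names such as `focus_init` and `aux_emb` are illustrative, not the paper's API.
import numpy as np

def sparsemax(z: np.ndarray) -> np.ndarray:
    """Sparsemax (Martins & Astudillo, 2016): sparse projection onto the simplex."""
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, z.size + 1)
    cumsum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cumsum
    k_z = k[support][-1]
    tau = (cumsum[support][-1] - 1.0) / k_z
    return np.maximum(z - tau, 0.0)

def focus_init(source_emb, source_vocab, target_vocab, aux_emb, seed=0):
    """source_emb: [|V_src|, d]; *_vocab: token -> index; aux_emb: token -> static vector."""
    d = source_emb.shape[1]
    rng = np.random.default_rng(seed)
    target_emb = rng.normal(0.0, 0.02, (len(target_vocab), d))  # fallback init

    overlap = [t for t in target_vocab if t in source_vocab]
    new_tokens = [t for t in target_vocab if t not in source_vocab]

    # 1) Tokens shared by both vocabularies keep their pretrained source embeddings.
    for t in overlap:
        target_emb[target_vocab[t]] = source_emb[source_vocab[t]]

    # 2) New tokens become sparse convex combinations of overlapping tokens.
    overlap_aux = np.stack([aux_emb[t] for t in overlap])
    overlap_aux /= np.linalg.norm(overlap_aux, axis=1, keepdims=True)
    overlap_src = np.stack([source_emb[source_vocab[t]] for t in overlap])

    for t in new_tokens:
        v = aux_emb[t] / np.linalg.norm(aux_emb[t])
        weights = sparsemax(overlap_aux @ v)        # sparse weights, sum to 1
        target_emb[target_vocab[t]] = weights @ overlap_src

    return target_emb
```

Because sparsemax returns exact zeros for dissimilar tokens, each new embedding depends on only a handful of semantically related overlapping tokens rather than on the entire overlap.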
Related papers
- MoSECroT: Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer [50.40191599304911]
We introduce MoSECroT (Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer).
In this paper, we present the first framework that leverages relative representations to construct a common space for the embeddings of a source language PLM and the static word embeddings of a target language.
We show that although our proposed framework is competitive with weak baselines on MoSECroT, it falls short of strong baselines.
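Relative representations, the mechanism named in this summary, can be illustrated briefly: each embedding is re-expressed as its cosine similarities to a shared set of anchors, so that two otherwise incompatible spaces become comparable. The sketch below shows only that generic construction under assumed placeholder inputs, not MoSECroT's actual stitching pipeline.

```python
# Generic sketch of relative representations: re-encode each vector by its cosine
# similarity to a fixed list of anchors. With anchors chosen as translation pairs,
# source-PLM embeddings and target static embeddings land in a comparable space.
# All inputs here are assumed placeholders, not MoSECroT's actual data or API.
import numpy as np

def relative_representation(vectors: np.ndarray, anchors: np.ndarray) -> np.ndarray:
    """vectors: [n, d], anchors: [k, d] -> relative coordinates [n, k]."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    return v @ a.T  # entry (i, j) = cosine similarity of vector i to anchor j

# Two spaces of different dimensionality become comparable once both are expressed
# relative to anchors denoting the same k concepts in each space.
src_rel = relative_representation(np.random.randn(100, 768), np.random.randn(32, 768))
tgt_rel = relative_representation(np.random.randn(100, 300), np.random.randn(32, 300))
assert src_rel.shape == tgt_rel.shape == (100, 32)
```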
arXiv Detail & Related papers (2024-01-09T21:09:07Z)
- OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining [49.213120730582354]
Instead of pretraining multilingual language models from scratch, a more efficient method is to adapt existing pretrained language models (PLMs) to new languages via vocabulary extension and continued pretraining.
We propose a novel framework: $\textbf{O}$ne $\textbf{F}$or $\textbf{A}$ll (OFA), which wisely initializes the embeddings of unseen subwords and thus can adapt a PLM to multiple languages efficiently and effectively.
arXiv Detail & Related papers (2023-11-15T10:40:45Z)
- Tik-to-Tok: Translating Language Models One Token at a Time: An Embedding Initialization Strategy for Efficient Language Adaptation [19.624330093598996]
Training monolingual language models for low and mid-resource languages is made challenging by limited and often inadequate pretraining data.
By generalizing over a word translation dictionary encompassing both the source and target languages, we map tokens from the target tokenizer to semantically similar tokens from the source language tokenizer.
We conduct experiments to convert high-resource models to mid- and low-resource languages, namely Dutch and Frisian.
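A rough sketch of the token-mapping idea described above: each target-tokenizer token is looked up in a target-to-source translation dictionary where possible and otherwise re-tokenized with the source tokenizer, and its embedding is initialized as the mean of the mapped source-token embeddings. The names, the Hugging Face-style tokenizer call, and the mean-pooling fallback are assumptions for illustration; the paper's actual mapping generalizes over the dictionary rather than applying it verbatim.

```python
# Illustrative sketch (not the paper's exact method) of mapping target-tokenizer
# tokens onto semantically similar source-tokenizer tokens via a word translation
# dictionary, then initializing each target embedding from the mapped source rows.
# `translate` and `source_tokenizer` are assumed inputs (Hugging Face-style API).
import numpy as np

def tik_to_tok_init(source_emb, source_tokenizer, target_vocab, translate):
    d = source_emb.shape[1]
    target_emb = np.zeros((len(target_vocab), d))
    for token, idx in target_vocab.items():
        # Use a dictionary translation when available, otherwise the token itself.
        word = translate.get(token, token)
        source_ids = source_tokenizer.encode(word, add_special_tokens=False)
        target_emb[idx] = source_emb[source_ids].mean(axis=0)
    return target_emb
```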
arXiv Detail & Related papers (2023-10-05T11:45:29Z)
- Fusing Sentence Embeddings Into LSTM-based Autoregressive Language Models [20.24851041248274]
We present an LSTM-based autoregressive language model which uses prefix embeddings (from a pretrained masked language model) via fusion.
We find that fusion reliably lowers perplexity (16.74 $\rightarrow$ 15.80), an improvement that is preserved even after transfer to a dataset from a different domain.
We also evaluate the best-performing fusion model by correlating its next word surprisal estimates with human reading times.
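One simple way to realize this kind of fusion, sketched below under assumed dimensions, is to concatenate a fixed prefix embedding from the pretrained masked language model to every token embedding before the LSTM. The paper's actual fusion mechanism may differ, so treat this purely as an illustration of the architecture the summary describes.

```python
# Generic sketch of fusing a fixed prefix/sentence embedding into an LSTM LM by
# concatenating it to every token embedding (one simple fusion variant; the paper
# may use a different mechanism). All sizes and names are illustrative.
import torch
import torch.nn as nn

class FusionLSTMLM(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=256, prefix_dim=768, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim + prefix_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, token_ids, prefix_emb):
        # token_ids: [B, T]; prefix_emb: [B, prefix_dim], e.g. a masked-LM sentence vector
        x = self.embed(token_ids)                                   # [B, T, emb_dim]
        prefix = prefix_emb.unsqueeze(1).expand(-1, x.size(1), -1)  # broadcast over time
        h, _ = self.lstm(torch.cat([x, prefix], dim=-1))
        return self.out(h)                                          # next-token logits
```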
arXiv Detail & Related papers (2022-08-04T02:13:03Z)
- Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training [152.63467944568094]
We propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks using a novel training procedure, referred to as a pseudo-masked language model (PMLM).
Our experiments show that the unified language models pre-trained using PMLM achieve new state-of-the-art results on a wide range of natural language understanding and generation tasks.
arXiv Detail & Related papers (2020-02-28T15:28:49Z)