Dictionaries to the Rescue: Cross-Lingual Vocabulary Transfer for Low-Resource Languages Using Bilingual Dictionaries
- URL: http://arxiv.org/abs/2506.01535v1
- Date: Mon, 02 Jun 2025 10:52:52 GMT
- Title: Dictionaries to the Rescue: Cross-Lingual Vocabulary Transfer for Low-Resource Languages Using Bilingual Dictionaries
- Authors: Haruki Sakajo, Yusuke Ide, Justin Vasselli, Yusuke Sakai, Yingtao Tian, Hidetaka Kamigaito, Taro Watanabe
- Abstract summary: Cross-lingual vocabulary transfer plays a promising role in adapting pre-trained language models to new languages. Existing approaches that utilize monolingual or parallel corpora face challenges when applied to languages with limited resources.
- Score: 22.562544826766917
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-lingual vocabulary transfer plays a promising role in adapting pre-trained language models to new languages, including low-resource languages. Existing approaches that utilize monolingual or parallel corpora face challenges when applied to languages with limited resources. In this work, we propose a simple yet effective vocabulary transfer method that utilizes bilingual dictionaries, which are available for many languages, thanks to descriptive linguists. Our proposed method leverages a property of BPE tokenizers where removing a subword from the vocabulary causes a fallback to shorter subwords. The embeddings of target subwords are estimated iteratively by progressively removing them from the tokenizer. The experimental results show that our approach outperforms existing methods for low-resource languages, demonstrating the effectiveness of a dictionary-based approach for cross-lingual vocabulary transfer.
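The abstract only sketches the mechanism, so the following is a minimal, illustrative Python sketch of the fallback idea it describes: when a subword is absent from the working vocabulary, a BPE-style tokenizer backs off to shorter pieces, and the missing subword's embedding can be estimated from those pieces. The greedy segmenter, the shortest-first ordering, and the averaging step are assumptions made for illustration; the paper's actual iterative removal procedure and its use of bilingual dictionaries are not reproduced here.
```python
# Minimal sketch (not the paper's exact algorithm) of estimating embeddings for
# new target-language subwords via the BPE fallback property: a subword missing
# from the vocabulary decomposes into shorter, already-known pieces.
import numpy as np

def segment(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation; single characters are always allowed."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

def estimate_target_embeddings(target_subwords: list[str],
                               source_emb: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    """Estimate each target subword's embedding from the pieces it falls back to
    when it is itself excluded from the vocabulary (shortest subwords first, so
    longer subwords can reuse previously estimated ones)."""
    emb = dict(source_emb)
    known = set(source_emb)                       # pieces whose embeddings are usable
    dim = len(next(iter(source_emb.values())))
    for sw in sorted(target_subwords, key=len):   # assumed ordering, not from the paper
        pieces = segment(sw, known)               # sw is not in `known`, so it falls back
        vecs = [emb[p] for p in pieces if p in emb]
        emb[sw] = np.mean(vecs, axis=0) if vecs else np.zeros(dim)
        known.add(sw)
    return emb
```
For example, with `source_emb` covering single Latin characters, a hypothetical call `estimate_target_embeddings(["sch", "schule"], source_emb)` would first estimate "sch" from character embeddings and then reuse that estimate when "schule" falls back onto it.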
Related papers
- Adapters for Altering LLM Vocabularies: What Languages Benefit the Most? [23.83290627671739]
VocADT is a novel method for vocabulary adaptation using adapter modules that are trained to learn the optimal linear combination of existing embeddings. We find that Latin-script languages and highly fragmented languages benefit the most from vocabulary adaptation.
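As a rough illustration of the general idea of learning a linear combination of existing embeddings (an assumption about the approach, not VocADT's actual adapter architecture), one might parameterize new-token embeddings as a trainable mixture over a frozen source embedding matrix:
```python
# Illustrative only: new-vocabulary embeddings as a learned (softmax-normalized)
# mixture over frozen source embeddings. The class name and setup are assumptions,
# not the VocADT implementation.
import torch
import torch.nn as nn

class MixtureEmbedding(nn.Module):
    def __init__(self, source_embeddings: torch.Tensor, new_vocab_size: int):
        super().__init__()
        self.source = nn.Parameter(source_embeddings, requires_grad=False)  # |V_src| x d, frozen
        self.mix_logits = nn.Parameter(torch.zeros(new_vocab_size, source_embeddings.size(0)))

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.mix_logits, dim=-1)   # one weight vector per new token
        new_embeddings = weights @ self.source             # |V_new| x d
        return new_embeddings[token_ids]
```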
arXiv Detail & Related papers (2024-10-12T20:45:24Z)
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z)
- Embedding structure matters: Comparing methods to adapt multilingual vocabularies to new languages [20.17308477850864]
Pre-trained multilingual language models underpin a large portion of modern NLP tools outside of English.
We propose several simple techniques to replace a cross-lingual vocabulary with a compact, language-specific one.
arXiv Detail & Related papers (2023-09-09T04:27:18Z)
- Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages [3.716965622352967]
We propose new criteria to evaluate the quality of lexical representation and vocabulary overlap observed in sub-word tokenizers.
Our findings show that the overlap of vocabulary across languages can actually be detrimental to certain downstream tasks.
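The snippet does not spell out the proposed criteria; as a simple stand-in (an assumption, not the paper's metrics), vocabulary overlap between two languages could be measured as the Jaccard similarity of the subword sets a shared tokenizer actually produces on each language's corpus:
```python
# Illustrative Jaccard overlap of the subwords a shared tokenizer actually uses
# on two corpora; a stand-in measure, not the criteria proposed in the paper.
from typing import Callable

def vocabulary_overlap(tokenize: Callable[[str], list[str]],
                       corpus_a: list[str], corpus_b: list[str]) -> float:
    used_a = {tok for sent in corpus_a for tok in tokenize(sent)}
    used_b = {tok for sent in corpus_b for tok in tokenize(sent)}
    return len(used_a & used_b) / len(used_a | used_b)
```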
arXiv Detail & Related papers (2023-05-26T18:06:49Z)
- Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual language models for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z)
- Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training [59.571632468137075]
We find that many languages are under-represented in recent cross-lingual language models due to the limited vocabulary capacity.
We propose an algorithm VoCap to determine the desired vocabulary capacity of each language.
To address these issues, we propose k-NN-based target sampling to accelerate the expensive softmax.
arXiv Detail & Related papers (2021-09-15T14:04:16Z)
- Subword Mapping and Anchoring across Languages [1.9352552677009318]
Subword Mapping and Anchoring across Languages (SMALA) is a method to construct bilingual subword vocabularies.
SMALA extracts subword alignments using an unsupervised state-of-the-art mapping technique.
We show that joint subword vocabularies obtained with SMALA lead to higher BLEU scores on sentences that contain many false positives and false negatives.
arXiv Detail & Related papers (2021-09-09T20:46:27Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Improving Multilingual Models with Language-Clustered Vocabularies [8.587129426070979]
We introduce a novel procedure for multilingual vocabulary generation that combines the separately trained vocabularies of several automatically derived language clusters.
Our experiments show improvements across languages on key multilingual benchmark tasks.
arXiv Detail & Related papers (2020-10-24T04:49:15Z)
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models that fit the word order of the source language might fail to handle target languages with different word orders.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)
- Robust Cross-lingual Embeddings from Parallel Sentences [65.85468628136927]
We propose a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word representations.
Our approach significantly improves cross-lingual sentence retrieval performance over all other approaches.
It also achieves parity with a deep RNN method on a zero-shot cross-lingual document classification task.
arXiv Detail & Related papers (2019-12-28T16:18:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.