Related papers: How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text?

How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text?

URL: http://arxiv.org/abs/2406.11477v2
Date: Mon, 16 Sep 2024 13:55:24 GMT
Title: How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text?
Authors: Atsuki Yamaguchi, Aline Villavicencio, Nikolaos Aletras,
Abstract summary: Large language models (LLMs) have shown remarkable capabilities in many languages beyond English. LLMs require more inference steps when generating non-English text due to their reliance on English-centric tokenizers and vocabulary. Vocabulary expansion with target language tokens is a widely used cross-lingual vocabulary adaptation approach to remedy this issue.
Score: 38.1823640848362
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have shown remarkable capabilities in many languages beyond English. Yet, LLMs require more inference steps when generating non-English text due to their reliance on English-centric tokenizers and vocabulary, resulting in higher usage costs to non-English speakers. Vocabulary expansion with target language tokens is a widely used cross-lingual vocabulary adaptation approach to remedy this issue. Despite its effectiveness in inference speedup, previous work on vocabulary expansion has focused on high-resource settings assuming access to a substantial amount of target language data to effectively initialize the embeddings of the new tokens and adapt the LLM to the target language. However, vocabulary expansion in low-resource settings has yet to be explored. In this paper, we investigate vocabulary expansion in low-resource settings by considering embedding initialization methods and continual pre-training strategies. Through extensive experiments across typologically diverse languages, tasks and models, we establish a set of strategies to perform vocabulary expansion for faster inference, maintaining competitive downstream performance to baselines with only 30K sentences ($\sim$0.01GB text data) from the target language.

Related papers

Efficient Continual Pre-training of LLMs for Low-resource Languages [45.44796295841526]
We develop a new algorithm to select a subset of texts from a larger corpus. In search of further improvement, we design a new algorithm to select tokens to include in the LLM vocabulary.
arXiv Detail & Related papers (2024-12-13T16:13:35Z)
Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models [52.00446751692225]
We present a novel and simple yet effective method called textbfDictionary textbfInsertion textbfPrompting (textbfDIP) When providing a non-English prompt, DIP looks up a word dictionary and inserts words' English counterparts into the prompt for LLMs. It then enables better translation into English and better English model thinking steps which leads to obviously better results.
arXiv Detail & Related papers (2024-11-02T05:10:50Z)
Adapters for Altering LLM Vocabularies: What Languages Benefit the Most? [23.83290627671739]
We propose a novel method for vocabulary adaptation using adapter modules that are trained to learn the optimal linear combination of existing embeddings. VocADT offers a flexible and scalable solution without requiring external resources or language constraints. We find that Latin-script languages and highly fragmented languages benefit the most from vocabulary adaptation.
arXiv Detail & Related papers (2024-10-12T20:45:24Z)
Lens: Rethinking Multilingual Enhancement for Large Language Models [70.85065197789639]
Lens is a novel approach to enhance multilingual capabilities of large language models (LLMs) It operates by manipulating the hidden representations within the language-agnostic and language-specific subspaces from top layers of LLMs. It achieves superior results with much fewer computational resources compared to existing post-training approaches.
arXiv Detail & Related papers (2024-10-06T08:51:30Z)
Accelerating Multilingual Language Model for Excessively Tokenized Languages [3.5570874721859016]
tokenizers in large language models (LLMs) often fragment a text into character or Unicode-level tokens in non-Roman alphabetic languages. We introduce a simple yet effective framework to accelerate text generation in such languages.
arXiv Detail & Related papers (2024-01-19T12:26:57Z)
The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics [74.99898531299148]
This research examines vocabulary trimming (VT) inspired by restricting embedding entries to the language of interest to bolster time and memory efficiency. We apply two languages to trim the full vocabulary - Unicode-based script filtering and corpus-based selection - to different language families and sizes. It is found that VT reduces the memory usage of small models by nearly 50% and has an upper bound of 25% improvement in generation speed.
arXiv Detail & Related papers (2023-11-16T09:35:50Z)
Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training [59.571632468137075]
We find that many languages are under-represented in recent cross-lingual language models due to the limited vocabulary capacity. We propose an algorithm VoCap to determine the desired vocabulary capacity of each language. In order to address the issues, we propose k-NN-based target sampling to accelerate the expensive softmax.
arXiv Detail & Related papers (2021-09-15T14:04:16Z)
UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks. Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages. We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.