Related papers: Large Vocabulary Size Improves Large Language Models

URL: http://arxiv.org/abs/2406.16508v1
Date: Mon, 24 Jun 2024 10:27:07 GMT
Title: Large Vocabulary Size Improves Large Language Models
Authors: Sho Takase, Ryokan Ri, Shun Kiyono, Takuya Kato
Abstract summary: We investigate the relationship between subword vocabulary size and the performance of large language models (LLMs). Experimental results show that larger vocabulary sizes lead to better performance in LLMs. We introduce a simple method to use a new vocabulary instead of the pre-defined one.
Score: 28.83786065307658
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper empirically investigates the relationship between subword vocabulary size and the performance of large language models (LLMs) to provide insights on how to define the vocabulary size. Experimental results show that larger vocabulary sizes lead to better performance in LLMs. Moreover, we consider a continual training scenario where a pre-trained language model is trained on a different target language. We introduce a simple method to use a new vocabulary instead of the pre-defined one. We show that using the new vocabulary outperforms the model with the vocabulary used in pre-training.

Related papers

Scaling LLM Pre-training with Vocabulary Curriculum [0.0]
We introduce vocabulary curriculum learning, an approach that improves pretraining efficiency with log-linear scaling gains relative to vocabulary size. Our method alternates between entropy-guided vocabulary expansion and model optimization, enabling models to learn transferable representations across diverse tokenization granularities. Experiments on small-scale GPT models demonstrate improved scaling efficiency, reinforcing the effectiveness of dynamic tokenization.
arXiv Detail & Related papers (2025-02-25T07:18:29Z)
Adapters for Altering LLM Vocabularies: What Languages Benefit the Most? [23.83290627671739]
We propose a novel method for vocabulary adaptation using adapter modules that are trained to learn the optimal linear combination of existing embeddings. VocADT offers a flexible and scalable solution without requiring external resources or language constraints. We find that Latin-script languages and highly fragmented languages benefit the most from vocabulary adaptation.
arXiv Detail & Related papers (2024-10-12T20:45:24Z)
How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text? [38.1823640848362]
Large language models (LLMs) have shown remarkable capabilities in many languages beyond English. LLMs require more inference steps when generating non-English text due to their reliance on English-centric tokenizers and vocabulary. Vocabulary expansion with target language tokens is a widely used cross-lingual vocabulary adaptation approach to remedy this issue.
arXiv Detail & Related papers (2024-06-17T12:42:34Z)
The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics [74.99898531299148]
This research examines vocabulary trimming (VT), which restricts embedding entries to the language of interest to improve time and memory efficiency. We apply two language heuristics to trim the full vocabulary, Unicode-based script filtering and corpus-based selection, across different language families and model sizes. We find that VT reduces the memory usage of small models by nearly 50% and has an upper bound of 25% improvement in generation speed.
arXiv Detail & Related papers (2023-11-16T09:35:50Z)
Translate to Disambiguate: Zero-shot Multilingual Word Sense Disambiguation with Pretrained Language Models [67.19567060894563]
Pretrained Language Models (PLMs) learn rich cross-lingual knowledge and can be finetuned to perform well on diverse tasks. We present a new study investigating how well PLMs capture cross-lingual word sense with Contextual Word-Level Translation (C-WLT). We find that as the model size increases, PLMs encode more cross-lingual word sense knowledge and better use context to improve WLT performance.
arXiv Detail & Related papers (2023-04-26T19:55:52Z)
Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types. Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
Better Language Model with Hypernym Class Prediction [101.8517004687825]
Class-based language models (LMs) have been long devised to address context sparsity in $n$-gram LMs. In this study, we revisit this approach in the context of neural LMs.
arXiv Detail & Related papers (2022-03-21T01:16:44Z)
Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training [59.571632468137075]
We find that many languages are under-represented in recent cross-lingual language models due to the limited vocabulary capacity. We propose an algorithm, VoCap, to determine the desired vocabulary capacity of each language. To address the cost of the resulting larger vocabulary, we propose k-NN-based target sampling to accelerate the expensive softmax.
arXiv Detail & Related papers (2021-09-15T14:04:16Z)
Does He Wink or Does He Nod? A Challenging Benchmark for Evaluating Word Understanding of Language Models [0.6091702876917281]
Recent progress in pretraining language models on large corpora has resulted in large performance gains on many NLP tasks. To assess what kind of knowledge is acquired, language models are commonly probed by querying them with 'fill in the blank' style cloze questions. We introduce WDLMPro to evaluate word understanding directly using dictionary definitions of words.
arXiv Detail & Related papers (2021-02-06T15:15:57Z)
Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size. We propose a fully compositional output embedding layer for language models. To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)

This list is automatically generated from the titles and abstracts of the papers on this site.