MultiTok: Variable-Length Tokenization for Efficient LLMs Adapted from LZW Compression
- URL: http://arxiv.org/abs/2410.21548v2
- Date: Tue, 21 Jan 2025 19:45:18 GMT
- Title: MultiTok: Variable-Length Tokenization for Efficient LLMs Adapted from LZW Compression
- Authors: Noel Elias, Homa Esfahanizadeh, Kaan Kale, Sriram Vishwanath, Muriel Medard
- Abstract summary: MultiTok is a new tokenization method inspired by universal Lempel-Ziv-Welch data compression.
We show that MultiTok achieves performance comparable to the BERT and GPT-2 standards, both as a stand-alone tokenizer and as an add-on to existing tokenizers.
- Score: 5.5795785998430185
- Abstract: Large language models (LLMs) have drastically changed the prospects of AI by introducing technologies for more complex natural language processing. However, current methodologies for training such LLMs require extensive resources, including but not limited to large amounts of data, expensive machinery, and lengthy training. To address this problem, this paper proposes a new tokenization method, inspired by universal Lempel-Ziv-Welch data compression, that compresses repetitive phrases into multi-word tokens. With MultiTok as a new tokenizing tool, we show that language models can be trained notably more efficiently while offering similar accuracy on more succinct and compressed training data. In fact, our results demonstrate that MultiTok achieves performance comparable to the BERT and GPT-2 standards, both as a stand-alone tokenizer and as an add-on to existing tokenizers, while also providing close to 2.5x faster training with more than 30% less training data.
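As a concrete illustration of the LZW-inspired idea described in the abstract, below is a minimal sketch of a multi-word tokenizer that greedily merges repeated phrases into single dictionary entries. This is not the authors' implementation; the function name multitok_encode, the word-level splitting, and the dictionary seeding are assumptions made for demonstration only.

```python
# Minimal LZW-style multi-word tokenizer sketch (illustrative, not the paper's code).
# Assumes the input has already been split into base tokens, e.g., words or the
# output of an existing tokenizer that a MultiTok-like scheme would sit on top of.

def multitok_encode(base_tokens):
    """Greedily merge repeated phrases into single dictionary entries, LZW-style."""
    # Seed the dictionary with every distinct base token (the "alphabet").
    dictionary = {}
    for tok in base_tokens:
        dictionary.setdefault((tok,), len(dictionary))

    output = []   # ids of emitted (possibly multi-word) tokens
    phrase = ()   # phrase currently being extended
    for tok in base_tokens:
        candidate = phrase + (tok,)
        if candidate in dictionary:
            # Keep extending the current phrase while it is already known.
            phrase = candidate
        else:
            # Emit the longest known phrase, then learn the extended one.
            output.append(dictionary[phrase])
            dictionary[candidate] = len(dictionary)
            phrase = (tok,)
    if phrase:
        output.append(dictionary[phrase])
    return output, dictionary


if __name__ == "__main__":
    text = "the cat sat on the mat and the cat sat on the rug"
    ids, vocab = multitok_encode(text.split())
    print(f"{len(text.split())} base tokens -> {len(ids)} multi-word tokens")
```

On this toy sentence, repeated phrases such as "the cat" and "sat on" are emitted as single dictionary entries the second time they appear, which illustrates how repeated phrases shrink the token stream fed to the language model.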
Related papers
- Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning [44.84219266082269]
Large Language Models (LLMs) excel at reasoning and planning when trained on chain-of-thought (CoT) data.
We propose a hybrid representation of the reasoning process, where we partially abstract away the initial reasoning steps using latent discrete tokens.
arXiv Detail & Related papers (2025-02-05T15:33:00Z)
- Enhancing Code Generation for Low-Resource Languages: No Silver Bullet [55.39571645315926]
Large Language Models (LLMs) rely on large and diverse datasets to learn syntax, semantics, and usage patterns of programming languages.
For low-resource languages, the limited availability of such data hampers the models' ability to generalize effectively.
We present an empirical study investigating the effectiveness of several approaches for boosting LLMs' performance on low-resource languages.
arXiv Detail & Related papers (2025-01-31T12:23:28Z)
- Advancing Multimodal Large Language Models with Quantization-Aware Scale Learning for Efficient Adaptation [70.22782550540714]
We introduce a Quantization-aware Scale LeArning method based on multimodal Warmup, termed QSLAW.
arXiv Detail & Related papers (2024-08-07T12:42:09Z)
- Patch-Level Training for Large Language Models [69.67438563485887]
This paper introduces patch-level training for Large Language Models (LLMs).
During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch.
Following this, the model continues token-level training on the remaining training data to align with the inference mode.
arXiv Detail & Related papers (2024-07-17T15:48:39Z)
- Training LLMs over Neurally Compressed Text [55.11828645767342]
This paper explores the idea of training large language models (LLMs) over highly compressed text.
We propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length.
We demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks.
arXiv Detail & Related papers (2024-04-04T17:48:28Z) - LLMLingua: Compressing Prompts for Accelerated Inference of Large
Language Models [22.06402870816756]
Large language models (LLMs) have been applied in various applications due to their astonishing capabilities.
This paper presents LLMLingua, a coarse-to-fine prompt compression method that involves a budget controller to maintain semantic integrity.
We show that the proposed approach yields state-of-the-art performance and allows for up to 20x compression with little performance loss.
arXiv Detail & Related papers (2023-10-09T14:10:21Z)
- CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed.
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z)
- The first step is the hardest: Pitfalls of Representing and Tokenizing Temporal Data for Large Language Models [10.414206635385632]
Large Language Models (LLMs) have demonstrated remarkable generalization across diverse tasks.
A notable obstacle emerges when feeding numerical/temporal data, such as data sourced from wearables or electronic health records, into these models.
We discuss recent works that employ LLMs for human-centric tasks such as in mobile health sensing and present a case study showing that popular LLMs tokenize temporal data incorrectly.
arXiv Detail & Related papers (2023-09-12T13:51:29Z)
- Chain-of-Dictionary Prompting Elicits Translation in Large Language Models [100.47154959254937]
Large language models (LLMs) have shown surprisingly good performance in multilingual neural machine translation (MNMT).
We present a novel method, CoD, which augments LLMs with prior knowledge in the form of chains of multilingual dictionaries for a subset of input words to elicit translation abilities.
arXiv Detail & Related papers (2023-05-11T05:19:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.