Rethinking Tokenization: Crafting Better Tokenizers for Large Language
Models
- URL: http://arxiv.org/abs/2403.00417v1
- Date: Fri, 1 Mar 2024 10:03:07 GMT
- Title: Rethinking Tokenization: Crafting Better Tokenizers for Large Language
Models
- Authors: Jinbiao Yang
- Abstract summary: Tokenization significantly influences language models(LMs)' performance.
This paper traces the evolution of tokenizers from word-level to subword-level, analyzing how they balance tokens and types.
The Less-is-Better (LiB) model could be a new approach for LLM tokenizer.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tokenization significantly influences language models(LMs)' performance. This
paper traces the evolution of tokenizers from word-level to subword-level,
analyzing how they balance tokens and types to enhance model adaptability while
controlling complexity. Despite subword tokenizers like Byte Pair Encoding
(BPE) overcoming many word tokenizer limitations, they encounter difficulties
in handling non-Latin languages and depend heavily on extensive training data
and computational resources to grasp the nuances of multiword expressions
(MWEs). This article argues that tokenizers, more than mere technical tools,
should drawing inspiration from the cognitive science about human language
processing. This study then introduces the "Principle of Least Effort" from
cognitive science, that humans naturally seek to reduce cognitive effort, and
discusses the benefits of this principle for tokenizer development. Based on
this principle, the paper proposes that the Less-is-Better (LiB) model could be
a new approach for LLM tokenizer. The LiB model can autonomously learn an
integrated vocabulary consisting of subwords, words, and MWEs, which
effectively reduces both the numbers of tokens and types. Comparative
evaluations show that the LiB tokenizer outperforms existing word and BPE
tokenizers, presenting an innovative method for tokenizer development, and
hinting at the possibility of future cognitive science-based tokenizers being
more efficient.
Related papers
- Tokenization Is More Than Compression [14.939912120571728]
Existing tokenization approaches like Byte-Pair.
(BPE) originate from the field of data compression.
We introduce PathPiece, a new tokenizer that segments a document's text into the minimum number of tokens for a given vocabulary.
arXiv Detail & Related papers (2024-02-28T14:52:15Z) - Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually-grounded text perturbation methods like typos and word order shuffling, resonating with human cognitive patterns, and enabling perturbation to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z) - Improving Language Models Meaning Understanding and Consistency by
Learning Conceptual Roles from Dictionary [65.268245109828]
Non-human-like behaviour of contemporary pre-trained language models (PLMs) is a leading cause undermining their trustworthiness.
A striking phenomenon is the generation of inconsistent predictions, which produces contradictory results.
We propose a practical approach that alleviates the inconsistent behaviour issue by improving PLM awareness.
arXiv Detail & Related papers (2023-10-24T06:15:15Z) - Tokenization with Factorized Subword Encoding [2.538209532048867]
We propose a novel tokenization method that factorizes subwords onto discrete triplets using a VQ-VAE model.
Results indicate that this method is more appropriate and robust for morphological tasks than the commonly used byte-pair encoding (BPE) tokenization algorithm.
arXiv Detail & Related papers (2023-06-13T13:27:34Z) - Tokenization Impacts Multilingual Language Modeling: Assessing
Vocabulary Allocation and Overlap Across Languages [3.716965622352967]
We propose new criteria to evaluate the quality of lexical representation and vocabulary overlap observed in sub-word tokenizers.
Our findings show that the overlap of vocabulary across languages can be actually detrimental to certain downstream tasks.
arXiv Detail & Related papers (2023-05-26T18:06:49Z) - A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task
Learning [8.052271364177988]
Subword tokenization is a commonly used input pre-processing step in most recent NLP models.
We propose a vocabulary-free neural tokenizer by distilling segmentation information from subword tokenization.
Our tokenizer consistently improves performance on multilingual (NLI) and code-switching (sentiment analysis) tasks.
arXiv Detail & Related papers (2022-04-22T16:50:49Z) - Impact of Tokenization on Language Models: An Analysis for Turkish [2.4660652494309936]
We train tokenizers and pretrain medium-sized language models using RoBERTa pretraining procedure on the Turkish split of the OSCAR corpus.
Our experiments, supported by statistical tests, reveal that Morphological-level tokenizer has challenging performance with de facto tokenizers.
We find that increasing the vocabulary size improves the performance of Morphological and Word-level tokenizers more than that of de facto tokenizers.
arXiv Detail & Related papers (2022-04-19T12:01:46Z) - More Than Words: Collocation Tokenization for Latent Dirichlet
Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z) - ERICA: Improving Entity and Relation Understanding for Pre-trained
Language Models via Contrastive Learning [97.10875695679499]
We propose a novel contrastive learning framework named ERICA in pre-training phase to obtain a deeper understanding of the entities and their relations in text.
Experimental results demonstrate that our proposed ERICA framework achieves consistent improvements on several document-level language understanding tasks.
arXiv Detail & Related papers (2020-12-30T03:35:22Z) - Coreferential Reasoning Learning for Language Representation [88.14248323659267]
We present CorefBERT, a novel language representation model that can capture the coreferential relations in context.
The experimental results show that, compared with existing baseline models, CorefBERT can achieve significant improvements consistently on various downstream NLP tasks.
arXiv Detail & Related papers (2020-04-15T03:57:45Z) - Byte Pair Encoding is Suboptimal for Language Model Pretraining [49.30780227162387]
We analyze differences between unigram LM tokenization and byte-pair encoding (BPE)
We find that the unigram LM tokenization method matches or outperforms BPE across downstream tasks and two languages.
We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
arXiv Detail & Related papers (2020-04-07T21:21:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.