Adaptive BPE Tokenization for Enhanced Vocabulary Adaptation in Finetuning Pretrained Language Models
- URL: http://arxiv.org/abs/2410.03258v1
- Date: Fri, 4 Oct 2024 09:24:55 GMT
- Title: Adaptive BPE Tokenization for Enhanced Vocabulary Adaptation in Finetuning Pretrained Language Models
- Authors: Gunjan Balde, Soumyadeep Roy, Mainack Mondal, Niloy Ganguly
- Abstract summary: We show a fundamental limitation in vocabulary adaptation approaches that use the Byte-Pair Encoding (BPE) tokenization scheme for fine-tuning pretrained language models (PLMs) to expert domains.
We propose AdaptBPE, where the BPE tokenization initialization phase is modified to first perform longest string matching on the added (target) vocabulary before tokenizing at the character level.
- Score: 26.442558912559658
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we show a fundamental limitation in vocabulary adaptation approaches that use the Byte-Pair Encoding (BPE) tokenization scheme for fine-tuning pretrained language models (PLMs) to expert domains. Current approaches trivially append the target domain-specific vocabulary at the end of the PLM vocabulary. This assigns the added vocabulary a lower priority score and causes sub-optimal tokenization by BPE, which iteratively applies merge rules to tokenize a given text. To mitigate this issue, we propose AdaptBPE, where the BPE tokenization initialization phase is modified to first perform longest string matching on the added (target) vocabulary before tokenizing at the character level. We perform an extensive evaluation of AdaptBPE versus the standard BPE over various classification and summarization tasks; AdaptBPE improves by 3.57% (in terms of accuracy) and 1.87% (in terms of Rouge-L), respectively. AdaptBPE for MEDVOC works particularly well when reference summaries have a high OOV concentration or are longer. We also conduct a human evaluation, revealing that AdaptBPE generates more relevant and more faithful summaries compared to MEDVOC. We make our codebase publicly available at https://github.com/gb-kgp/adaptbpe.
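To make the mechanism above concrete, the following is a minimal sketch of the AdaptBPE idea as described in the abstract: match the longest string from the added (target) vocabulary first, then fall back to character-level initialization and standard BPE merges. It is an illustrative reconstruction, not the authors' released code; the toy `added_vocab` and `merge_ranks` values are assumptions, and the repository linked above is the authoritative implementation.

```python
# Illustrative sketch of AdaptBPE-style tokenization of a single word:
# 1) greedily match the longest span found in the added (target) vocabulary,
# 2) initialize the rest at the character level,
# 3) apply ordinary BPE merges by priority (lower rank = higher priority).

def adapt_bpe_tokenize(word, added_vocab, merge_ranks):
    symbols, i = [], 0
    while i < len(word):
        match = None
        for j in range(len(word), i, -1):          # longest match first
            if word[i:j] in added_vocab:
                match = word[i:j]
                break
        if match is not None:
            symbols.append(match)                   # kept as one atomic symbol
            i += len(match)
        else:
            symbols.append(word[i])                 # character-level fallback
            i += 1

    # Standard BPE phase: repeatedly apply the highest-priority merge rule.
    while True:
        best_idx, best_rank = None, None
        for k in range(len(symbols) - 1):
            rank = merge_ranks.get((symbols[k], symbols[k + 1]))
            if rank is not None and (best_rank is None or rank < best_rank):
                best_idx, best_rank = k, rank
        if best_idx is None:
            break
        symbols[best_idx:best_idx + 2] = [symbols[best_idx] + symbols[best_idx + 1]]
    return symbols


# Toy example (hypothetical vocabulary and merge table):
added_vocab = {"hyperglycemia"}
merge_ranks = {("r", "i"): 0, ("ri", "s"): 1, ("ris", "k"): 2}
print(adapt_bpe_tokenize("hyperglycemiarisk", added_vocab, merge_ranks))
# -> ['hyperglycemia', 'risk']
```

Because the in-domain term is matched whole during initialization, it surfaces as a single token without needing high-priority merge rules, which is what simply appending it to the end of the PLM vocabulary fails to guarantee.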
Related papers
- LBPE: Long-token-first Tokenization to Improve Large Language Models [26.3619552256488]
Long tokens, rich in semantic information, have fewer occurrences in tokenized datasets compared to short tokens.
We propose LBPE, which prioritizes long tokens during the encoding process (a minimal sketch of this idea appears after this list).
Experiments across diverse language modeling tasks demonstrate that LBPE consistently outperforms the original BPE.
arXiv Detail & Related papers (2024-11-08T12:03:36Z)
- BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training [8.012203293561196]
Picky BPE is a modified BPE algorithm that carries out vocabulary refinement during tokenizer training.
Our method improves vocabulary efficiency, eliminates under-trained tokens, and does not compromise text compression.
arXiv Detail & Related papers (2024-09-06T20:12:34Z)
- Batching BPE Tokenization Merges [55.2480439325792]
BatchBPE is an open-source, pure-Python implementation of the Byte Pair Encoding algorithm.
It can be used to train a high-quality tokenizer on a basic laptop.
arXiv Detail & Related papers (2024-08-05T09:37:21Z)
- MEDVOC: Vocabulary Adaptation for Fine-tuning Pre-trained Language Models on Medical Text Summarization [26.442558912559658]
This work presents a dynamic vocabulary adaptation strategy, MEDVOC, for fine-tuning pre-trained language models (PLMs).
In contrast to existing domain adaptation approaches in summarization, MEDVOC treats vocabulary as an optimizable parameter.
Our human evaluation shows MEDVOC generates more faithful medical summaries.
arXiv Detail & Related papers (2024-05-07T10:00:00Z)
- Scaffold-BPE: Enhancing Byte Pair Encoding for Large Language Models with Simple and Effective Scaffold Token Removal [58.29382184006158]
We propose Scaffold-BPE, which incorporates a dynamic scaffold token removal mechanism by parameter-free, computation-light, and easy-to-implement modifications to the original BPE method.
In extensive experiments across language modeling and even machine translation, Scaffold-BPE consistently outperforms the original BPE.
arXiv Detail & Related papers (2024-04-27T07:12:07Z)
- Tokenization Is More Than Compression [14.939912120571728]
Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression.
We introduce PathPiece, a new tokenizer that segments a document's text into the minimum number of tokens for a given vocabulary (a minimal sketch of this minimum-token objective appears after this list).
arXiv Detail & Related papers (2024-02-28T14:52:15Z)
- OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining [49.213120730582354]
Instead of pretraining multilingual language models from scratch, a more efficient method is to adapt existing pretrained language models (PLMs) to new languages via vocabulary extension and continued pretraining.
We propose a novel framework: $\textbf{O}$ne $\textbf{F}$or $\textbf{A}$ll, which wisely initializes the embeddings of unseen subwords and thus can adapt a PLM to multiple languages efficiently and effectively.
arXiv Detail & Related papers (2023-11-15T10:40:45Z)
- SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation [51.881877192924414]
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT).
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
arXiv Detail & Related papers (2023-07-31T04:38:47Z)
- Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation [80.38621085548013]
This paper introduces Dynamic Programming Encoding (DPE), a new segmentation algorithm for tokenizing sentences into subword units.
A mixed character-subword transformer is proposed, which enables exact log marginal likelihood estimation and exact MAP inference to find target segmentations.
arXiv Detail & Related papers (2020-05-03T05:00:50Z)
- Byte Pair Encoding is Suboptimal for Language Model Pretraining [49.30780227162387]
We analyze differences between unigram LM tokenization and byte-pair encoding (BPE).
We find that the unigram LM tokenization method matches or outperforms BPE across downstream tasks and two languages.
We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
arXiv Detail & Related papers (2020-04-07T21:21:06Z)
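The LBPE entry above prioritizes long tokens during encoding; the sketch referenced there is given below. It assumes a standard BPE merge-rank table and simply picks, at each step, the applicable merge that yields the longest token, breaking ties by rank. This is a hedged illustration of the long-token-first idea, not the published LBPE algorithm, which may define its priority differently.

```python
# Illustrative long-token-first BPE encoding (in the spirit of LBPE):
# among applicable merges, prefer the one producing the longest merged token,
# falling back to the usual merge rank to break ties.

def long_token_first_encode(word, merge_ranks):
    symbols = list(word)
    while True:
        candidates = [
            (k, symbols[k], symbols[k + 1])
            for k in range(len(symbols) - 1)
            if (symbols[k], symbols[k + 1]) in merge_ranks
        ]
        if not candidates:
            break
        k, left, right = min(
            candidates,
            key=lambda c: (-(len(c[1]) + len(c[2])), merge_ranks[(c[1], c[2])]),
        )
        symbols[k:k + 2] = [left + right]
    return symbols


# Toy merge table (hypothetical):
merge_ranks = {("l", "o"): 0, ("lo", "w"): 1, ("o", "w"): 2}
print(long_token_first_encode("low", merge_ranks))   # -> ['low']
```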
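Similarly, the PathPiece entry (Tokenization Is More Than Compression) states that text is segmented into the minimum number of tokens for a given vocabulary. The dynamic-programming sketch below illustrates that minimum-token objective on plain strings; it is a simplified assumption-based sketch, since the actual PathPiece tokenizer operates on bytes and handles details not shown here.

```python
# Illustrative minimum-token segmentation via dynamic programming:
# best[i] is the fewest tokens needed to cover text[:i].

def min_token_segmentation(text, vocab, max_token_len=16):
    n = len(text)
    INF = float("inf")
    best = [INF] * (n + 1)
    back = [None] * (n + 1)           # start index of the last token ending at i
    best[0] = 0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_token_len), i):
            if best[j] + 1 < best[i] and text[j:i] in vocab:
                best[i], back[i] = best[j] + 1, j
    if best[n] == INF:
        raise ValueError("text cannot be covered by the given vocabulary")
    tokens, i = [], n                 # walk the backpointers to recover tokens
    while i > 0:
        j = back[i]
        tokens.append(text[j:i])
        i = j
    return tokens[::-1]


# Toy vocabulary (hypothetical): DP finds 2 tokens where greedy longest-match finds 3.
vocab = {"a", "ab", "b", "c", "d", "bcd"}
print(min_token_segmentation("abcd", vocab))   # -> ['a', 'bcd']
```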
This list is automatically generated from the titles and abstracts of the papers on this site.