Scaffold-BPE: Enhancing Byte Pair Encoding with Simple and Effective Scaffold Token Removal
- URL: http://arxiv.org/abs/2404.17808v1
- Date: Sat, 27 Apr 2024 07:12:07 GMT
- Title: Scaffold-BPE: Enhancing Byte Pair Encoding with Simple and Effective Scaffold Token Removal
- Authors: Haoran Lian, Yizhe Xiong, Jianwei Niu, Shasha Mo, Zhenpeng Su, Zijia Lin, Peng Liu, Hui Chen, Guiguang Ding
- Abstract summary: We propose Scaffold-BPE, which incorporates a dynamic scaffold token removal mechanism via parameter-free, computation-light, and easy-to-implement modifications to the original BPE.
In extensive experiments on language modeling and machine translation tasks, Scaffold-BPE consistently outperforms the original BPE.
- Score: 25.406520591282366
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Byte Pair Encoding (BPE) serves as a foundational method for text tokenization in Natural Language Processing (NLP). Despite its wide adoption, the original BPE algorithm harbors an inherent flaw: it inadvertently introduces a frequency imbalance among tokens in the text corpus. Since BPE iteratively merges the most frequent token pair in the corpus while keeping every token it has ever merged in the vocabulary, it unavoidably retains tokens that primarily represent subwords of complete words and rarely appear on their own. We term such tokens Scaffold Tokens. Because they seldom appear standalone in the corpus, Scaffold Tokens pose a learning imbalance issue for language models. To address this issue, we propose Scaffold-BPE, which incorporates a dynamic scaffold token removal mechanism via parameter-free, computation-light, and easy-to-implement modifications to the original BPE. This approach excludes low-frequency Scaffold Tokens from the token representations of given texts, thereby mitigating the frequency imbalance and facilitating model training. In extensive experiments on language modeling and machine translation tasks, Scaffold-BPE consistently outperforms the original BPE, demonstrating its effectiveness.
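The abstract describes the removal mechanism only at a high level. As an illustration, here is a toy Python sketch of BPE training by iterative pair merging, followed by a pass that flags merged tokens that rarely appear on their own in the final tokenization. The fixed `min_freq` threshold is an assumption for illustration; the paper's actual criterion is dynamic and parameter-free.

```python
from collections import Counter

def merge_pair(word, pair):
    """Apply one merge (a, b) -> a+b throughout a word (tuple of tokens)."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)

def train_bpe(words, num_merges):
    """Toy BPE: `words` maps symbol tuples to corpus frequencies."""
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w, f in words.items():
            for p in zip(w, w[1:]):
                pairs[p] += f
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        words = Counter({merge_pair(w, best): f for w, f in words.items()})
    return merges, words

def scaffold_tokens(words, merges, min_freq=2):
    """Flag merged tokens that rarely surface on their own after all merges.
    `min_freq` is a hypothetical threshold; Scaffold-BPE's dynamic,
    parameter-free criterion is not reproduced here."""
    standalone = Counter()
    for w, f in words.items():
        for tok in w:
            standalone[tok] += f
    merged = {a + b for a, b in merges}
    return {t for t in merged if standalone[t] < min_freq}

# "lo" is produced on the way to "low" and "lower" but never surfaces
# alone, so it is flagged as a scaffold token in this toy corpus.
corpus = Counter({("l", "o", "w"): 6, ("l", "o", "w", "e", "r"): 5, ("e", "r"): 4})
merges, tokenized = train_bpe(corpus, num_merges=4)
print(merges)                              # [('l','o'), ('lo','w'), ('e','r'), ('low','er')]
print(scaffold_tokens(tokenized, merges))  # {'lo'}
```

In the full method, texts would then be re-encoded without the removed tokens; that step is omitted from this sketch.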
Related papers
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
- On the Effectiveness of Acoustic BPE in Decoder-Only TTS [16.013858075350054]
Discretizing speech into tokens and generating them with a decoder-only model has been a promising direction for text-to-speech (TTS) and spoken language modeling (SLM).
To shorten speech token sequences, acoustic byte-pair encoding (BPE) has emerged in SLM; it treats speech tokens derived from self-supervised semantic representations as characters and further compresses the token sequence.
We conduct a study on various settings of acoustic BPE to explore its effectiveness in decoder-only TTS models with semantic speech tokens (see the sketch after this entry).
arXiv Detail & Related papers (2024-07-04T12:35:32Z)
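To make the acoustic-BPE idea above concrete, here is a toy sketch in which discrete speech units (e.g., k-means indices over self-supervised features) are mapped to single characters so that a BPE-style merge loop can shorten the sequence. The private-use Unicode offset and the greedy merge loop are illustrative assumptions, not the paper's exact pipeline.

```python
from collections import Counter

def units_to_chars(units):
    # Map each unit ID into the Unicode private-use area so every unit
    # becomes a single character for BPE purposes (offset is arbitrary).
    return "".join(chr(0xE000 + u) for u in units)

def compress(seqs, num_merges):
    # Toy BPE over unit-characters: repeatedly merge the most frequent
    # adjacent pair into one multi-unit token.
    seqs = [list(s) for s in seqs]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))
        if not pairs:
            break
        a, b = pairs.most_common(1)[0][0]
        merges.append((a, b))
        for s in seqs:
            i = 0
            while i < len(s) - 1:
                if (s[i], s[i + 1]) == (a, b):
                    s[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, seqs

# Repeated unit runs, common in speech, compress well:
utts = [units_to_chars([5, 5, 12, 12, 12, 7]), units_to_chars([5, 5, 12, 7])]
merges, encoded = compress(utts, num_merges=3)
print([len(u) for u in utts], "->", [len(s) for s in encoded])  # [6, 4] -> [3, 2]
```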
- SEP: Self-Enhanced Prompt Tuning for Visual-Language Model [68.68025991850115]
We introduce a novel approach named Self-Enhanced Prompt Tuning (SEP).
SEP explicitly incorporates discriminative prior knowledge to enhance both textual-level and visual-level embeddings.
Comprehensive evaluations across various benchmarks and tasks confirm SEP's efficacy in prompt tuning.
arXiv Detail & Related papers (2024-05-24T13:35:56Z)
- A cost minimization approach to fix the vocabulary size in a tokenizer for an End-to-End ASR system [10.70500939394669]
Tokenization algorithms like Byte Pair Encoding (BPE) and WordPiece are popular for identifying the tokens used in the overall training process of a speech recognition system.
We show through experiments on the LibriSpeech 100-hour set that the performance of an end-to-end ASR system improves when the number of tokens is chosen carefully.
arXiv Detail & Related papers (2024-04-29T12:16:21Z)
- Tokenization Is More Than Compression [15.689084780238597]
Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression, and it has been suggested that the effectiveness of BPE stems from its ability to condense text into a relatively small number of tokens.
We test the hypothesis that fewer tokens lead to better downstream performance by introducing PathPiece, a new tokenizer that segments a document's text into the minimum number of tokens for a given vocabulary (see the sketch after this entry).
arXiv Detail & Related papers (2024-02-28T14:52:15Z)
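PathPiece's minimum-token objective can be realized with a short dynamic program over a fixed vocabulary. The sketch below is an illustrative reconstruction, not the released tokenizer: the vocabulary, the maximum token length, and the tie-breaking are all assumptions.

```python
def min_token_segmentation(text, vocab, max_len=16):
    """Split `text` into the fewest tokens drawn from `vocab` (dynamic
    programming over prefix lengths); returns None if uncoverable."""
    INF = float("inf")
    best = [0] + [INF] * len(text)   # best[i] = fewest tokens covering text[:i]
    back = [0] * (len(text) + 1)     # start index of the token ending at i
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_len), i):
            if best[j] + 1 < best[i] and text[j:i] in vocab:
                best[i] = best[j] + 1
                back[i] = j
    if best[-1] == INF:
        return None  # a real tokenizer would fall back to bytes here
    tokens, i = [], len(text)
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

# Unlike greedy longest-match, the DP is guaranteed minimal for any vocabulary:
vocab = {"l", "o", "w", "e", "r", "lo", "low", "er", "lower"}
print(min_token_segmentation("lowerlow", vocab))  # ['lower', 'low']
```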
- Byte Pair Encoding Is All You Need For Automatic Bengali Speech Recognition [0.0]
Byte pair encoding (BPE) emerges as an effective tokenization method for tackling the out-of-vocabulary (OOV) challenge.
Recent research highlights the dependency of BPE subword tokenization's efficacy on the morphological nature of the language.
Our study empirically identifies the optimal number of BPE tokens for Bengali, a language known for its morphological complexity.
arXiv Detail & Related papers (2024-01-28T00:41:21Z)
- Object Recognition as Next Token Prediction [99.40793702627396]
We present an approach to pose object recognition as next token prediction.
The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels.
arXiv Detail & Related papers (2023-12-04T18:58:40Z)
- Assessing Phrase Break of ESL speech with Pre-trained Language Models [6.635783609515407]
This work introduces an approach to assessing phrase break in ESL learners' speech with pre-trained language models (PLMs).
Unlike traditional methods, this proposal converts speech into token sequences and then leverages the power of PLMs.
arXiv Detail & Related papers (2022-10-28T10:06:06Z)
- Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
- Token-level Adaptive Training for Neural Machine Translation [84.69646428587548]
There exists a token imbalance phenomenon in natural language, as different tokens appear with different frequencies.
A vanilla NMT model usually adopts trivial equal-weighted objectives for target tokens with different frequencies.
Low-frequency tokens may carry critical semantic information that will affect the translation quality once they are neglected (see the sketch after this entry).
arXiv Detail & Related papers (2020-10-09T05:55:05Z)
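The entry above argues for frequency-aware objectives; below is a minimal sketch of one such weighting applied to a per-token negative log-likelihood. The monotone weight function is an assumption for illustration, not the paper's exact adaptive scheme.

```python
import math
from collections import Counter

def token_weights(freqs):
    # Up-weight rare target tokens: weight decays from 1 as frequency
    # grows. This particular form is an illustrative assumption.
    return {t: 1.0 / math.log(math.e + f) for t, f in freqs.items()}

def weighted_nll(step_log_probs, targets, weights):
    # step_log_probs: one dict of token -> log-probability per step.
    # A vanilla objective would use weight 1.0 for every target token.
    return -sum(weights[t] * lp[t] for lp, t in zip(step_log_probs, targets))

freqs = Counter({"the": 10000, "cat": 300, "ocelot": 3})
weights = token_weights(freqs)
steps = [{"the": -0.1, "cat": -3.0, "ocelot": -6.0},
         {"the": -2.0, "cat": -0.5, "ocelot": -5.5},
         {"the": -4.0, "cat": -4.5, "ocelot": -0.9}]
print(weighted_nll(steps, ["the", "cat", "ocelot"], weights))
# Rare "ocelot" contributes the most per nat of loss; frequent "the" the least.
```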
- Byte Pair Encoding is Suboptimal for Language Model Pretraining [49.30780227162387]
We analyze differences between unigram LM tokenization and byte-pair encoding (BPE).
We find that the unigram LM tokenization method matches or outperforms BPE across downstream tasks and two languages.
We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
arXiv Detail & Related papers (2020-04-07T21:21:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.