Understanding and Mitigating Tokenization Bias in Language Models
- URL: http://arxiv.org/abs/2406.16829v2
- Date: Fri, 5 Jul 2024 21:49:08 GMT
- Title: Understanding and Mitigating Tokenization Bias in Language Models
- Authors: Buu Phan, Marton Havasi, Matthew Muckley, Karen Ullrich,
- Abstract summary: State-of-the-art language models are autoregressive and operate on subword units known as tokens.
We show that popular encoding schemes induce a sampling bias that cannot be mitigated with more training or data.
We propose a novel algorithm to obtain unbiased estimates from any language model trained on tokenized data.
- Score: 6.418593476658017
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: State-of-the-art language models are autoregressive and operate on subword units known as tokens. Specifically, one must encode the conditioning string into a list of tokens before passing to the language models for next-token prediction. We show that popular encoding schemes, such as maximum prefix encoding (MPE) and byte-pair-encoding (BPE), induce a sampling bias that cannot be mitigated with more training or data. To counter this universal problem, for each encoding scheme above, we propose a novel algorithm to obtain unbiased estimates from any language model trained on tokenized data. Our methods do not require finetuning the model, and the complexity, defined as the number of model runs, scales linearly with the sequence length in the case of MPE. As a result, we show that one can simulate token-free behavior from a tokenized language model. We empirically verify the correctness of our method through a Markov-chain setup, where it accurately recovers the transition probabilities, as opposed to the conventional method of directly prompting tokens into the language model.
Related papers
- Tokenization as Finite-State Transduction [24.19959327497118]
We introduce a finite-state framework which can efficiently encode all possible tokenizations of a regular language.
We show that Byte-Pair.
Match (BPE) and MaxPiece (WordPiece) fit within this framework.
An application of this is to guided generation, where the outputs of a language model are constrained to match some pattern.
arXiv Detail & Related papers (2024-10-21T07:10:07Z) - Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs)
We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
arXiv Detail & Related papers (2024-07-22T18:00:00Z) - MiLe Loss: a New Loss for Mitigating the Bias of Learning Difficulties in Generative Language Models [40.992566245706996]
We propose a MiLe Loss function for mitigating the bias of learning difficulties with tokens.
We train generative language models at different scales of 468M, 1.2B, and 6.7B parameters.
Experiments reveal that models incorporating the proposed MiLe Loss can gain consistent performance improvement on downstream benchmarks.
arXiv Detail & Related papers (2023-10-30T13:33:21Z) - Thutmose Tagger: Single-pass neural model for Inverse Text Normalization [76.87664008338317]
Inverse text normalization (ITN) is an essential post-processing step in automatic speech recognition.
We present a dataset preparation method based on the granular alignment of ITN examples.
One-to-one correspondence between tags and input words improves the interpretability of the model's predictions.
arXiv Detail & Related papers (2022-07-29T20:39:02Z) - Quark: Controllable Text Generation with Reinforced Unlearning [68.07749519374089]
Large-scale language models often learn behaviors that are misaligned with user expectations.
We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property.
For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods.
arXiv Detail & Related papers (2022-05-26T21:11:51Z) - You should evaluate your language model on marginal likelihood
overtokenisations [5.824498637088864]
We argue that language models should be evaluated on their marginal likelihood over tokenisations.
We evaluate pretrained English and German language models on both the one-best-tokenisation and marginal perplexities.
arXiv Detail & Related papers (2021-09-06T15:37:02Z) - Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word
Alignment [49.45399359826453]
Cross-lingual language models are typically pretrained with language modeling on multilingual text or parallel sentences.
We introduce denoising word alignment as a new cross-lingual pre-training task.
Experimental results show that our method improves cross-lingual transferability on various datasets.
arXiv Detail & Related papers (2021-06-11T13:36:01Z) - CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language
Representation [12.005340904206697]
CANINE is a neural encoder that operates directly on character sequences without explicit tokenization or vocabulary.
CanINE outperforms a comparable mBERT model by >= 1 F1 on TyDi QA, a challenging multilingual benchmark.
arXiv Detail & Related papers (2021-03-11T18:57:44Z) - COCO-LM: Correcting and Contrasting Text Sequences for Language Model
Pretraining [59.169836983883656]
COCO-LM is a new self-supervised learning framework that pretrains Language Models by COrrecting challenging errors and COntrasting text sequences.
COCO-LM employs an auxiliary language model to mask-and-predict tokens in original text sequences.
Our analyses reveal that COCO-LM's advantages come from its challenging training signals, more contextualized token representations, and regularized sequence representations.
arXiv Detail & Related papers (2021-02-16T22:24:29Z) - Fast End-to-End Speech Recognition via a Non-Autoregressive Model and
Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once)
The model consists of an encoder, a decoder, and a position dependent summarizer (PDS)
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.