From Language Models over Tokens to Language Models over Characters
- URL: http://arxiv.org/abs/2412.03719v1
- Date: Wed, 04 Dec 2024 21:19:20 GMT
- Title: From Language Models over Tokens to Language Models over Characters
- Authors: Tim Vieira, Ben LeBrun, Mario Giulianelli, Juan Luis Gastaldi, Brian DuSell, John Terilla, Timothy J. O'Donnell, Ryan Cotterell
- Abstract summary: Modern language models are internally -- and mathematically -- distributions over token strings rather than character strings.
This paper presents algorithms for converting token-level language models to character-level ones.
- Score: 54.123846188068384
- License:
- Abstract: Modern language models are internally -- and mathematically -- distributions over token strings rather than \emph{character} strings, posing numerous challenges for programmers building user applications on top of them. For example, if a prompt is specified as a character string, it must be tokenized before passing it to the token-level language model. Thus, the tokenizer and consequent analyses are very sensitive to the specification of the prompt (e.g., if the prompt ends with a space or not). This paper presents algorithms for converting token-level language models to character-level ones. We present both exact and approximate algorithms. In the empirical portion of the paper, we benchmark the practical runtime and approximation quality. We find that -- even with a small computation budget -- our method is able to accurately approximate the character-level distribution (less than 0.00021 excess bits / character) at reasonably fast speeds (46.3 characters / second) on the Llama 3.1 8B language model.
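As a rough illustration of the conversion described in the abstract, the sketch below shows the underlying idea only: the probability of a complete character string is the total probability of every token string that decodes to it. This is not the paper's exact or approximate algorithm, and the toy unigram vocabulary is purely illustrative.
```python
# Minimal sketch (not the paper's algorithm): sum P(token string) over all
# tokenizations whose concatenation equals the target character string.
from functools import lru_cache

TOKEN_PROBS = {"a": 0.3, "b": 0.2, "ab": 0.25, "ba": 0.15}  # toy P(next token)
EOS_PROB = 0.1                                              # toy P(end of string)

@lru_cache(maxsize=None)
def char_string_prob(chars: str) -> float:
    """Character-level probability of `chars` under the toy token-level unigram model."""
    if chars == "":
        return EOS_PROB
    return sum(
        p * char_string_prob(chars[len(tok):])
        for tok, p in TOKEN_PROBS.items()
        if chars.startswith(tok)
    )

# "ab" is produced by the token strings ("ab",) and ("a", "b"):
# 0.25 * 0.1  +  0.3 * 0.2 * 0.1  =  0.031
print(char_string_prob("ab"))
```
The paper's exact and approximate algorithms target autoregressive transformer language models and character prefixes rather than this enumerative toy setting.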
Related papers
- Retrofitting Large Language Models with Dynamic Tokenization [3.608780819053423]
We propose retrofitting current language models with dynamic tokenization.
We merge frequent subword sequences in a batch, then apply a pre-trained embedding-prediction hypernetwork to compute the token embeddings on-the-fly.
We find that dynamic tokenization can mitigate the limitations of static tokenization by substantially improving inference speed and promoting fairness across languages.
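A minimal sketch of what the batch-level merging step could look like, assuming a BPE-style greedy merge of the most frequent adjacent subword pair; the function name and merge count are illustrative, and the embedding-prediction hypernetwork is omitted.
```python
# Hedged sketch of batch-level subword merging only (not the paper's implementation);
# the hypernetwork that predicts embeddings for merged tokens is not shown.
from collections import Counter

def merge_frequent_pairs(batch, num_merges=2):
    """Greedily merge the most frequent adjacent subword pair, `num_merges` times."""
    for _ in range(num_merges):
        pair_counts = Counter(
            (seq[i], seq[i + 1]) for seq in batch for i in range(len(seq) - 1)
        )
        if not pair_counts:
            break
        (a, b), _ = pair_counts.most_common(1)[0]
        merged, new_batch = a + b, []
        for seq in batch:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_batch.append(out)
        batch = new_batch
    return batch

print(merge_frequent_pairs([["un", "believ", "able"], ["un", "believ", "ably"]]))
```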
arXiv Detail & Related papers (2024-11-27T17:51:58Z)
- Understanding and Mitigating Tokenization Bias in Language Models [6.418593476658017]
State-of-the-art language models are autoregressive and operate on subword units known as tokens.
We show that popular encoding schemes induce a sampling bias that cannot be mitigated with more training or data.
We propose a novel algorithm to obtain unbiased estimates from any language model trained on tokenized data.
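One way to see the kind of boundary sensitivity behind such bias is to compare how a standard BPE tokenizer encodes a prompt with and without a trailing space. This is only an illustration, not the paper's unbiased-estimation algorithm; it assumes the Hugging Face `transformers` package and downloads the GPT-2 tokenizer.
```python
# Illustration of prompt-boundary sensitivity (not the paper's debiasing algorithm).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

for prompt in ["The capital of France is", "The capital of France is "]:
    ids = tok.encode(prompt)
    print(repr(prompt), "->", tok.convert_ids_to_tokens(ids))
# The trailing space typically ends up in its own token rather than merging with the
# next word, so the token-level model is conditioned on a different prefix.
```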
arXiv Detail & Related papers (2024-06-24T17:38:02Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is their ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- Toucan: Token-Aware Character Level Language Modeling [44.85590844938571]
Toucan is an augmentation to character-level models that makes them "token-aware".
We show significant speed-ups in character generation without a loss in language modeling performance.
Our approach results in a greater number of longer sequences being tokenized as single items.
arXiv Detail & Related papers (2023-11-15T00:57:51Z)
- Learn Your Tokens: Word-Pooled Tokenization for Language Modeling [11.40976202290724]
Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuristic for combining characters into longer strings.
Recent attempts to compress and limit context lengths with fixed-size convolutions are helpful but completely ignore word boundaries.
This paper considers an alternative 'learn your word' scheme which utilizes the word boundary to pool bytes/characters into word representations.
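A simplified sketch of word-boundary pooling under assumed choices (random character embeddings, mean pooling, whitespace boundaries); it is not the paper's architecture, and the embedding table and dimensions are illustrative.
```python
# Simplified sketch: mean-pool character embeddings within whitespace-delimited words.
import numpy as np

rng = np.random.default_rng(0)
CHAR_EMB = {c: rng.normal(size=8) for c in "abcdefghijklmnopqrstuvwxyz"}

def word_pooled_representations(text: str) -> np.ndarray:
    """Return one pooled vector per whitespace-delimited word."""
    words = text.split()
    pooled = [np.mean([CHAR_EMB[c] for c in word], axis=0) for word in words]
    return np.stack(pooled)  # shape: (num_words, 8)

print(word_pooled_representations("learn your tokens").shape)  # (3, 8)
```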
arXiv Detail & Related papers (2023-10-17T23:34:39Z)
- Lexinvariant Language Models [84.2829117441298]
Token embeddings, a mapping from discrete lexical symbols to continuous vectors, are at the heart of any language model (LM).
We study lexinvariant language models that are invariant to lexical symbols and therefore do not need fixed token embeddings in practice.
We show that a lexinvariant LM can attain perplexity comparable to that of a standard language model, given a sufficiently long context.
arXiv Detail & Related papers (2023-05-24T19:10:46Z)
- Better Than Whitespace: Information Retrieval for Languages without Custom Tokenizers [48.036317742487796]
We propose a new approach to tokenization for lexical matching retrieval algorithms.
We use the WordPiece tokenizer, which can be built automatically from unsupervised data.
Results show that the mBERT tokenizer provides strong relevance signals for retrieval "out of the box", outperforming whitespace tokenization on most languages.
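A hedged sketch of building a WordPiece tokenizer from raw text with the Hugging Face `tokenizers` library and treating its subwords as indexing terms for lexical matching; the tiny corpus and vocabulary size are illustrative, and the paper itself reports results with the pre-built mBERT tokenizer.
```python
# Sketch: train a WordPiece tokenizer from unsupervised text and tokenize a query.
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

corpus = ["language models over tokens", "language models over characters"]

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train_from_iterator(corpus, WordPieceTrainer(vocab_size=60, special_tokens=["[UNK]"]))

# Index documents and queries as bags of WordPiece tokens instead of whitespace words.
print(tokenizer.encode("characters over tokens").tokens)
```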
arXiv Detail & Related papers (2022-10-11T14:32:46Z)
- What do tokens know about their characters and how do they know it? [3.8254443661593633]
We show that pre-trained language models that use subword tokenization schemes can succeed at a variety of language tasks that require character-level information.
We show that these models robustly encode character-level information and, in general, larger models perform better at the task.
arXiv Detail & Related papers (2022-06-06T13:27:26Z)
- Charformer: Fast Character Transformers via Gradient-based Subword Tokenization [50.16128796194463]
We propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model.
We introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters.
We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level.
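A highly simplified sketch in the spirit of soft subword pooling (not the actual GBST module): each character position mixes mean-pooled candidate blocks of several sizes, weighted by a softmax over scores from an assumed linear scorer.
```python
# Simplified soft subword pooling over character embeddings (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def soft_subword_pool(X: np.ndarray, w: np.ndarray, block_sizes=(1, 2, 3, 4)) -> np.ndarray:
    """X: (length, dim) character embeddings; w: (dim,) scoring vector."""
    L, d = X.shape
    candidates, scores = [], []
    for b in block_sizes:
        # Mean-pool non-overlapping blocks of size b, indexed back to every position.
        pooled = np.stack([X[i - (i % b): i - (i % b) + b].mean(axis=0) for i in range(L)])
        candidates.append(pooled)   # (L, d) candidate representation at block size b
        scores.append(pooled @ w)   # (L,) score per position
    C, S = np.stack(candidates), np.stack(scores)        # (B, L, d), (B, L)
    E = np.exp(S - S.max(axis=0))
    P = E / E.sum(axis=0)                                # softmax over block sizes
    return (P[..., None] * C).sum(axis=0)                # (L, d) soft-pooled output

chars = rng.normal(size=(12, 16))                        # toy character embeddings
print(soft_subword_pool(chars, rng.normal(size=16)).shape)  # (12, 16)
```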
arXiv Detail & Related papers (2021-06-23T22:24:14Z)
- 2kenize: Tying Subword Sequences for Chinese Script Conversion [54.33749520569979]
We propose a model that can disambiguate between mappings and convert between the two scripts.
Our proposed method outperforms previous Chinese character conversion approaches by 6 points in accuracy.
arXiv Detail & Related papers (2020-05-07T10:53:05Z)