Toucan: Token-Aware Character Level Language Modeling
- URL: http://arxiv.org/abs/2311.08620v1
- Date: Wed, 15 Nov 2023 00:57:51 GMT
- Title: Toucan: Token-Aware Character Level Language Modeling
- Authors: William Fleshman and Benjamin Van Durme
- Abstract summary: Toucan is an augmentation to character-level models to make them "token-aware".
We show significant speed-ups in character generation without a loss in language modeling performance.
Our approach leads to a greater number of long sequences being tokenized as single items.
- Score: 44.85590844938571
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Character-level language models obviate the need for separately trained
tokenizers, but efficiency suffers from longer sequence lengths. Learning to
combine character representations into tokens has made training these models
more efficient, but they still require decoding characters individually. We
propose Toucan, an augmentation to character-level models to make them
"token-aware". Comparing our method to prior work, we demonstrate significant
speed-ups in character generation without a loss in language modeling
performance. We then explore differences between our learned dynamic
tokenization of character sequences with popular fixed vocabulary solutions
such as Byte-Pair Encoding and WordPiece, finding our approach leads to a
greater amount of longer sequences tokenized as single items. Our project and
code are available at https://nlp.jhu.edu/nuggets/.
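
The abstract above does not spell out the decoding mechanism, but the claimed speed-up comes from committing to more than one character per decoding step instead of generating characters strictly one at a time. The following is a minimal, hypothetical sketch of that general idea only; the GRU backbone, the span-length head, and the multi-character output head are illustrative assumptions, not the Toucan architecture.

```python
# Hypothetical sketch only -- NOT the Toucan implementation. It illustrates
# "token-aware" character decoding: a single forward pass commits to a span
# of characters rather than exactly one character per step.
import torch
import torch.nn as nn


class TokenAwareCharDecoder(nn.Module):
    def __init__(self, vocab_size=256, d_model=128, max_span=8):
        super().__init__()
        self.max_span = max_span
        self.embed = nn.Embedding(vocab_size, d_model)
        self.backbone = nn.GRU(d_model, d_model, batch_first=True)  # stand-in backbone
        # Predicts up to `max_span` next characters from one hidden state.
        self.char_heads = nn.Linear(d_model, max_span * vocab_size)
        # Predicts how many of those characters to actually commit this step.
        self.len_head = nn.Linear(d_model, max_span)

    @torch.no_grad()
    def generate(self, prefix_ids, steps=5):
        ids = prefix_ids.clone()  # (1, prefix_len); batch size 1 for clarity
        for _ in range(steps):
            hidden, _ = self.backbone(self.embed(ids))
            last = hidden[:, -1]                                 # (1, d_model)
            span_len = int(self.len_head(last).argmax(-1)) + 1   # chars to emit now
            logits = self.char_heads(last).view(1, self.max_span, -1)
            chars = logits.argmax(-1)                            # (1, max_span)
            ids = torch.cat([ids, chars[:, :span_len]], dim=1)   # emit a whole span
        return ids


if __name__ == "__main__":
    model = TokenAwareCharDecoder()
    prefix = torch.tensor([[ord(c) for c in "the "]])
    print(model.generate(prefix, steps=3).shape)  # prefix plus several chars per step
```

The point of the sketch is that the cost of the backbone pass is amortized over a predicted span of characters, which is where a generation speed-up of this kind comes from; the paper's actual mechanism presumably differs in its details.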
Related papers
- LBPE: Long-token-first Tokenization to Improve Large Language Models [26.3619552256488]
Long tokens, rich in semantic information, have fewer occurrences in tokenized datasets compared to short tokens.
We propose LBPE, which prioritizes long tokens during the encoding process.
Experiments across diverse language modeling tasks demonstrate that LBPE consistently outperforms the original BPE (a toy BPE sketch, for illustration only, follows this list).
arXiv Detail & Related papers (2024-11-08T12:03:36Z) - LoTLIP: Improving Language-Image Pre-training for Long Text Understanding [71.04947115945349]
We relabel the data with long captions; however, directly training on them can degrade performance on short-text understanding.
We then help the model recover its original level of short-text understanding while greatly enhancing its capability for long-text understanding.
Our method demonstrates superior performance in long-text-image retrieval tasks.
arXiv Detail & Related papers (2024-10-07T17:52:56Z) - Learning Mutually Informed Representations for Characters and Subwords [26.189422354038978]
We introduce the entanglement model, aiming to combine character and subword language models.
Inspired by vision-language models, our model treats characters and subwords as separate modalities.
We evaluate our model on text classification, named entity recognition, POS-tagging, and character-level sequence labeling.
arXiv Detail & Related papers (2023-11-14T02:09:10Z) - Learn Your Tokens: Word-Pooled Tokenization for Language Modeling [11.40976202290724]
Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuristic for combining tokens into longer strings.
Recent attempts to compress and limit context lengths with fixed-size convolutions are helpful but completely ignore word boundaries.
This paper considers an alternative 'learn your word' scheme which utilizes the word boundary to pool bytes/characters into word representations.
arXiv Detail & Related papers (2023-10-17T23:34:39Z) - Language Model Tokenizers Introduce Unfairness Between Languages [98.92630681729518]
We show how disparity in the treatment of different languages arises at the tokenization stage, well before a model is even invoked.
Character-level and byte-level models likewise exhibit encoding lengths that differ by more than a factor of 4 for some language pairs.
We make the case that we should train future language models using multilingually fair subword tokenizers.
arXiv Detail & Related papers (2023-05-17T14:17:57Z) - What do tokens know about their characters and how do they know it? [3.8254443661593633]
We show that pre-trained language models that use subword tokenization schemes can succeed at a variety of language tasks that require character-level information.
We show that these models robustly encode character-level information and, in general, larger models perform better at the task.
arXiv Detail & Related papers (2022-06-06T13:27:26Z) - Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens [22.55706811131828]
We probe the embedding layer of pretrained language models.
We show that models learn the internal character composition of whole word and subword tokens.
arXiv Detail & Related papers (2021-08-25T11:48:05Z) - More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z) - Charformer: Fast Character Transformers via Gradient-based Subword Tokenization [50.16128796194463]
We propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model.
We introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters (a simplified sketch of this idea follows this list).
We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level.
arXiv Detail & Related papers (2021-06-23T22:24:14Z) - SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining [48.880840711568425]
We study the influences of three main factors on the Chinese tokenization for pretrained language models.
We propose three kinds of tokenizers, among them 1) SHUOWEN (meaning Talk Word), the pronunciation-based tokenizers, and 2) JIEZI (meaning Solve Character), the glyph-based tokenizers.
We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
arXiv Detail & Related papers (2021-06-01T11:20:02Z)
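
For contrast with the learned dynamic tokenization in the abstract above, the fixed-vocabulary baselines it mentions (Byte-Pair Encoding, WordPiece) and the LBPE entry in this list all build a merge table offline and then segment text deterministically, as shown in the toy sketch below. This is a generic BPE illustration only; the corpus, the number of merges, and the function names are arbitrary, and it is not code from any of the papers listed here.

```python
# Generic toy Byte-Pair Encoding, for illustration only: learn the most frequent
# adjacent-symbol merges from a tiny corpus, then replay them to segment a word
# into fixed-vocabulary tokens.
from collections import Counter


def learn_bpe(words, num_merges):
    corpus = Counter(tuple(w) for w in words)  # each word starts as single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent adjacent pair
        merges.append(best)
        new_corpus = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])  # merge the pair
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges


def encode(word, merges):
    symbols = list(word)
    for a, b in merges:                        # replay merges in learned order
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


if __name__ == "__main__":
    merges = learn_bpe(["lower", "lowest", "newer", "wider"] * 5, num_merges=10)
    print(encode("lowering", merges))  # e.g. ['lower', 'i', 'n', 'g']
```

Because the merge table is frozen after training, every occurrence of a string is segmented the same way regardless of context; that context-insensitivity is what learned dynamic tokenization is being contrasted against.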
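
The Charformer entry above describes a soft gradient-based subword tokenization module (GBST) that pools characters into latent subword representations; the sketch referenced there is given below. It is a deliberately simplified reading of that idea (mean-pool candidate blocks of several sizes at each position, mix them with a softmax over learned scores, then downsample); the class name and hyperparameters are illustrative, and details of the actual GBST module are omitted.

```python
# Simplified, illustration-only take on soft subword pooling over characters.
# Not the actual GBST module from Charformer; it only shows the shape of the idea.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftSubwordPooling(nn.Module):
    def __init__(self, d_model=64, block_sizes=(1, 2, 3, 4), downsample=2):
        super().__init__()
        self.block_sizes = block_sizes
        self.downsample = downsample
        self.score = nn.Linear(d_model, 1)     # scores each candidate block size

    def forward(self, x):                      # x: (batch, seq, d_model) char embeddings
        b, s, d = x.shape
        candidates, scores = [], []
        for bs in self.block_sizes:
            pad = (-s) % bs
            xp = F.pad(x, (0, 0, 0, pad))                      # pad seq to a multiple of bs
            blocks = xp.view(b, -1, bs, d).mean(2)             # mean-pool each block
            up = blocks.repeat_interleave(bs, dim=1)[:, :s]    # back to per-character positions
            candidates.append(up)
            scores.append(self.score(up))
        cand = torch.stack(candidates, dim=2)                  # (batch, seq, n_sizes, d)
        weights = torch.softmax(torch.stack(scores, dim=2), dim=2)  # soft block-size choice
        mixed = (weights * cand).sum(2)                        # (batch, seq, d)
        pad = (-s) % self.downsample                           # downsample to latent subwords
        mixed = F.pad(mixed, (0, 0, 0, pad))
        return mixed.view(b, -1, self.downsample, d).mean(2)


if __name__ == "__main__":
    chars = torch.randn(1, 12, 64)                 # pretend character embeddings
    print(SoftSubwordPooling()(chars).shape)       # torch.Size([1, 6, 64])
```

A real module of this kind is trained end-to-end with the downstream model so the block scores learn useful segmentations; here the weights are random and only the structure of the computation is shown.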
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.