Toucan: Token-Aware Character Level Language Modeling
- URL: http://arxiv.org/abs/2311.08620v1
- Date: Wed, 15 Nov 2023 00:57:51 GMT
- Title: Toucan: Token-Aware Character Level Language Modeling
- Authors: William Fleshman and Benjamin Van Durme
- Abstract summary: Toucan is an augmentation to character-level models to make them "token-aware".
We show significant speed-ups in character generation without a loss in language modeling performance.
Our approach leads to a greater amount of longer sequences tokenized as single items.
- Score: 44.85590844938571
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Character-level language models obviate the need for separately trained
tokenizers, but efficiency suffers from longer sequence lengths. Learning to
combine character representations into tokens has made training these models
more efficient, but they still require decoding characters individually. We
propose Toucan, an augmentation to character-level models to make them
"token-aware". Comparing our method to prior work, we demonstrate significant
speed-ups in character generation without a loss in language modeling
performance. We then explore differences between our learned dynamic
tokenization of character sequences and popular fixed vocabulary solutions
such as Byte-Pair Encoding and WordPiece, finding our approach leads to a
greater amount of longer sequences tokenized as single items. Our project and
code are available at https://nlp.jhu.edu/nuggets/.
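For readers unfamiliar with the fixed-vocabulary baselines named in the abstract, the following is a minimal toy sketch of Byte-Pair Encoding. It is not Toucan's method and not code from the paper; the function names (`learn_bpe_merges`, `apply_merge`, `bpe_tokenize`) and the tiny corpus are hypothetical and chosen only for illustration. It shows how a merge table learned once from a corpus turns character sequences into fewer, longer tokens.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy version)."""
    # Every word starts as a sequence of single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = Counter({tuple(apply_merge(s, best)): f for s, f in vocab.items()})
    return merges


def bpe_tokenize(word, merges):
    """Tokenize `word` by applying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


if __name__ == "__main__":
    merges = learn_bpe_merges(["low", "low", "low", "lower"], num_merges=2)
    print(merges)                         # [('l', 'o'), ('lo', 'w')]
    print(bpe_tokenize("lower", merges))  # ['low', 'e', 'r']
```

In this sketch the merge table is frozen after training, so the segmentation of any new word is determined entirely by the stored merges. That is the sense in which such vocabularies are "fixed", in contrast to the learned dynamic tokenization the abstract describes.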
Related papers
- Learning Mutually Informed Representations for Characters and Subwords [26.189422354038978]
We introduce the entanglement model, aiming to combine character and subword language models.
Inspired by vision-language models, our model treats characters and subwords as separate modalities.
We evaluate our model on text classification, named entity recognition, POS-tagging, and character-level sequence labeling.
arXiv Detail & Related papers (2023-11-14T02:09:10Z)
- Understanding the Role of Input Token Characters in Language Models: How Does Information Loss Affect Performance? [45.53600782873268]
We study how information loss in input token characters affects the performance of pre-training language models.
Surprisingly, we find that pre-training even under extreme settings, i.e. using only one character of each token, the performance retention in standard NLU benchmarks and probing tasks is high.
For instance, a model pre-trained only on single first characters from tokens achieves performance retention of approximately 90% and 77% of the full-token model in SuperGLUE and GLUE tasks, respectively.
arXiv Detail & Related papers (2023-10-26T09:47:50Z)
- Learn Your Tokens: Word-Pooled Tokenization for Language Modeling [11.40976202290724]
Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuristic for combining tokens into longer strings.
Recent attempts to compress and limit context lengths with fixed-size convolutions are helpful but completely ignore the word boundary.
This paper considers an alternative 'learn your word' scheme which utilizes the word boundary to pool bytes/characters into word representations.
arXiv Detail & Related papers (2023-10-17T23:34:39Z)
- Language Model Tokenizers Introduce Unfairness Between Languages [98.92630681729518]
We show how disparity in the treatment of different languages arises at the tokenization stage, well before a model is even invoked.
Character-level and byte-level models also exhibit over 4 times the difference in the encoding length for some language pairs.
We make the case that we should train future language models using multilingually fair subword tokenizers.
arXiv Detail & Related papers (2023-05-17T14:17:57Z)
- What do tokens know about their characters and how do they know it? [3.8254443661593633]
We show that pre-trained language models that use subword tokenization schemes can succeed at a variety of language tasks that require character-level information.
We show that these models robustly encode character-level information and, in general, larger models perform better at the task.
arXiv Detail & Related papers (2022-06-06T13:27:26Z)
- Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens [22.55706811131828]
We probe the embedding layer of pretrained language models.
We show that models learn the internal character composition of whole word and subword tokens.
arXiv Detail & Related papers (2021-08-25T11:48:05Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- Charformer: Fast Character Transformers via Gradient-based Subword Tokenization [50.16128796194463]
We propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model.
We introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters.
We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level.
arXiv Detail & Related papers (2021-06-23T22:24:14Z)
- Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word Alignment [49.45399359826453]
Cross-lingual language models are typically pretrained with language modeling on multilingual text or parallel sentences.
We introduce denoising word alignment as a new cross-lingual pre-training task.
Experimental results show that our method improves cross-lingual transferability on various datasets.
arXiv Detail & Related papers (2021-06-11T13:36:01Z)
- SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining [48.880840711568425]
We study the influences of three main factors on the Chinese tokenization for pretrained language models.
We propose three kinds of tokenizers: 1) SHUOWEN (meaning Talk Word), the pronunciation-based tokenizers; 2) JIEZI (meaning Solve Character), the glyph-based tokenizers.
We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
arXiv Detail & Related papers (2021-06-01T11:20:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.