Byte BPE Tokenization as an Inverse string Homomorphism
- URL: http://arxiv.org/abs/2412.03160v1
- Date: Wed, 04 Dec 2024 09:38:11 GMT
- Title: Byte BPE Tokenization as an Inverse string Homomorphism
- Authors: Saibo Geng, Sankalp Gambhir, Chris Wendler, Robert West
- Abstract summary: We show that tokenization acts as an inverse homomorphism between strings and tokens.
This suggests that the character space of the source language and the token space of the tokenized language are homomorphic.
We also explore the concept of proper tokenization, which refers to an unambiguous tokenization returned from the tokenizer.
- Abstract: Tokenization is an important preprocessing step in the training and inference of large language models (LLMs). While there has been extensive research on the expressive power of the neural architectures used in LLMs, the impact of tokenization has not been well understood. In this work, we demonstrate that tokenization, irrespective of the algorithm used, acts as an inverse homomorphism between strings and tokens. This suggests that the character space of the source language and the token space of the tokenized language are homomorphic, preserving the structural properties of the source language. Additionally, we explore the concept of proper tokenization, which refers to an unambiguous tokenization returned from the tokenizer. Our analysis reveals that the expressiveness of neural architectures in recognizing context-free languages is not affected by tokenization.
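The inverse-homomorphism claim in the abstract can be illustrated with a small sketch. It assumes a hypothetical four-token vocabulary `VOCAB` and uses a greedy longest-match rule as a stand-in for a real tokenizer; neither is taken from the paper, and the function names are illustrative only. Detokenization is concatenation of token strings, so it distributes over sequence concatenation (a string homomorphism). Tokenization then selects one element from the set of token sequences that detokenize to the input.

```python
# Hypothetical toy vocabulary (token id -> string piece); not taken from the paper.
VOCAB = {0: "a", 1: "b", 2: "ab", 3: "ba"}


def detokenize(tokens):
    """Concatenate the string pieces of a token sequence (the homomorphism)."""
    return "".join(VOCAB[t] for t in tokens)


def inverse_image(text):
    """Enumerate every token sequence whose detokenization equals `text`."""
    if not text:
        return [()]
    results = []
    for tok, piece in VOCAB.items():
        if text.startswith(piece):
            for tail in inverse_image(text[len(piece):]):
                results.append((tok,) + tail)
    return results


def tokenize(text):
    """Greedy longest-match stand-in for a tokenizer (an assumption, not BPE).

    Raises ValueError if no vocabulary piece matches the remaining text.
    """
    tokens = []
    while text:
        tok, piece = max(
            ((t, p) for t, p in VOCAB.items() if text.startswith(p)),
            key=lambda pair: len(pair[1]),
        )
        tokens.append(tok)
        text = text[len(piece):]
    return tuple(tokens)


if __name__ == "__main__":
    print(inverse_image("aba"))  # all token sequences that spell "aba"
    print(tokenize("aba"))       # the single sequence the tokenizer returns
```

In this toy setting the string "aba" has several token-level spellings, and the tokenizer returns exactly one of them, which loosely mirrors the abstract's notion of a proper tokenization.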
Related papers
- On the Proper Treatment of Tokenization in Psycholinguistics [53.960910019072436]
The paper argues that token-level language models should be marginalized into character-level language models before they are used in psycholinguistic studies.
We find various focal areas whose surprisal is a better psychometric predictor than the surprisal of the region of interest itself.
arXiv Detail & Related papers (2024-10-03T17:18:03Z)
- Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement [5.223020867766102]
We investigate how different tokenization schemes impact number agreement in Spanish plurals.
We find that morphologically-aligned tokenization performs similarly to other tokenization schemes.
Our results indicate that morphological tokenization is not strictly required for performance.
arXiv Detail & Related papers (2024-03-20T17:01:56Z)
- Rethinking Tokenization: Crafting Better Tokenizers for Large Language Models [0.0]
Tokenization significantly influences the performance of language models (LMs).
This paper traces the evolution of tokenizers from word-level to subword-level, analyzing how they balance tokens and types.
The Less-is-Better (LiB) model could be a new approach for LLM tokenizers.
arXiv Detail & Related papers (2024-03-01T10:03:07Z)
- How Important Is Tokenization in French Medical Masked Language Models? [7.866517623371908]
Subword tokenization has become the prevailing standard in the field of natural language processing (NLP).
This paper seeks to delve into the complexities of subword tokenization in the French biomedical domain across a variety of NLP tasks.
We introduce an original tokenization strategy that integrates morpheme-enriched word segmentation into existing tokenization methods.
arXiv Detail & Related papers (2024-02-22T23:11:08Z)
- Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually-grounded text perturbation methods like typos and word order shuffling, resonating with human cognitive patterns, and enabling perturbation to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z)
- Analyzing Cognitive Plausibility of Subword Tokenization [9.510439539246846]
Subword tokenization has become the de-facto standard for tokenization.
We present a new evaluation paradigm that focuses on the cognitive plausibility of subword tokenization.
arXiv Detail & Related papers (2023-10-20T08:25:37Z)
- mCL-NER: Cross-Lingual Named Entity Recognition via Multi-view Contrastive Learning [54.523172171533645]
Cross-lingual named entity recognition (CrossNER) faces challenges stemming from uneven performance due to the scarcity of multilingual corpora.
We propose Multi-view Contrastive Learning for Cross-lingual Named Entity Recognition (mCL-NER).
Our experiments on the XTREME benchmark, spanning 40 languages, demonstrate the superiority of mCL-NER over prior data-driven and model-based approaches.
arXiv Detail & Related papers (2023-08-17T16:02:29Z)
- Transparency Helps Reveal When Language Models Learn Meaning [71.96920839263457]
Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations, both autoregressive and masked language models learn to emulate semantic relations between expressions.
Turning to natural language, our experiments with a specific phenomenon -- referential opacity -- add to the growing body of evidence that current language models do not well-represent natural language semantics.
arXiv Detail & Related papers (2022-10-14T02:35:19Z)
- Better Than Whitespace: Information Retrieval for Languages without Custom Tokenizers [48.036317742487796]
We propose a new approach to tokenization for lexical matching retrieval algorithms.
We use the WordPiece tokenizer, which can be built automatically from unsupervised data.
Results show that the mBERT tokenizer provides strong relevance signals for retrieval "out of the box", outperforming whitespace tokenization on most languages.
arXiv Detail & Related papers (2022-10-11T14:32:46Z)
- Unsupervised Distillation of Syntactic Information from Contextualized Word Representations [62.230491683411536]
We tackle the task of unsupervised disentanglement between semantics and structure in neural language representations.
To this end, we automatically generate groups of sentences which are structurally similar but semantically different.
We demonstrate that our transformation clusters vectors in space by structural properties, rather than by lexical semantics.
arXiv Detail & Related papers (2020-10-11T15:13:18Z)