Byte BPE Tokenization as an Inverse string Homomorphism
- URL: http://arxiv.org/abs/2412.03160v1
- Date: Wed, 04 Dec 2024 09:38:11 GMT
- Title: Byte BPE Tokenization as an Inverse string Homomorphism
- Authors: Saibo Geng, Sankalp Gambhir, Chris Wendler, Robert West,
- Abstract summary: We show that tokenization acts as an inverse homomorphism between strings and tokens.<n>This suggests that the character space of the source language and the token space of the tokenized language are homomorphic.<n>We also explore the concept of proper tokenization, which refers to an unambiguous tokenization returned from the tokenizer.
- Score: 12.885921620444272
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tokenization is an important preprocessing step in the training and inference of large language models (LLMs). While there has been extensive research on the expressive power of the neural achitectures used in LLMs, the impact of tokenization has not been well understood. In this work, we demonstrate that tokenization, irrespective of the algorithm used, acts as an inverse homomorphism between strings and tokens. This suggests that the character space of the source language and the token space of the tokenized language are homomorphic, preserving the structural properties of the source language. Additionally, we explore the concept of proper tokenization, which refers to an unambiguous tokenization returned from the tokenizer. Our analysis reveals that the expressiveness of neural architectures in recognizing context-free languages is not affected by tokenization.
Related papers
- Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning [31.632816425798108]
Tokenization is a necessary component within the current architecture of many language models.
We discuss how tokens and pretraining can act as a backdoor for bias and other unwanted content.
We relay evidence that the tokenization algorithm's objective function impacts the large language model's cognition.
arXiv Detail & Related papers (2024-12-14T18:18:52Z) - Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator [55.94334001112357]
We introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs.
We propose a retrieval-enhanced SLG approach, which incorporates external sign dictionaries to provide accurate word-level signs.
arXiv Detail & Related papers (2024-11-26T18:28:09Z) - Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement [5.223020867766102]
We investigate how different tokenization schemes impact number agreement in Spanish plurals.
We find that morphologically-aligned tokenization performs similarly to other tokenization schemes.
Our results indicate that morphological tokenization is not strictly required for performance.
arXiv Detail & Related papers (2024-03-20T17:01:56Z) - How Important Is Tokenization in French Medical Masked Language Models? [7.866517623371908]
Subword tokenization has become the prevailing standard in the field of natural language processing (NLP)
This paper seeks to delve into the complexities of subword tokenization in French biomedical domain across a variety of NLP tasks.
We introduce an original tokenization strategy that integrates morpheme-enriched word segmentation into existing tokenization methods.
arXiv Detail & Related papers (2024-02-22T23:11:08Z) - Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually-grounded text perturbation methods like typos and word order shuffling, resonating with human cognitive patterns, and enabling perturbation to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z) - Identifying and Analyzing Performance-Critical Tokens in Large Language Models [52.404072802235234]
We study how large language models learn to perform tasks from demonstrations.
Our work sheds light on how large language models learn to perform tasks from demonstrations and deepens our understanding of the roles different types of tokens play in large language models.
arXiv Detail & Related papers (2024-01-20T20:55:21Z) - Analyzing Cognitive Plausibility of Subword Tokenization [9.510439539246846]
Subword tokenization has become the de-facto standard for tokenization.
We present a new evaluation paradigm that focuses on the cognitive plausibility of subword tokenization.
arXiv Detail & Related papers (2023-10-20T08:25:37Z) - mCL-NER: Cross-Lingual Named Entity Recognition via Multi-view
Contrastive Learning [54.523172171533645]
Cross-lingual named entity recognition (CrossNER) faces challenges stemming from uneven performance due to the scarcity of multilingual corpora.
We propose Multi-view Contrastive Learning for Cross-lingual Named Entity Recognition (mCL-NER)
Our experiments on the XTREME benchmark, spanning 40 languages, demonstrate the superiority of mCL-NER over prior data-driven and model-based approaches.
arXiv Detail & Related papers (2023-08-17T16:02:29Z) - Transparency Helps Reveal When Language Models Learn Meaning [71.96920839263457]
Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations, both autoregressive and masked language models learn to emulate semantic relations between expressions.
Turning to natural language, our experiments with a specific phenomenon -- referential opacity -- add to the growing body of evidence that current language models do not well-represent natural language semantics.
arXiv Detail & Related papers (2022-10-14T02:35:19Z) - Better Than Whitespace: Information Retrieval for Languages without
Custom Tokenizers [48.036317742487796]
We propose a new approach to tokenization for lexical matching retrieval algorithms.
We use the WordPiece tokenizer, which can be built automatically from unsupervised data.
Results show that the mBERT tokenizer provides strong relevance signals for retrieval "out of the box", outperforming whitespace tokenization on most languages.
arXiv Detail & Related papers (2022-10-11T14:32:46Z) - Unsupervised Distillation of Syntactic Information from Contextualized
Word Representations [62.230491683411536]
We tackle the task of unsupervised disentanglement between semantics and structure in neural language representations.
To this end, we automatically generate groups of sentences which are structurally similar but semantically different.
We demonstrate that our transformation clusters vectors in space by structural properties, rather than by lexical semantics.
arXiv Detail & Related papers (2020-10-11T15:13:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.