Improving Tokenisation by Alternative Treatment of Spaces
- URL: http://arxiv.org/abs/2204.04058v1
- Date: Fri, 8 Apr 2022 13:22:30 GMT
- Title: Improving Tokenisation by Alternative Treatment of Spaces
- Authors: Edward Gow-Smith, Harish Tayyar Madabushi, Carolina Scarton and Aline
Villavicencio
- Abstract summary: We experiment with an alternative tokenisation approach where spaces are always treated as individual tokens.
We find that our modified algorithms lead to improved performance on downstream NLP tasks.
- Score: 7.596737214110957
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tokenisation is the first step in almost all NLP tasks, and state-of-the-art
transformer-based language models all use subword tokenisation algorithms to
process input text. Existing algorithms have problems, often producing
tokenisations of limited linguistic validity, and representing equivalent
strings differently depending on their position within a word. We hypothesise
that these problems hinder the ability of transformer-based models to handle
complex words, and suggest that these problems are a result of allowing tokens
to include spaces. We thus experiment with an alternative tokenisation approach
where spaces are always treated as individual tokens. Specifically, we apply
this modification to the BPE and Unigram algorithms. We find that our modified
algorithms lead to improved performance on downstream NLP tasks that involve
handling complex words, whilst having no detrimental effect on performance in
general natural language understanding tasks. Intrinsically, we find our
modified algorithms give more morphologically correct tokenisations, in
particular when handling prefixes. Given the results of our experiments, we
advocate for always treating spaces as individual tokens as an improved
tokenisation method.
Related papers
- Tokenization as Finite-State Transduction [24.19959327497118]
We introduce a finite-state framework which can efficiently encode all possible tokenizations of a regular language.
We show that Byte-Pair.
Match (BPE) and MaxPiece (WordPiece) fit within this framework.
An application of this is to guided generation, where the outputs of a language model are constrained to match some pattern.
arXiv Detail & Related papers (2024-10-21T07:10:07Z) - SEP: Self-Enhanced Prompt Tuning for Visual-Language Model [93.94454894142413]
We introduce a novel approach named Self-Enhanced Prompt Tuning (SEP)
SEP explicitly incorporates discriminative prior knowledge to enhance both textual-level and visual-level embeddings.
Comprehensive evaluations across various benchmarks and tasks confirm SEP's efficacy in prompt tuning.
arXiv Detail & Related papers (2024-05-24T13:35:56Z) - Constructing a BPE Tokenization DFA [0.0]
Many natural language processing systems operate over tokenizations of text to address the open-vocabulary problem.
We give and analyze an algorithm for the efficient construction of deterministic finite automata designed to operate directly on tokenizations produced by the popular byte pair encoding technique.
arXiv Detail & Related papers (2024-05-13T11:59:24Z) - Identifying and Analyzing Task-Encoding Tokens in Large Language Models [55.03191279766383]
In this paper, we identify and analyze task-encoding tokens on whose representations the task performance depends.
We show that template and stopword tokens are the most prone to be task-encoding.
Our work sheds light on how large language models (LLMs) learn to perform a task from demonstrations, deepens our understanding of the varied roles different types of tokens play in LLMs, and provides insights for avoiding instability from improperly utilizing task-encoding tokens.
arXiv Detail & Related papers (2024-01-20T20:55:21Z) - Word Boundary Information Isn't Useful for Encoder Language Models [8.1305024841559]
We train transformer encoders across four different training scales, and investigate several alternative approaches to including word boundary information.
We find no substantial improvements from our alternative approaches, suggesting that modifying tokenisers to remove word boundary information isn't leading to a loss of useful information.
arXiv Detail & Related papers (2024-01-15T19:21:08Z) - Analyzing Cognitive Plausibility of Subword Tokenization [9.510439539246846]
Subword tokenization has become the de-facto standard for tokenization.
We present a new evaluation paradigm that focuses on the cognitive plausibility of subword tokenization.
arXiv Detail & Related papers (2023-10-20T08:25:37Z) - Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy
in Mental Health and Beyond [66.07002187192448]
We propose task-adaptive tokenization as a way to adapt the generation pipeline to the specifics of a downstream task.
We introduce a strategy for building a specialized vocabulary and introduce a vocabulary merging protocol.
We find that our task-adaptive tokenization approach brings a significant improvement in generation performance while using up to 60% fewer tokens.
arXiv Detail & Related papers (2023-10-09T00:20:59Z) - Linear-Time Modeling of Linguistic Structure: An Order-Theoretic
Perspective [97.57162770792182]
Tasks that model the relation between pairs of tokens in a string are a vital part of understanding natural language.
We show that these exhaustive comparisons can be avoided, and, moreover, the complexity can be reduced to linear by casting the relation between tokens as a partial order over the string.
Our method predicts real numbers for each token in a string in parallel and sorts the tokens accordingly, resulting in total orders of the tokens in the string.
arXiv Detail & Related papers (2023-05-24T11:47:35Z) - Better Than Whitespace: Information Retrieval for Languages without
Custom Tokenizers [48.036317742487796]
We propose a new approach to tokenization for lexical matching retrieval algorithms.
We use the WordPiece tokenizer, which can be built automatically from unsupervised data.
Results show that the mBERT tokenizer provides strong relevance signals for retrieval "out of the box", outperforming whitespace tokenization on most languages.
arXiv Detail & Related papers (2022-10-11T14:32:46Z) - Composable Text Controls in Latent Space with ODEs [97.12426987887021]
This paper proposes a new efficient approach for composable text operations in the compact latent space of text.
By connecting pretrained LMs to the latent space through efficient adaption, we then decode the sampled vectors into desired text sequences.
Experiments show that composing those operators within our approach manages to generate or edit high-quality text.
arXiv Detail & Related papers (2022-08-01T06:51:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.