Analyzing Cognitive Plausibility of Subword Tokenization
- Date: Fri, 20 Oct 2023 08:25:37 GMT
- Title: Analyzing Cognitive Plausibility of Subword Tokenization
- Authors: Lisa Beinborn and Yuval Pinter
- Abstract summary: Subword tokenization has become the de-facto standard for tokenization.
We present a new evaluation paradigm that focuses on the cognitive plausibility of subword tokenization.
- Score: 9.510439539246846
- Abstract: Subword tokenization has become the de-facto standard for tokenization,
although comparative evaluations of subword vocabulary quality across languages
are scarce. Existing evaluation studies focus on the effect of a tokenization
algorithm on the performance in downstream tasks, or on engineering criteria
such as the compression rate. We present a new evaluation paradigm that focuses
on the cognitive plausibility of subword tokenization. We analyze the
correlation of the tokenizer output with the response time and accuracy of
human performance on a lexical decision task. We compare three tokenization
algorithms across several languages and vocabulary sizes. Our results indicate
that the UnigramLM algorithm yields less cognitively plausible tokenization
behavior and a worse coverage of derivational morphemes, in contrast with prior
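
The correlation analysis described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy vocabulary, the invented response times, the choice of the number of subword pieces as the tokenizer signal, and the greedy longest-match splitter, which stands in for a trained tokenizer.

```python
from math import sqrt

# Hypothetical toy vocabulary (an assumption, not from the paper).
TOY_VOCAB = {"un", "break", "able", "walk", "ing", "s", "cat"}

# Invented lexical-decision response times in milliseconds (illustration only).
TOY_RT_MS = {"cat": 520.0, "cats": 560.0, "walking": 610.0, "unbreakable": 720.0}


def greedy_split(word, vocab):
    """Greedy longest-match-first segmentation.

    Characters not covered by the vocabulary become single-character pieces.
    """
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate substring first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts here: emit one character.
            pieces.append(word[i])
            i += 1
    return pieces


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


words = sorted(TOY_RT_MS)
n_pieces = [len(greedy_split(w, TOY_VOCAB)) for w in words]
response_times = [TOY_RT_MS[w] for w in words]
r = pearson(n_pieces, response_times)
print(f"pieces per word: {dict(zip(words, n_pieces))}")
print(f"Pearson r between piece count and response time: {r:.3f}")
```

With real data, the splitter would be replaced by each tokenizer under comparison and the toy dictionary by measured response times and accuracies, so that the resulting coefficients can be compared across algorithms, languages, and vocabulary sizes.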
Related papers
- Byte BPE Tokenization as an Inverse string Homomorphism [12.885921620444272]
We show that tokenization acts as an inverse homomorphism between strings and tokens.
This suggests that the character space of the source language and the token space of the tokenized language are homomorphic.
We also explore the concept of proper tokenization, which refers to an unambiguous tokenization returned from the tokenizer.
arXiv Detail & Related papers (2024-12-04T09:38:11Z)
- STAB: Speech Tokenizer Assessment Benchmark [57.45234921100835]
Representing speech as discrete tokens provides a framework for transforming speech into a format that closely resembles text.
We present STAB (Speech Tokenizer Assessment Benchmark), a systematic evaluation framework designed to assess speech tokenizers comprehensively.
We evaluate the STAB metrics and correlate this with downstream task performance across a range of speech tasks and tokenizer choices.
arXiv Detail & Related papers (2024-09-04T02:20:59Z)
- Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge [10.721272718226848]
We propose a combined intrinsic-extrinsic evaluation framework for subword tokenization.
Intrinsic evaluation is based on our new UniMorph Labeller tool that classifies subword tokenization as either morphological or alien.
Our empirical findings show that the accuracy of UniMorph Labeller is 98%, and that alien tokenization leads to poorer generalizations.
arXiv Detail & Related papers (2024-04-20T06:49:15Z)
- Revisiting subword tokenization: A case study on affixal negation in large language models [57.75279238091522]
We measure the impact of affixal negation on modern English large language models (LLMs).
We conduct experiments using LLMs with different subword tokenization methods.
We show that models can, on the whole, reliably recognize the meaning of affixal negation.
arXiv Detail & Related papers (2024-04-03T03:14:27Z)
- Rethinking Tokenization: Crafting Better Tokenizers for Large Language Models [0.0]
Tokenization significantly influences the performance of language models (LMs).
This paper traces the evolution of tokenizers from word-level to subword-level, analyzing how they balance tokens and types.
The Less-is-Better (LiB) model could be a new approach for LLM tokenizers.
arXiv Detail & Related papers (2024-03-01T10:03:07Z)
- Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond [66.07002187192448]
We propose task-adaptive tokenization as a way to adapt the generation pipeline to the specifics of a downstream task.
We introduce a strategy for building a specialized vocabulary and introduce a vocabulary merging protocol.
We find that our task-adaptive tokenization approach brings a significant improvement in generation performance while using up to 60% fewer tokens.
arXiv Detail & Related papers (2023-10-09T00:20:59Z)
- Towards Unsupervised Recognition of Token-level Semantic Differences in Related Documents [61.63208012250885]
We formulate recognizing semantic differences as a token-level regression task.
We study three unsupervised approaches that rely on a masked language model.
Our results show that an approach based on word alignment and sentence-level contrastive learning has a robust correlation to gold labels.
arXiv Detail & Related papers (2023-05-22T17:58:04Z)
- Towards preserving word order importance through Forced Invalidation [80.33036864442182]
We show that pre-trained language models are insensitive to word order.
We propose Forced Invalidation to help preserve the importance of word order.
Our experiments demonstrate that Forced Invalidation significantly improves the sensitivity of the models to word order.
arXiv Detail & Related papers (2023-04-11T13:42:10Z)
- Better Than Whitespace: Information Retrieval for Languages without Custom Tokenizers [48.036317742487796]
We propose a new approach to tokenization for lexical matching retrieval algorithms.
We use the WordPiece tokenizer, which can be built automatically from unsupervised data.
Results show that the mBERT tokenizer provides strong relevance signals for retrieval "out of the box", outperforming whitespace tokenization on most languages.
arXiv Detail & Related papers (2022-10-11T14:32:46Z)
- A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task Learning [8.052271364177988]
Subword tokenization is a commonly used input pre-processing step in most recent NLP models.
We propose a vocabulary-free neural tokenizer by distilling segmentation information from subword tokenization.
Our tokenizer consistently improves performance on multilingual (NLI) and code-switching (sentiment analysis) tasks.
arXiv Detail & Related papers (2022-04-22T16:50:49Z)
- Improving Tokenisation by Alternative Treatment of Spaces [7.596737214110957]
We experiment with an alternative tokenisation approach where spaces are always treated as individual tokens.
We find that our modified algorithms lead to improved performance on downstream NLP tasks.
arXiv Detail & Related papers (2022-04-08T13:22:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.