How do different tokenizers perform on downstream tasks in scriptio
continua languages?: A case study in Japanese
- URL: http://arxiv.org/abs/2306.09572v1
- Date: Fri, 16 Jun 2023 01:22:32 GMT
- Title: How do different tokenizers perform on downstream tasks in scriptio
continua languages?: A case study in Japanese
- Authors: Takuro Fujii, Koki Shibata, Atsuki Yamaguchi, Terufumi Morishita,
Yasuhiro Sogawa
- Abstract summary: This paper investigates the effect of tokenizers on the downstream performance of pretrained language models (PLMs) in scriptio continua languages where no explicit spaces exist between words.
The tokenizer for such languages often consists of a morphological analyzer and a subword tokenizer, requiring us to conduct a comprehensive study of all possible pairs.
We train extensive sets of tokenizers, build a PLM using each, and measure the downstream performance on a wide range of tasks.
- Score: 4.259342268820457
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper investigates the effect of tokenizers on the downstream
performance of pretrained language models (PLMs) in scriptio continua languages
where no explicit spaces exist between words, using Japanese as a case study.
The tokenizer for such languages often consists of a morphological analyzer and
a subword tokenizer, requiring us to conduct a comprehensive study of all
possible pairs. However, previous studies lack this comprehensiveness. We
therefore train extensive sets of tokenizers, build a PLM using each, and
measure the downstream performance on a wide range of tasks. Our results
demonstrate that each downstream task has a different optimal morphological
analyzer, and that it is better to use Byte-Pair-Encoding or Unigram rather
than WordPiece as a subword tokenizer, regardless of the type of task.
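To make the pipeline concrete, below is a minimal sketch of the two-stage tokenizer the paper studies: a morphological analyzer first inserts word boundaries into raw Japanese text, and a subword tokenizer is then trained on the segmented output. The specific tools (fugashi/MeCab and SentencePiece Unigram) and all hyperparameters are illustrative assumptions, not the paper's exact setup, which compares several morphological analyzers and the BPE, Unigram, and WordPiece subword algorithms.

```python
# Sketch of the two-stage tokenizer pipeline: morphological analysis,
# then subword tokenization trained on the pre-segmented text.
# fugashi (MeCab) and a Unigram SentencePiece model are illustrative choices.
import io

import fugashi               # pip install fugashi unidic-lite
import sentencepiece as spm  # pip install sentencepiece

tagger = fugashi.Tagger()

def pre_segment(text: str) -> str:
    """Insert spaces at the morpheme boundaries found by the analyzer."""
    return " ".join(word.surface for word in tagger(text))

# Toy corpus standing in for a pretraining corpus.
corpus = [
    "吾輩は猫である。名前はまだ無い。",
    "国境の長いトンネルを抜けると雪国であった。",
    "メロスは激怒した。",
]
segmented = [pre_segment(line) for line in corpus]

# Train a tiny Unigram subword model on the pre-segmented text.
model = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=iter(segmented),
    model_writer=model,
    vocab_size=80,             # tiny; real setups use tens of thousands
    model_type="unigram",      # "bpe" would give the BPE variant
    character_coverage=1.0,
    hard_vocab_limit=False,    # let the toy corpus cap the vocabulary
)
sp = spm.SentencePieceProcessor(model_proto=model.getvalue())

print(sp.encode(pre_segment("吾輩は猫である。"), out_type=str))
```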
Related papers
- Egalitarian Language Representation in Language Models: It All Begins with Tokenizers [0.0]
We show that not all tokenizers offer fair representation for complex script languages such as Tamil, Sinhala, and Hindi.
We introduce an improvement to the Byte Pair Encoding algorithm by incorporating graphemes, which we term Grapheme Pair Encoding.
Our experiments show that grapheme-based character extraction outperforms byte-level tokenizers for complex scripts.
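As a rough illustration of why grapheme-aware tokenization matters for such scripts, the snippet below contrasts grapheme clusters, code points, and UTF-8 bytes for a Tamil word; it uses the third-party `regex` module and is not the paper's Grapheme Pair implementation.

```python
# Grapheme clusters vs. code points vs. bytes for a complex script.
# Uses the third-party `regex` module (pip install regex), which supports \X
# for extended grapheme clusters.
import regex

text = "தமிழ்"  # "Tamil" written in Tamil script

graphemes = regex.findall(r"\X", text)   # user-perceived characters
codepoints = list(text)                  # Unicode code points
raw_bytes = text.encode("utf-8")         # what a byte-level tokenizer sees

print(graphemes)        # ['த', 'மி', 'ழ்'] — 3 grapheme clusters
print(len(codepoints))  # 5 code points
print(len(raw_bytes))   # 15 UTF-8 bytes
```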
arXiv Detail & Related papers (2024-09-17T19:05:37Z)
- Identifying and Analyzing Task-Encoding Tokens in Large Language Models [55.03191279766383]
In this paper, we identify and analyze task-encoding tokens on whose representations the task performance depends.
We show that template and stopword tokens are the most prone to be task-encoding.
Our work sheds light on how large language models (LLMs) learn to perform a task from demonstrations, deepens our understanding of the varied roles different types of tokens play in LLMs, and provides insights for avoiding instability from improperly utilizing task-encoding tokens.
arXiv Detail & Related papers (2024-01-20T20:55:21Z)
- Learn Your Tokens: Word-Pooled Tokenization for Language Modeling [11.40976202290724]
Language models typically tokenize text into subwords, using a deterministic, hand-engineered method of combining tokens into longer strings.
Recent attempts to compress and limit context lengths with fixed-size convolutions are helpful but completely ignore word boundaries.
This paper considers an alternative 'learn your word' scheme which utilizes the word boundary to pool bytes/characters into word representations.
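A hypothetical sketch of that boundary-based pooling idea: character embeddings are mean-pooled within each word to yield one representation per word. The embedding table, dimensions, and pooling choice are placeholders, not the paper's architecture.

```python
# Pool character embeddings into word representations at word boundaries.
# Generic illustration only; not the paper's model or hyperparameters.
import numpy as np

EMB_DIM = 16
rng = np.random.default_rng(0)
# Toy character embedding table covering printable ASCII.
char_emb = {chr(c): rng.normal(size=EMB_DIM) for c in range(32, 127)}

def word_pooled(text: str) -> np.ndarray:
    """Return one vector per word by mean-pooling its character embeddings."""
    words = text.split()  # word boundaries; scriptio continua would need a segmenter
    pooled = [np.mean([char_emb[ch] for ch in w], axis=0) for w in words]
    return np.stack(pooled)

reps = word_pooled("learn your tokens")
print(reps.shape)  # (3, 16): one pooled vector per word
```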
arXiv Detail & Related papers (2023-10-17T23:34:39Z)
- Decomposed Prompting for Machine Translation Between Related Languages
using Large Language Models [55.35106713257871]
We introduce DecoMT, a novel approach of few-shot prompting that decomposes the translation process into a sequence of word chunk translations.
We show that DecoMT outperforms the strong few-shot prompting BLOOM model with an average improvement of 8 chrF++ points across the examined languages.
arXiv Detail & Related papers (2023-05-22T14:52:47Z)
- Better Than Whitespace: Information Retrieval for Languages without
Custom Tokenizers [48.036317742487796]
We propose a new approach to tokenization for lexical matching retrieval algorithms.
We use the WordPiece tokenizer, which can be built automatically from unsupervised data.
Results show that the mBERT tokenizer provides strong relevance signals for retrieval "out of the box", outperforming whitespace tokenization on most languages.
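A small sketch of that setup under assumed tooling: documents and queries are tokenized with the mBERT WordPiece tokenizer (via `transformers`) and scored with BM25 (via `rank_bm25`) instead of whitespace tokens; the entry does not specify the paper's actual retrieval stack.

```python
# Lexical retrieval (BM25) over mBERT WordPiece tokens instead of whitespace tokens.
# Library choices are illustrative assumptions, not the paper's exact setup.
from rank_bm25 import BM25Okapi          # pip install rank_bm25
from transformers import AutoTokenizer   # pip install transformers

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

docs = [
    "ニューラル検索は近年注目されている。",  # no whitespace between words
    "東京は日本の首都である。",
    "Tokenization matters for retrieval.",
]
doc_tokens = [tok.tokenize(d) for d in docs]

bm25 = BM25Okapi(doc_tokens)
query_tokens = tok.tokenize("日本の首都はどこですか")
print(bm25.get_scores(query_tokens))  # highest score expected for the Tokyo document
```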
arXiv Detail & Related papers (2022-10-11T14:32:46Z)
- A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task
Learning [8.052271364177988]
Subword tokenization is a commonly used input pre-processing step in most recent NLP models.
We propose a vocabulary-free neural tokenizer by distilling segmentation information from subword tokenization.
Our tokenizer consistently improves performance on multilingual (NLI) and code-switching (sentiment analysis) tasks.
arXiv Detail & Related papers (2022-04-22T16:50:49Z)
- Breaking Character: Are Subwords Good Enough for MRLs After All? [36.11778282905458]
We pretrain a BERT-style language model over character sequences instead of word pieces.
We compare the resulting model, dubbed TavBERT, against contemporary PLMs based on subwords for three highly complex and ambiguous morphologically rich languages (MRLs).
Our results show, for all tested languages, that while TavBERT obtains mild improvements on surface-level tasks, subword-based PLMs achieve significantly higher performance on semantic tasks.
arXiv Detail & Related papers (2022-04-10T18:54:43Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet
Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
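As a generic stand-in for token merging before topic modeling (not the paper's method or evaluation metric), the sketch below merges frequent collocations with gensim's `Phrases` and trains a small LDA model on the merged tokens.

```python
# Merge frequent collocations into single tokens before LDA, using gensim
# (pip install gensim). Generic illustration only.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Phrases

texts = [
    ["new", "york", "city", "traffic"],
    ["new", "york", "subway", "delays"],
    ["machine", "learning", "models"],
    ["machine", "learning", "research"],
] * 5  # repeated so the bigram statistics clear min_count and threshold

# Detect and merge collocations such as "new_york" and "machine_learning".
bigrams = Phrases(texts, min_count=5, threshold=0.3)
merged = [bigrams[t] for t in texts]

dictionary = Dictionary(merged)
corpus = [dictionary.doc2bow(t) for t in merged]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=0)
print(lda.print_topics())  # topic keys now include merged tokens like "new_york"
```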
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- Zero-Shot Cross-lingual Semantic Parsing [56.95036511882921]
We study cross-lingual semantic parsing as a zero-shot problem without parallel data for 7 test languages.
We propose a multi-task encoder-decoder model to transfer parsing knowledge to additional languages using only English-Logical form paired data.
Our system frames zero-shot parsing as a latent-space alignment problem and finds that pre-trained models can be improved to generate logical forms with minimal cross-lingual transfer penalty.
arXiv Detail & Related papers (2021-04-15T16:08:43Z)
- How to Probe Sentence Embeddings in Low-Resource Languages: On
Structural Design Choices for Probing Task Evaluation [82.96358326053115]
We investigate sensitivity of probing task results to structural design choices.
We probe embeddings in a multilingual setup with design choices that lie in a 'stable region', as we identify for English.
We find that results on English do not transfer to other languages.
arXiv Detail & Related papers (2020-06-16T12:37:50Z)