An Empirical Study of Tokenization Strategies for Various Korean NLP
Tasks
- URL: http://arxiv.org/abs/2010.02534v1
- Date: Tue, 6 Oct 2020 07:20:41 GMT
- Title: An Empirical Study of Tokenization Strategies for Various Korean NLP
Tasks
- Authors: Kyubyong Park, Joohong Lee, Seongbo Jang, Dawoon Jung
- Abstract summary: Byte Pair Encoding (BPE) has been considered the de facto standard tokenization method.
It still remains unclear whether BPE works best across all languages and tasks.
Experimental results demonstrate that a hybrid approach of morphological segmentation followed by BPE works best in Korean to/from English machine translation.
- Score: 4.207877448862984
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Typically, tokenization is the very first step in most text processing works.
As a token serves as an atomic unit that embeds the contextual information of
text, how to define a token plays a decisive role in the performance of a
model. Even though Byte Pair Encoding (BPE) has been considered the de facto
standard tokenization method due to its simplicity and universality, it still
remains unclear whether BPE works best across all languages and tasks. In this
paper, we test several tokenization strategies in order to answer our primary
research question, that is, "What is the best tokenization strategy for Korean
NLP tasks?" Experimental results demonstrate that a hybrid approach of
morphological segmentation followed by BPE works best in Korean to/from English
machine translation and natural language understanding tasks such as KorNLI,
KorSTS, NSMC, and PAWS-X. As an exception, for KorQuAD, the Korean extension of
SQuAD, BPE segmentation turns out to be the most effective.
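To make the winning recipe concrete, the following is a minimal sketch of "morphological segmentation followed by BPE", assuming konlpy's MeCab wrapper (backed by MeCab-ko) and Google's SentencePiece; the file names, vocabulary size, and preprocessing details are illustrative and may differ from the paper's exact setup.

```python
# Minimal sketch: Korean morphological segmentation followed by BPE.
# Assumes konlpy (with MeCab-ko installed) and sentencepiece are available;
# file names and the vocabulary size are illustrative, not the paper's settings.
from konlpy.tag import Mecab
import sentencepiece as spm

mecab = Mecab()

def morph_segment(line: str) -> str:
    # Split a raw sentence into morphemes, re-joined with spaces so that
    # subsequent BPE merges do not cross morpheme boundaries.
    return " ".join(mecab.morphs(line.strip()))

# 1) Pre-segment the training corpus at the morpheme level.
with open("corpus.ko.txt", encoding="utf-8") as fin, \
     open("corpus.ko.morph.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(morph_segment(line) + "\n")

# 2) Train a BPE model on the morpheme-segmented text.
spm.SentencePieceTrainer.train(
    input="corpus.ko.morph.txt",
    model_prefix="ko_morph_bpe",
    vocab_size=32000,  # illustrative
    model_type="bpe",
)

# 3) Tokenize new text with the same two-step pipeline.
sp = spm.SentencePieceProcessor(model_file="ko_morph_bpe.model")
print(sp.encode(morph_segment("토크나이제이션은 텍스트 처리의 첫 단계이다."), out_type=str))
```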
Related papers
- Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization [62.35048154917945]
Tokenization is the first -- and often least scrutinized -- step of most NLP pipelines.
Standard algorithms for learning tokenizers rely on frequency-based objectives.
We introduce Parity-aware Byte Pair Encoding.
We find empirically that Parity-aware BPE leads to more equitable token counts across languages.
arXiv Detail & Related papers (2025-08-06T18:14:43Z)
- MorphTok: Morphologically Grounded Tokenization for Indian Languages [23.58043476541051]
Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs).
We propose morphology-aware segmentation as a pre-tokenization step prior to applying subword tokenization.
We also introduce Constrained BPE, an extension to the traditional BPE algorithm that incorporates script-specific constraints.
arXiv Detail & Related papers (2025-04-14T15:44:45Z)
- SuperBPE: Space Travel for Language Models [112.64910939119056]
We introduce a "superword" tokenizer, SuperBPE, which incorporates a simple pretokenization curriculum into the byte-pair encoding (BPE) algorithm.
SuperBPE encodes a fixed piece of text with up to 33% fewer tokens than BPE on average.
Our model trained with SuperBPE achieves an average +4.0% absolute improvement over the BPE baseline across 30 downstream tasks.
arXiv Detail & Related papers (2025-03-17T17:53:23Z)
- When Every Token Counts: Optimal Segmentation for Low-Resource Language Models [0.0]
We show that an optimal Byte-Pair Encoding (BPE) configuration significantly reduces token count compared to greedy segmentation.
Our findings suggest that compression-optimized tokenization strategies could provide substantial advantages for multilingual and low-resource language applications.
arXiv Detail & Related papers (2024-12-09T19:11:54Z)
- Batching BPE Tokenization Merges [55.2480439325792]
BatchBPE is an open-source, pure-Python implementation of the Byte Pair Encoding algorithm.
It can be used to train a high-quality tokenizer on a basic laptop.
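As an illustration of the algorithm such an implementation provides, here is a minimal pure-Python sketch of the core BPE merge loop; it is not the BatchBPE codebase, and the toy corpus and merge count are made up.

```python
# Minimal pure-Python sketch of the BPE merge loop (illustrative only).
from collections import Counter

def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    word_freqs = Counter(corpus)               # word -> frequency
    words = {w: tuple(w) for w in word_freqs}  # word -> current symbol sequence
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for w, freq in word_freqs.items():
            syms = words[w]
            for pair in zip(syms, syms[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the chosen merge to every word.
        for w, syms in words.items():
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                    out.append(syms[i] + syms[i + 1])
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            words[w] = tuple(out)
    return merges

print(train_bpe(["low", "low", "lower", "newest", "widest"], num_merges=10))
```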
arXiv Detail & Related papers (2024-08-05T09:37:21Z)
- Deep Exploration of Cross-Lingual Zero-Shot Generalization in Instruction Tuning [47.75550640881761]
We explore cross-lingual generalization in instruction tuning by applying it to non-English tasks.
We design cross-lingual templates to mitigate discrepancies in the language and instruction format of templates between training and inference.
Our experiments reveal consistent improvements through cross-lingual generalization in both English and Korean.
arXiv Detail & Related papers (2024-06-13T04:10:17Z)
- Tokenization Is More Than Compression [14.939912120571728]
Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression.
We introduce PathPiece, a new tokenizer that segments a document's text into the minimum number of tokens for a given vocabulary.
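The minimum-token objective can be made concrete with a short dynamic program; the sketch below illustrates the objective only and is not the authors' PathPiece implementation (the toy vocabulary, maximum token length, and lack of tie-breaking are assumptions).

```python
# Illustrative dynamic program: segment text into the fewest vocabulary tokens.
def min_token_segment(text, vocab, max_len=16):
    n = len(text)
    INF = float("inf")
    best = [INF] * (n + 1)   # best[i] = fewest tokens covering text[:i]
    back = [-1] * (n + 1)    # back[i] = start index of the last token
    best[0] = 0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            if best[j] + 1 < best[i] and text[j:i] in vocab:
                best[i], back[i] = best[j] + 1, j
    if best[n] == INF:
        return None          # the vocabulary cannot cover this text
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

vocab = {"un", "believ", "able", "u", "n", "b", "e", "l", "i", "v", "a"}
print(min_token_segment("unbelievable", vocab))  # ['un', 'believ', 'able']
```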
arXiv Detail & Related papers (2024-02-28T14:52:15Z)
- ToPro: Token-Level Prompt Decomposition for Cross-Lingual Sequence Labeling Tasks [12.700783525558721]
The ToPro method decomposes an input sentence into single tokens and applies one prompt template to each token.
Our experiments on multilingual NER and POS tagging datasets demonstrate that ToPro-based fine-tuning outperforms Vanilla fine-tuning and Prompt-Tuning in zero-shot cross-lingual transfer.
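To make the decomposition concrete, here is a toy sketch that turns one sentence into per-token prompts; the template wording and the whitespace "tokens" are illustrative assumptions, and the actual method pairs such prompts with a pretrained model that predicts each token's label.

```python
# Toy illustration of token-level prompt decomposition for sequence labeling.
def per_token_prompts(sentence: str, template: str) -> list[str]:
    # One prompt per (whitespace-delimited) token of the input sentence.
    return [template.format(sentence=sentence, token=tok) for tok in sentence.split()]

prompts = per_token_prompts(
    "Seoul is the capital of Korea",
    template='In the sentence "{sentence}", the entity type of "{token}" is [MASK].',
)
for p in prompts:
    print(p)
```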
arXiv Detail & Related papers (2024-01-29T21:44:27Z)
- Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond [66.07002187192448]
We propose task-adaptive tokenization as a way to adapt the generation pipeline to the specifics of a downstream task.
We introduce a strategy for building a specialized vocabulary and introduce a vocabulary merging protocol.
We find that our task-adaptive tokenization approach brings a significant improvement in generation performance while using up to 60% fewer tokens.
arXiv Detail & Related papers (2023-10-09T00:20:59Z)
- VECO 2.0: Cross-lingual Language Model Pre-training with Multi-granularity Contrastive Learning [56.47303426167584]
We propose a cross-lingual pre-trained model, VECO 2.0, based on contrastive learning with multi-granularity alignments.
Specifically, sequence-to-sequence alignment is induced to maximize the similarity of parallel pairs and minimize that of non-parallel pairs.
Token-to-token alignment is integrated to bridge the gap between synonymous tokens, mined via a thesaurus dictionary, and the other unpaired tokens in a bilingual instance.
arXiv Detail & Related papers (2023-04-17T12:23:41Z)
- TaCL: Improving BERT Pre-training with Token-aware Contrastive Learning [19.682704309037653]
Masked language models (MLMs) have revolutionized the field of Natural Language Understanding.
We propose TaCL (Token-aware Contrastive Learning), a novel continual pre-training approach that encourages BERT to learn an isotropic and discriminative distribution of token representations.
arXiv Detail & Related papers (2021-11-07T22:54:23Z)
- KLUE: Korean Language Understanding Evaluation [43.94952771238633]
We introduce Korean Language Understanding Evaluation (KLUE) benchmark.
KLUE is a collection of 8 Korean natural language understanding (NLU) tasks.
We build all of the tasks from scratch from diverse source corpora while respecting copyrights.
arXiv Detail & Related papers (2021-05-20T11:40:30Z)
- MC-BERT: Efficient Language Pre-Training via a Meta Controller [96.68140474547602]
Large-scale pre-training is computationally expensive.
ELECTRA, an early attempt to accelerate pre-training, trains a discriminative model that predicts whether each input token was replaced by a generator.
We propose a novel meta-learning framework, MC-BERT, to achieve better efficiency and effectiveness.
arXiv Detail & Related papers (2020-06-10T09:22:19Z)
- Byte Pair Encoding is Suboptimal for Language Model Pretraining [49.30780227162387]
We analyze differences between unigram LM tokenization and byte-pair encoding (BPE).
We find that the unigram LM tokenization method matches or outperforms BPE across downstream tasks and two languages.
We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
arXiv Detail & Related papers (2020-04-07T21:21:06Z)
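A quick way to compare the two methods on one's own corpus is to train both tokenizers with SentencePiece; in the hedged sketch below, the corpus file and vocabulary size are placeholders.

```python
# Train a unigram-LM tokenizer and a BPE tokenizer on the same corpus and
# compare their segmentations. File names and vocab_size are illustrative.
import sentencepiece as spm

for model_type in ("unigram", "bpe"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",
        model_prefix=f"tok_{model_type}",
        vocab_size=16000,
        model_type=model_type,
    )

uni = spm.SentencePieceProcessor(model_file="tok_unigram.model")
bpe = spm.SentencePieceProcessor(model_file="tok_bpe.model")
text = "Tokenization is the very first step in most text processing pipelines."
print("unigram:", uni.encode(text, out_type=str))
print("bpe:    ", bpe.encode(text, out_type=str))
```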
This list is automatically generated from the titles and abstracts of the papers on this site.