Assessing the Importance of Frequency versus Compositionality for
Subword-based Tokenization in NMT
- URL: http://arxiv.org/abs/2306.01393v3
- Date: Fri, 12 Jan 2024 12:21:27 GMT
- Title: Assessing the Importance of Frequency versus Compositionality for
Subword-based Tokenization in NMT
- Authors: Benoist Wolleb, Romain Silvestri, Giorgos Vernikos, Ljiljana Dolamic,
Andrei Popescu-Belis
- Abstract summary: Subword tokenization is the de facto standard for tokenization in neural language models and machine translation systems.
Three advantages are frequently cited in favor of subwords: shorter encoding of frequent tokens, compositionality of subwords, and ability to deal with unknown words.
We propose a tokenization approach that enables us to separate frequency from compositionality.
- Score: 7.600968522331612
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Subword tokenization is the de facto standard for tokenization in neural
language models and machine translation systems. Three advantages are
frequently cited in favor of subwords: shorter encoding of frequent tokens,
compositionality of subwords, and ability to deal with unknown words. As their
relative importance is not entirely clear yet, we propose a tokenization
approach that enables us to separate frequency (the first advantage) from
compositionality. The approach uses Huffman coding to tokenize words, by order
of frequency, using a fixed amount of symbols. Experiments with CS-DE, EN-FR
and EN-DE NMT show that frequency alone accounts for 90%-95% of the scores
reached by BPE, hence compositionality has less importance than previously
thought.
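To make the approach above concrete, here is a minimal sketch of Huffman-coded word tokenization: each word is replaced by a sequence of symbols from a fixed inventory, with more frequent words receiving shorter sequences. This is only an illustration under our own assumptions (a toy corpus, a 4-symbol inventory, and the helper name huffman_word_codes), not the authors' released implementation.

```python
# Minimal sketch (not the authors' code): n-ary Huffman coding of whole words.
# Frequent words get shorter codes over a fixed symbol inventory, keeping the
# "frequency" advantage of subwords while discarding compositionality.
import heapq
from collections import Counter
from itertools import count

def huffman_word_codes(word_freqs, num_symbols=256):
    """Map each word to a tuple of symbol ids drawn from a fixed inventory."""
    tie = count()  # tie-breaker so heapq never has to compare the dicts
    nodes = [(freq, next(tie), {w: ()}) for w, freq in word_freqs.items()]
    # Pad with zero-frequency dummies so every merge takes exactly num_symbols nodes.
    while (len(nodes) - 1) % (num_symbols - 1) != 0:
        nodes.append((0, next(tie), {}))
    heapq.heapify(nodes)
    while len(nodes) > 1:
        merged, total = {}, 0
        for symbol in range(num_symbols):
            if not nodes:
                break
            freq, _, codes = heapq.heappop(nodes)
            total += freq
            for word, suffix in codes.items():
                merged[word] = (symbol,) + suffix  # prepend this level's symbol
        heapq.heappush(nodes, (total, next(tie), merged))
    return nodes[0][2]

corpus = "the cat sat on the mat because the cat was tired".split()
codes = huffman_word_codes(Counter(corpus), num_symbols=4)
print(codes["the"], codes["tired"])  # the frequent word gets the shorter code
```

Because every word is mapped to an arbitrary symbol sequence determined only by its frequency, any morphological relationship between words is deliberately discarded, which is what lets frequency be measured in isolation from compositionality.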
Related papers
- Team Ryu's Submission to SIGMORPHON 2024 Shared Task on Subword Tokenization [3.0023392750520883]
My submission explores whether morphological segmentation methods can be used as part of subword tokenizers.
The prediction results show that morphological segmentation could be as effective as commonly used subword tokenizers.
A tokenizer with a balanced token frequency distribution tends to work better.
arXiv Detail & Related papers (2024-10-19T04:06:09Z)
- Batching BPE Tokenization Merges [55.2480439325792]
BatchBPE is an open-source, pure-Python implementation of the Byte Pair Encoding algorithm.
It can be used to train a high-quality tokenizer on a basic laptop (a minimal merge-counting sketch follows this entry).
arXiv Detail & Related papers (2024-08-05T09:37:21Z)
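Both the abstract above and the BatchBPE entry rely on Byte Pair Encoding, so a minimal merge-counting sketch may help. This is a generic illustration of BPE training on a word-frequency dictionary, not BatchBPE's actual code; all function names are our own.

```python
# Generic BPE training sketch (not BatchBPE's implementation): repeatedly merge
# the most frequent adjacent symbol pair in a word-frequency vocabulary.
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(vocab, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def train_bpe(word_freqs, num_merges):
    """Learn a list of merges from character-level words with end-of-word markers."""
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(vocab, best)
        merges.append(best)
    return merges

print(train_bpe({"lower": 5, "lowest": 3, "newer": 6}, num_merges=5))
```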
- Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge [10.721272718226848]
We propose a combined intrinsic-extrinsic evaluation framework for subword tokenization.
Intrinsic evaluation is based on our new UniMorph Labeller tool, which classifies subword tokenization as either morphological or alien.
Our empirical findings show that the accuracy of UniMorph Labeller is 98%, and that alien tokenization leads to poorer generalizations.
arXiv Detail & Related papers (2024-04-20T06:49:15Z)
- mCL-NER: Cross-Lingual Named Entity Recognition via Multi-view Contrastive Learning [54.523172171533645]
Cross-lingual named entity recognition (CrossNER) faces challenges stemming from uneven performance due to the scarcity of multilingual corpora.
We propose Multi-view Contrastive Learning for Cross-lingual Named Entity Recognition (mCL-NER).
Our experiments on the XTREME benchmark, spanning 40 languages, demonstrate the superiority of mCL-NER over prior data-driven and model-based approaches.
arXiv Detail & Related papers (2023-08-17T16:02:29Z)
- N-gram Boosting: Improving Contextual Biasing with Normalized N-gram Targets [1.9908600514057855]
We present a two-step keyword boosting mechanism that works on normalized unigrams and n-grams rather than just single tokens.
This improves our keyword recognition rate by 26% relative on our proprietary in-domain dataset and 2% on LibriSpeech.
arXiv Detail & Related papers (2023-08-04T00:23:14Z)
- Boosting word frequencies in authorship attribution [0.0]
I introduce a simple method of computing relative word frequencies for authorship attribution and similar stylometric tasks.
The notion of relevant words includes synonyms and, usually, a few dozen other words that are in some way semantically similar to the word in question.
The proposed method substantially outperforms classical most-frequent-word approaches (a sketch of the relative-frequency idea follows this entry).
arXiv Detail & Related papers (2022-11-02T17:11:35Z)
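The following sketch shows one plausible reading of the relative-frequency idea summarized above: a word's frequency is normalized by the total frequency of its relevant-word set rather than by the whole text. The relevant-word set used here is hand-picked for illustration; the paper derives such sets from semantic similarity, and this is not the author's code.

```python
# Sketch (an assumed reading of the summary above, not the paper's code): a word's
# relative frequency is computed against its set of relevant words rather than
# against the whole text.
from collections import Counter

def relative_frequency(text_tokens, word, relevant_words):
    """Frequency of `word` divided by the combined frequency of its relevant-word set."""
    counts = Counter(text_tokens)
    denom = counts[word] + sum(counts[w] for w in relevant_words)
    return counts[word] / denom if denom else 0.0

tokens = "on the table and upon the shelf on a whim".split()
# Hypothetical relevant-word set for "on"; real sets would come from semantic similarity.
print(relative_frequency(tokens, "on", {"upon"}))
```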
- A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task Learning [8.052271364177988]
Subword tokenization is a commonly used input pre-processing step in most recent NLP models.
We propose a vocabulary-free neural tokenizer by distilling segmentation information from subword tokenization.
Our tokenizer consistently improves performance on multilingual (NLI) and code-switching (sentiment analysis) tasks.
arXiv Detail & Related papers (2022-04-22T16:50:49Z)
- SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining [48.880840711568425]
We study the influence of three main factors on Chinese tokenization for pretrained language models.
We propose three kinds of tokenizers, including 1) SHUOWEN (meaning Talk Word), pronunciation-based tokenizers, and 2) JIEZI (meaning Solve Character), glyph-based tokenizers.
We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
arXiv Detail & Related papers (2021-06-01T11:20:02Z)
- Multi-view Subword Regularization [111.04350390045705]
Multi-view Subword Regularization (MVR) enforces consistency between the predictions obtained from inputs tokenized with the standard segmentation and with probabilistic segmentations.
Results on the XTREME multilingual benchmark show that MVR brings consistent improvements of up to 2.5 points over using standard segmentation algorithms.
arXiv Detail & Related papers (2021-03-15T16:07:42Z)
- Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
- Token-level Adaptive Training for Neural Machine Translation [84.69646428587548]
There exists a token imbalance phenomenon in natural language, as different tokens appear with different frequencies.
A vanilla NMT model usually adopts a trivial equal-weighted objective for target tokens of different frequencies.
Low-frequency tokens may carry critical semantic information, and translation quality suffers when they are neglected (a frequency-weighted loss sketch follows this entry).
arXiv Detail & Related papers (2020-10-09T05:55:05Z)
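To make the token-imbalance point of the last entry concrete, here is a hedged sketch of a frequency-dependent token weighting for the translation loss. The exact weighting functions proposed in the paper differ; everything below (names, the log-based weight, the toy data) is illustrative only.

```python
# Illustrative sketch (not the paper's exact weighting functions): scale each
# target token's cross-entropy term by a weight that decreases with its corpus
# frequency, so low-frequency tokens contribute more to the training objective.
import math
from collections import Counter

def frequency_weights(target_tokens, floor=1.0):
    """Assign each token a weight inversely related to its (log) frequency."""
    counts = Counter(target_tokens)
    return {tok: floor + 1.0 / math.log(1.0 + c) for tok, c in counts.items()}

def weighted_nll(token_log_probs, target_sentence, weights):
    """Weighted negative log-likelihood over one target sentence."""
    return -sum(weights.get(tok, 1.0) * lp
                for tok, lp in zip(target_sentence, token_log_probs))

corpus = "der Hund schläft der Hund bellt die seltene Lachmöwe fliegt".split()
w = frequency_weights(corpus)
# Toy log-probabilities for a three-token target; the rare token gets a larger weight.
print(weighted_nll([-0.1, -0.5, -2.0], ["der", "Hund", "Lachmöwe"], w))
```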
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.