An Analysis of BPE Vocabulary Trimming in Neural Machine Translation
- URL: http://arxiv.org/abs/2404.00397v1
- Date: Sat, 30 Mar 2024 15:29:49 GMT
- Title: An Analysis of BPE Vocabulary Trimming in Neural Machine Translation
- Authors: Marco Cognetta, Tatsuya Hiraoka, Naoaki Okazaki, Rico Sennrich, Yuval Pinter
- Abstract summary: Vocabulary trimming is a postprocessing step that replaces rare subwords with their component subwords.
We show that vocabulary trimming fails to improve performance and is even prone to incurring heavy degradation.
- Score: 56.383793805299234
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We explore threshold vocabulary trimming in Byte-Pair Encoding subword tokenization, a postprocessing step that replaces rare subwords with their component subwords. The technique is available in popular tokenization libraries but has not been subjected to rigorous scientific scrutiny. While the removal of rare subwords is suggested as best practice in machine translation implementations, both as a means to reduce model size and to improve model performance through robustness, our experiments indicate that, across a large space of hyperparameter settings, vocabulary trimming fails to improve performance, and is even prone to incurring heavy degradation.
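For intuition only, the following is a minimal sketch of the kind of threshold trimming studied here: after BPE segmentation, any subword whose corpus frequency falls below a threshold is recursively replaced by the pair of subwords it was merged from. The helper names and data structures are illustrative assumptions, not the implementation of any particular tokenization library.

```python
# A minimal, illustrative sketch of threshold vocabulary trimming for BPE.
# Helper names and data structures are assumptions for exposition, not the
# code of any particular tokenization library.
from collections import Counter

def trim_token(token, freqs, merges, threshold):
    """Recursively split `token` into its component subwords whenever its
    corpus frequency falls below `threshold`.

    freqs:   Counter mapping subword -> corpus frequency
    merges:  dict mapping a merged subword -> the (left, right) pair of
             subwords it was built from during BPE training
    """
    if freqs.get(token, 0) >= threshold or token not in merges:
        return [token]  # frequent enough, or atomic: keep as-is
    left, right = merges[token]
    # Rare merged token: replace it with its components, trimming those too.
    return (trim_token(left, freqs, merges, threshold)
            + trim_token(right, freqs, merges, threshold))

def trim_segmentation(tokens, freqs, merges, threshold):
    """Apply trimming to an already BPE-segmented sequence of subwords."""
    return [piece for tok in tokens for piece in
            trim_token(tok, freqs, merges, threshold)]

# Toy example: the subword "lower" was merged from "low" + "er",
# and "low" from "lo" + "w".
merges = {"lower": ("low", "er"), "low": ("lo", "w")}
freqs = Counter({"lower": 3, "low": 150, "er": 600, "lo": 400, "w": 800})
print(trim_segmentation(["lower"], freqs, merges, threshold=50))
# -> ['low', 'er']  ("lower" is rare and is split; "low" and "er" are kept)
```

This mirrors the intuition above: rare merged subwords revert to their more frequent components, shrinking the effective vocabulary without changing the underlying merge table.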
Related papers
- Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT [59.245414547751636]
We propose a circuit discovery framework alternative to activation patching.
Our framework suffers less from out-of-distribution issues and is more efficient in terms of complexity.
We dig into a small transformer trained on a synthetic task named Othello and find a number of human-understandable, fine-grained circuits inside it.
arXiv Detail & Related papers (2024-02-19T15:04:53Z)
- Bit Cipher -- A Simple yet Powerful Word Representation System that Integrates Efficiently with Language Models [4.807347156077897]
Bit-cipher is a word representation system that eliminates the need for backpropagation and hyper-efficient dimensionality reduction techniques.
We perform probing experiments on part-of-speech (POS) tagging and named entity recognition (NER) to assess bit-cipher's competitiveness with classic embeddings.
By replacing embedding layers with cipher embeddings, our experiments illustrate the notable efficiency of cipher in accelerating the training process and attaining better optima.
arXiv Detail & Related papers (2023-11-18T08:47:35Z)
- ngram-OAXE: Phrase-Based Order-Agnostic Cross Entropy for Non-Autoregressive Machine Translation [51.06378042344563]
A new training loss, oaxe (order-agnostic cross entropy), has proven effective at ameliorating the effect of multimodality in non-autoregressive translation (NAT).
We extend oaxe by allowing reordering only between ngram phrases, while still requiring a strict match of word order within the phrases (a simplified sketch of the underlying order-agnostic loss follows this entry).
Further analyses show that ngram-oaxe indeed improves the translation of ngram phrases, and produces more fluent translation with a better modeling of sentence structure.
arXiv Detail & Related papers (2022-10-08T11:39:15Z)
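For context, plain order-agnostic cross entropy scores a non-autoregressive prediction against the best reordering of the reference tokens rather than against a fixed order. The sketch below is a hedged illustration of that base loss using Hungarian matching; the function name, the numpy/scipy choices, and the toy data are assumptions, and the ngram restriction proposed in this paper is intentionally omitted.

```python
# Minimal sketch of plain order-agnostic cross entropy (oaxe), not the
# ngram-restricted variant from the paper. Names and library choices are
# assumptions for illustration only.
import numpy as np
from scipy.optimize import linear_sum_assignment

def oaxe_loss(log_probs, target_ids):
    """log_probs: (T, V) array of per-position log-probabilities from a
    non-autoregressive decoder; target_ids: length-T list of reference ids.
    Returns the cross entropy under the best one-to-one alignment between
    decoder positions and reference tokens."""
    # cost[i, j] = -log p(target_j | position_i)
    cost = -log_probs[:, target_ids]          # shape (T, T)
    rows, cols = linear_sum_assignment(cost)  # Hungarian matching
    return cost[rows, cols].mean()

# Toy example: 3 positions, vocabulary of 5 tokens.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5))
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
print(oaxe_loss(log_probs, [4, 1, 2]))
```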
- A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task Learning [8.052271364177988]
Subword tokenization is a commonly used input pre-processing step in most recent NLP models.
We propose a vocabulary-free neural tokenizer by distilling segmentation information from subword tokenization.
Our tokenizer consistently improves performance on multilingual (NLI) and code-switching (sentiment analysis) tasks.
arXiv Detail & Related papers (2022-04-22T16:50:49Z)
- Better Language Model with Hypernym Class Prediction [101.8517004687825]
Class-based language models (LMs) have long been devised to address context sparsity in $n$-gram LMs.
In this study, we revisit this approach in the context of neural LMs.
arXiv Detail & Related papers (2022-03-21T01:16:44Z)
- Frequency-Aware Contrastive Learning for Neural Machine Translation [24.336356651877388]
Low-frequency word prediction remains a challenge in modern neural machine translation (NMT) systems.
Inspired by the observation that low-frequency words form a more compact embedding space, we tackle this challenge from a representation learning perspective.
We propose a frequency-aware token-level contrastive learning method, in which the hidden state of each decoding step is pushed away from the counterparts of other target words (a rough sketch of such a token-level contrastive term follows this entry).
arXiv Detail & Related papers (2021-12-29T10:10:10Z)
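As a rough, assumption-laden illustration of a token-level contrastive term (not the paper's exact frequency-aware formulation), the sketch below pulls each decoder hidden state toward the embedding of its own target token and pushes it away from the embeddings of the other target tokens in the sentence; the InfoNCE-style form, names, and temperature are assumptions, and the frequency-aware weighting is omitted.

```python
# Rough sketch of a token-level contrastive term: each decoding step's hidden
# state is attracted to its own target-token embedding and repelled from the
# embeddings of the other target tokens. The frequency-aware weighting from
# the paper is omitted; everything here is a simplified assumption.
import torch
import torch.nn.functional as F

def token_contrastive_loss(hidden, target_emb, temperature=0.1):
    """hidden:     (T, d) decoder hidden states, one per target position
       target_emb: (T, d) embeddings of the corresponding target tokens"""
    h = F.normalize(hidden, dim=-1)
    e = F.normalize(target_emb, dim=-1)
    # sim[i, j] = similarity between hidden state i and target embedding j
    sim = h @ e.t() / temperature                 # (T, T)
    labels = torch.arange(h.size(0))              # position i matches token i
    # Cross entropy over rows: maximize sim[i, i], minimize sim[i, j != i]
    return F.cross_entropy(sim, labels)

# Toy usage with random tensors standing in for real model outputs.
T, d = 6, 16
loss = token_contrastive_loss(torch.randn(T, d), torch.randn(T, d))
print(float(loss))
```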
- DEEP: DEnoising Entity Pre-training for Neural Machine Translation [123.6686940355937]
It has been shown that machine translation models usually generate poor translations for named entities that are infrequent in the training corpus.
We propose DEEP, a DEnoising Entity Pre-training method that leverages large amounts of monolingual data and a knowledge base to improve named entity translation accuracy within sentences.
arXiv Detail & Related papers (2021-11-14T17:28:09Z)
- Rejuvenating Low-Frequency Words: Making the Most of Parallel Data in Non-Autoregressive Translation [98.11249019844281]
Knowledge distillation (KD) is commonly used to construct synthetic data for training non-autoregressive translation (NAT) models.
We propose reverse KD to rejuvenate more alignments for low-frequency target words.
Results demonstrate that the proposed approach can significantly and universally improve translation quality.
arXiv Detail & Related papers (2021-06-02T02:41:40Z)
- Improving Lexically Constrained Neural Machine Translation with Source-Conditioned Masked Span Prediction [6.46964825569749]
In this paper, we tackle a more challenging setup consisting of domain-specific corpora with much longer n-grams and highly specialized terms.
To encourage span-level representations in generation, we additionally impose a source-sentence conditioned masked span prediction loss in the decoder (a rough sketch of such a loss follows this entry).
Experimental results on three domain-specific corpora in two language pairs demonstrate that the proposed training scheme can improve the performance of existing lexically constrained methods.
arXiv Detail & Related papers (2021-05-12T08:11:33Z)
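To make the auxiliary objective concrete, here is a hedged sketch of a masked span prediction loss on the decoder side: a contiguous target span is replaced by a mask token in the decoder input, and cross entropy is computed only over those masked positions (in a real model the decoder would also attend to the source sentence, making the prediction source-conditioned). The mask-id convention, helper names, and use of random logits are assumptions, not the paper's code.

```python
# Hedged sketch of a masked span prediction loss: a contiguous span of the
# target is masked in the decoder input, and the loss is computed only on
# those masked positions. Names and conventions are illustrative assumptions.
import torch
import torch.nn.functional as F

MASK_ID = 0  # assumed id of a [MASK]-like token in the target vocabulary

def mask_span(target_ids, span_start, span_len):
    """Return decoder input ids with one contiguous span masked, plus a
    boolean mask marking the positions whose tokens must be predicted."""
    dec_input = target_ids.clone()
    positions = torch.zeros_like(target_ids, dtype=torch.bool)
    dec_input[span_start:span_start + span_len] = MASK_ID
    positions[span_start:span_start + span_len] = True
    return dec_input, positions

def span_prediction_loss(logits, target_ids, positions):
    """logits: (T, V) decoder outputs given the masked input (and, in a real
    encoder-decoder, the source via cross-attention); the loss covers the
    masked positions only."""
    return F.cross_entropy(logits[positions], target_ids[positions])

# Toy usage with random logits standing in for a real encoder-decoder model.
T, V = 8, 100
target = torch.randint(1, V, (T,))
dec_input, positions = mask_span(target, span_start=2, span_len=3)
logits = torch.randn(T, V)
print(float(span_prediction_loss(logits, target, positions)))
```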
- Adversarial Subword Regularization for Robust Neural Machine Translation [23.968624881678913]
Exposing diverse subword segmentations to neural machine translation (NMT) models often improves the robustness of machine translation.
We present adversarial subword regularization (ADVSR) to study whether gradient signals during training can be a substitute criterion for exposing diverse subword segmentations.
arXiv Detail & Related papers (2020-04-29T12:06:42Z)