VOLT: Improving Vocabularization via Optimal Transport for Machine
Translation
- URL: http://arxiv.org/abs/2012.15671v1
- Date: Thu, 31 Dec 2020 15:49:49 GMT
- Title: VOLT: Improving Vocabularization via Optimal Transport for Machine
Translation
- Authors: Jingjing Xu, Hao Zhou, Chun Gan, Zaixiang Zheng, Lei Li
- Abstract summary: We find an exciting relation between an information-theoretic feature and BLEU scores.
We propose VOLT, a simple and efficient vocabularization solution without the full and costly trial training.
VOLT achieves 70% vocabulary size reduction and 0.6 BLEU gain on English-German translation.
- Score: 22.07373011242121
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It is well accepted that the choice of token vocabulary largely affects the
performance of machine translation. However, due to expensive trial costs, most
studies only conduct simple trials with dominant approaches (e.g., BPE) and
commonly used vocabulary sizes. In this paper, we find an exciting relation
between an information-theoretic feature and BLEU scores. With this
observation, we formulate the quest of vocabularization -- finding the best
token dictionary with a proper size -- as an optimal transport problem. We then
propose VOLT, a simple and efficient vocabularization solution without the full
and costly trial training. We evaluate our approach on multiple machine
translation tasks, including WMT-14 English-German translation, TED bilingual
translation, and TED multilingual translation. Empirical results show that VOLT
outperforms widely used vocabularies across diverse scenarios. For example, VOLT achieves
70% vocabulary size reduction and 0.6 BLEU gain on English-German translation.
Also, one advantage of VOLT lies in its low resource consumption. Compared to
naive BPE-search, VOLT reduces the search time from 288 GPU hours to 0.5 CPU
hours.
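The abstract's key observation is that an information-theoretic feature of a vocabulary correlates with BLEU, which lets vocabulary size be chosen without trial training. As a loose illustration only (this is not the paper's actual feature or its optimal-transport solver, and all names below are hypothetical), the sketch runs toy BPE merges and tracks how corpus entropy changes as the vocabulary grows:

```python
import math
from collections import Counter

def corpus_entropy(tokens):
    """Shannon entropy (bits per token) of a token sequence."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def bpe_merge_step(words):
    """One toy BPE step: merge the most frequent adjacent symbol pair.

    `words` maps a space-separated symbol sequence to its corpus frequency,
    e.g. {"l o w": 5}. Returns the updated mapping and the merged pair
    (or None if nothing is left to merge).
    """
    pairs = Counter()
    for word, freq in words.items():
        syms = word.split()
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    if not pairs:
        return words, None
    best = max(pairs, key=pairs.get)
    merged = {}
    for word, freq in words.items():
        syms = word.split()
        out, i = [], 0
        while i < len(syms):
            if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                out.append(syms[i] + syms[i + 1])
                i += 2
            else:
                out.append(syms[i])
                i += 1
        merged[" ".join(out)] = freq
    return merged, best

def tokens_of(words):
    """Flatten the word/frequency map into one token stream."""
    stream = []
    for word, freq in words.items():
        stream.extend(word.split() * freq)
    return stream

# Track entropy as the vocabulary grows, one merge at a time.
words = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
entropies = [corpus_entropy(tokens_of(words))]
for _ in range(8):
    words, pair = bpe_merge_step(words)
    if pair is None:
        break
    entropies.append(corpus_entropy(tokens_of(words)))
# The per-merge entropy differences are a crude stand-in for the
# size-vs-entropy trade-off that VOLT optimizes exactly via optimal transport.
```

In this toy setting, each merge adds one token type to the vocabulary, so scanning the entropy trajectory mimics choosing a vocabulary size by its marginal benefit; VOLT replaces this greedy scan with a principled optimal-transport formulation.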
Related papers
- An Analysis of BPE Vocabulary Trimming in Neural Machine Translation [56.383793805299234]
Vocabulary trimming is a postprocessing step that replaces rare subwords with their component subwords.
We show that vocabulary trimming fails to improve performance and is even prone to incurring heavy degradation.
arXiv Detail & Related papers (2024-03-30T15:29:49Z)
- The Best of Both Worlds: Combining Human and Machine Translations for Multilingual Semantic Parsing with Active Learning [50.320178219081484]
We propose an active learning approach that exploits the strengths of both human and machine translations.
An ideal utterance selection can significantly reduce the error and bias in the translated data.
arXiv Detail & Related papers (2023-05-22T05:57:47Z)
- On the Off-Target Problem of Zero-Shot Multilingual Neural Machine Translation [104.85258654917297]
We find that failure to encode a discriminative target-language signal leads to off-target translation and a closer lexical distance.
We propose Language-Aware Vocabulary Sharing (LAVS) to construct the multilingual vocabulary.
We conduct experiments on a multilingual machine translation benchmark in 11 languages.
arXiv Detail & Related papers (2023-05-18T12:43:31Z)
- Bilex Rx: Lexical Data Augmentation for Massively Multilingual Machine Translation [33.6064740446337]
This work explores a cheap and abundant resource to combat this problem: bilingual lexica.
We test the efficacy of bilingual lexica in a real-world set-up, on 200-language translation models trained on web-crawled text.
We present several findings: (1) using lexical data augmentation, we demonstrate sizable performance gains for unsupervised translation; (2) we compare several families of data augmentation, demonstrating that they yield similar improvements; and (3) we demonstrate the importance of carefully curated lexica over larger, noisier ones.
arXiv Detail & Related papers (2023-03-27T14:54:43Z)
- Fast Vocabulary Projection Method via Clustering for Multilingual Machine Translation on GPU [6.1646755570223934]
This paper proposes a fast vocabulary projection method via clustering.
The proposed method speeds up the vocabulary projection step itself by up to 2.6x.
We also conduct an extensive human evaluation to verify that the proposed method preserves the quality of the translations from the original model.
arXiv Detail & Related papers (2022-08-14T16:10:14Z)
- How Effective is Byte Pair Encoding for Out-Of-Vocabulary Words in Neural Machine Translation? [17.300004156754966]
We analyze the translation quality of OOV words based on word type, number of segments, cross-attention, and the frequency of segment n-grams.
Our experiments show that while careful BPE settings seem fairly useful for translating OOV words, a considerable percentage of OOV words are still translated incorrectly.
arXiv Detail & Related papers (2022-08-10T08:57:13Z)
- Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training [59.571632468137075]
We find that many languages are under-represented in recent cross-lingual language models due to limited vocabulary capacity.
We propose an algorithm, VoCap, to determine the desired vocabulary capacity of each language.
To address the resulting issues, we propose k-NN-based target sampling to accelerate the expensive softmax.
arXiv Detail & Related papers (2021-09-15T14:04:16Z)
- ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and the endangered language Cherokee.
It supports both statistical and neural translation models and provides quality estimation to inform users of reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z)
- Bilingual Dictionary Based Neural Machine Translation without Using Parallel Sentences [45.99290614777277]
We propose a new machine translation (MT) task that uses no parallel sentences but may refer to a ground-truth bilingual dictionary.
Motivated by how a monolingual speaker learns to translate by looking up a bilingual dictionary, we propose this task to see how much potential an MT system can attain.
arXiv Detail & Related papers (2020-07-06T12:05:27Z)
- Knowledge Distillation for Multilingual Unsupervised Neural Machine Translation [61.88012735215636]
Unsupervised neural machine translation (UNMT) has recently achieved remarkable results for several language pairs.
However, UNMT can only translate between a single language pair and cannot produce results for multiple language pairs at the same time.
In this paper, we empirically introduce a simple method to translate between thirteen languages using a single encoder and a single decoder.
arXiv Detail & Related papers (2020-04-21T17:26:16Z)
- Neural Machine Translation for Low-Resourced Indian Languages [4.726777092009554]
Machine translation is an effective approach to convert text to a different language without any human involvement.
In this paper, we apply NMT to two language pairs involving morphologically rich Indian languages, English-Tamil and English-Malayalam.
We propose a novel NMT model using multi-head self-attention along with pre-trained Byte-Pair-Encoded (BPE) and MultiBPE embeddings to develop an efficient translation system.
arXiv Detail & Related papers (2020-04-19T17:29:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.