How Effective is Byte Pair Encoding for Out-Of-Vocabulary Words in
Neural Machine Translation?
- URL: http://arxiv.org/abs/2208.05225v1
- Date: Wed, 10 Aug 2022 08:57:13 GMT
- Authors: Ali Araabi, Christof Monz, Vlad Niculae
- Abstract summary: We analyze the translation quality of OOV words based on word type, number of segments, cross-attention weights, and the frequency of segment n-grams in the training data.
Our experiments show that while careful BPE settings seem to be fairly useful in translating OOV words across datasets, a considerable percentage of OOV words are translated incorrectly.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural Machine Translation (NMT) is an open vocabulary problem. As a result,
dealing with words that do not occur during training (a.k.a. out-of-vocabulary
(OOV) words) has long been a fundamental challenge for NMT systems. The
predominant method to tackle this problem is Byte Pair Encoding (BPE) which
splits words, including OOV words, into sub-word segments. BPE has achieved
impressive results for a wide range of translation tasks in terms of automatic
evaluation metrics. While it is often assumed that by using BPE, NMT systems
are capable of handling OOV words, the effectiveness of BPE in translating OOV
words has not been explicitly measured. In this paper, we study to what extent
BPE is successful in translating OOV words at the word-level. We analyze the
translation quality of OOV words based on word type, number of segments,
cross-attention weights, and the frequency of segment n-grams in the training
data. Our experiments show that while careful BPE settings seem to be fairly
useful in translating OOV words across datasets, a considerable percentage of
OOV words are translated incorrectly. Furthermore, we highlight the slightly
higher effectiveness of BPE in translating OOV words for special cases, such as
named-entities and when the languages involved are linguistically close to each
other.
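
To make the segmentation step concrete, here is a minimal sketch of how BPE splits an OOV word into known sub-word segments at inference time. The merge table and example words below are illustrative toy values, not taken from the paper.

```python
# Minimal sketch of BPE segmentation at inference time: repeatedly apply
# the highest-priority learned merge to adjacent symbols until no merge
# applies. Merges and example words are toy values for illustration.

def bpe_segment(word, merges):
    """Greedily apply learned merges (lower priority value = earlier merge)."""
    symbols = list(word)
    while True:
        best = None
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i + 1])
            if pair in merges and (best is None or merges[pair] < merges[best[1]]):
                best = (i, pair)
        if best is None:
            break
        i, pair = best
        # Merge the chosen adjacent pair into a single symbol.
        symbols = symbols[:i] + [pair[0] + pair[1]] + symbols[i + 2:]
    return symbols

# Toy merge priorities, as BPE would learn them from training-corpus statistics.
merges = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2, ("n", "e"): 3,
          ("ne", "w"): 4, ("e", "s"): 5, ("es", "t"): 6}

print(bpe_segment("lowest", merges))  # ['low', 'est']
print(bpe_segment("newer", merges))   # ['new', 'er']
```

Even if "lowest" never occurred in training, it is segmented into "low" and "est", both of which the model has seen; the paper's question is how often such segmentations actually yield a correct translation.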
Related papers
- An Analysis of BPE Vocabulary Trimming in Neural Machine Translation [56.383793805299234]
vocabulary trimming is a postprocessing step that replaces rare subwords with their component subwords.
We show that vocabulary trimming fails to improve performance and is even prone to incurring heavy degradation.
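
The trimming step described above can be sketched as follows; the frequencies, threshold, and fallback to single characters are simplified assumptions for illustration, not the exact procedure from the paper.

```python
# Illustrative sketch of BPE vocabulary trimming: subwords whose training
# frequency falls below a threshold are dropped from the vocabulary and
# re-segmented into component subwords (here, single characters for
# simplicity). Frequencies and the threshold are made-up toy values.

def trim_segmentation(tokens, freqs, min_freq):
    """Replace rare subword tokens with their component symbols."""
    trimmed = []
    for tok in tokens:
        if freqs.get(tok, 0) >= min_freq:
            trimmed.append(tok)
        else:
            trimmed.extend(list(tok))  # fall back to component symbols
    return trimmed

freqs = {"trans": 900, "lat": 12, "ion": 700}
print(trim_segmentation(["trans", "lat", "ion"], freqs, min_freq=50))
# ['trans', 'l', 'a', 't', 'ion'] -- the rare subword "lat" is decomposed
```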
arXiv Detail & Related papers (2024-03-30T15:29:49Z)
- An approach for mistranslation removal from popular dataset for Indic MT Task [5.4755933832880865]
We propose an algorithm to remove mistranslations from the training corpus and evaluate its performance and efficiency.
Two Indic languages (ILs), namely, Hindi (HIN) and Odia (ODI) are chosen for the experiment.
The quality of the translations in the experiment is evaluated using standard metrics such as BLEU, METEOR, and RIBES.
arXiv Detail & Related papers (2024-01-12T06:37:19Z)
- SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation [51.881877192924414]
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT).
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
arXiv Detail & Related papers (2023-07-31T04:38:47Z)
- When do Contrastive Word Alignments Improve Many-to-many Neural Machine Translation? [33.28706502928905]
This work proposes a word-level contrastive objective to leverage word alignments for many-to-many NMT.
Analyses reveal that in many-to-many NMT, the encoder's sentence retrieval performance highly correlates with the translation quality.
arXiv Detail & Related papers (2022-04-26T09:07:51Z)
- Phrase-level Active Learning for Neural Machine Translation [107.28450614074002]
We propose an active learning setting where we can spend a given budget on translating in-domain data.
We select both full sentences and individual phrases from unlabelled data in the new domain for routing to human translators.
In a German-English translation task, our active learning approach achieves consistent improvements over uncertainty-based sentence selection methods.
arXiv Detail & Related papers (2021-06-21T19:20:42Z)
- VOLT: Improving Vocabularization via Optimal Transport for Machine Translation [22.07373011242121]
We find an exciting relation between an information-theoretic feature and BLEU scores.
We propose VOLT, a simple and efficient vocabularization solution without the full and costly trial training.
VOLT achieves 70% vocabulary size reduction and 0.6 BLEU gain on English-German translation.
arXiv Detail & Related papers (2020-12-31T15:49:49Z)
- Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation [80.38621085548013]
This paper introduces Dynamic Programming Encoding (DPE), a new segmentation algorithm for tokenizing sentences into subword units.
A mixed character-subword transformer is proposed, which enables exact log marginal likelihood estimation and exact MAP inference to find target segmentations.
arXiv Detail & Related papers (2020-05-03T05:00:50Z)
- Byte Pair Encoding is Suboptimal for Language Model Pretraining [49.30780227162387]
We analyze differences between unigram LM tokenization and byte-pair encoding (BPE).
We find that the unigram LM tokenization method matches or outperforms BPE across downstream tasks and two languages.
We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
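
As a point of contrast with the greedy merges of BPE, the unigram LM method mentioned above can be sketched as a Viterbi search for the segmentation maximizing the sum of subword log-probabilities. The vocabulary and log-probabilities below are toy assumptions, not values from the paper.

```python
import math

# Hedged sketch of unigram-LM tokenization (in the style of SentencePiece):
# choose the segmentation maximizing total subword log-probability via
# dynamic programming. Vocabulary and scores are illustrative toy values.

def unigram_segment(word, logprob):
    n = len(word)
    # best[i] = (best score for word[:i], start index of last piece)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logprob:
                score = best[start][0] + logprob[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    # Backtrack from the end to recover the winning segmentation.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]

logprob = {"un": -2.0, "related": -3.0, "rel": -4.0, "ated": -4.5,
           "u": -6.0, "n": -5.0, "r": -6.0}
print(unigram_segment("unrelated", logprob))  # ['un', 'related']
```

Here "un" + "related" (score -5.0) beats "un" + "rel" + "ated" (score -10.5), illustrating how the probabilistic objective can prefer longer, more linguistically coherent segments than greedy merging.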
arXiv Detail & Related papers (2020-04-07T21:21:06Z)
- Bootstrapping a Crosslingual Semantic Parser [74.99223099702157]
We adapt a semantic parser trained on a single language, such as English, to new languages and multiple domains with minimal annotation.
We query if machine translation is an adequate substitute for training data, and extend this to investigate bootstrapping using joint training with English, paraphrasing, and multilingual pre-trained models.
arXiv Detail & Related papers (2020-04-06T12:05:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.