Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation
- URL: http://arxiv.org/abs/2005.06606v2
- Date: Sat, 1 Aug 2020 09:30:27 GMT
- Title: Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation
- Authors: Xuanli He, Gholamreza Haffari, Mohammad Norouzi
- Abstract summary: This paper introduces Dynamic Programming Encoding (DPE), a new segmentation algorithm for tokenizing sentences into subword units.
A mixed character-subword transformer is proposed, which enables exact log marginal likelihood estimation and exact MAP inference to find target segmentations.
- Score: 80.38621085548013
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces Dynamic Programming Encoding (DPE), a new segmentation
algorithm for tokenizing sentences into subword units. We view the subword
segmentation of output sentences as a latent variable that should be
marginalized out for learning and inference. A mixed character-subword
transformer is proposed, which enables exact log marginal likelihood estimation
and exact MAP inference to find target segmentations with maximum posterior
probability. DPE uses a lightweight mixed character-subword transformer as a
means of pre-processing parallel data to segment output sentences using dynamic
programming. Empirical results on machine translation suggest that DPE is
effective for segmenting output sentences and can be combined with BPE dropout
for stochastic segmentation of source sentences. DPE achieves an average
improvement of 0.9 BLEU over BPE (Sennrich et al., 2016) and an average
improvement of 0.55 BLEU over BPE dropout (Provilkov et al., 2019) on several
WMT datasets including English <=> (German, Romanian, Estonian, Finnish,
Hungarian).
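To make the marginalization and MAP inference concrete, below is a minimal sketch of the dynamic program, not the authors' implementation: `subword_logprob(prefix, piece)` is a hypothetical stand-in for the mixed character-subword transformer's conditional log-probability, `max_len` is an illustrative bound on subword length, and the vocabulary is assumed to contain every single character so that a full segmentation always exists.

```python
import math

def dpe_forward(chars, vocab, subword_logprob, max_len=8):
    """Exact DP over all segmentations of the target string `chars`.

    alpha[j]: log-marginal probability of chars[:j], summed over all
              segmentations (log p(y) when j == len(chars)).
    best[j]:  Viterbi (max) score, used to recover the MAP segmentation.
    `subword_logprob(prefix, piece)` stands in for the mixed
    character-subword transformer: log p(piece | character prefix).
    """
    n = len(chars)
    alpha = [-math.inf] * (n + 1)
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    alpha[0] = best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            if alpha[i] == -math.inf:
                continue                 # no valid segmentation of this prefix
            piece = chars[i:j]
            if piece not in vocab:
                continue                 # only in-vocabulary subwords
            lp = subword_logprob(chars[:i], piece)
            # Marginalization: log-sum-exp over split points (sum rule).
            m = max(alpha[j], alpha[i] + lp)
            alpha[j] = m + math.log(math.exp(alpha[j] - m)
                                    + math.exp(alpha[i] + lp - m))
            # MAP inference: the same recursion with max instead of sum.
            if best[i] + lp > best[j]:
                best[j], back[j] = best[i] + lp, i
    seg, j = [], n
    while j > 0:                         # backtrack the MAP segmentation
        seg.append(chars[back[j]:j])
        j = back[j]
    return alpha[n], list(reversed(seg))

# Toy usage with a uniform stand-in scorer (an assumption, not the model):
# with vocab = {"a", "b", "ab"}, dpe_forward("ab", vocab,
# lambda prefix, piece: math.log(0.5)) marginalizes over both
# segmentations ({"a","b"} and {"ab"}) and returns ["ab"] as the MAP.
```

In the paper's pipeline, the MAP segmentation produced by a lightweight model of this kind is used as a pre-processing step to segment the target side of the parallel data before training a standard NMT model.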
Related papers
- Adaptive BPE Tokenization for Enhanced Vocabulary Adaptation in Finetuning Pretrained Language Models [26.442558912559658]
We show a fundamental limitation in vocabulary adaptation approaches that use the Byte-Pair Encoding (BPE) tokenization scheme for fine-tuning pretrained language models (PLMs) to expert domains.
We propose AdaptBPE, which modifies the BPE tokenization phase to first perform longest string matching on the added (target) vocabulary before tokenizing at the character level, as sketched below.
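A minimal sketch of the longest-match idea as the summary describes it; `added_vocab` is an assumed input, and the real AdaptBPE operates inside the BPE tokenizer rather than as a standalone function.

```python
def adapt_tokenize(word, added_vocab):
    """Greedy longest-string match on the added (target) vocabulary,
    falling back to single characters for unmatched spans.
    A sketch of the idea only, not the AdaptBPE implementation."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest substring starting at i that is in the vocab.
        for j in range(len(word), i, -1):
            if word[i:j] in added_vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fall back to a single character
            i += 1
    return pieces

# e.g. adapt_tokenize("cardiomyopathy", {"cardio", "myopathy"})
# -> ["cardio", "myopathy"]
```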
arXiv Detail & Related papers (2024-10-04T09:24:55Z)
- SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation [51.881877192924414]
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT).
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
arXiv Detail & Related papers (2023-07-31T04:38:47Z)
- Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation [65.6736056006381]
We present a multilingual punctuation-agnostic sentence segmentation method covering 85 languages.
Our method outperforms all the prior best sentence-segmentation tools by an average of 6.1% F1 points.
By using our method to match sentence segmentation to the segmentation used during training of MT models, we achieve an average improvement of 2.3 BLEU points.
arXiv Detail & Related papers (2023-05-30T09:49:42Z)
- Subword Segmental Machine Translation: Unifying Segmentation and Target Sentence Generation [7.252933737829635]
Subword segmental machine translation (SSMT) learns to segment target sentence words while jointly learning to generate target sentences.
Experiments across 6 translation directions show that SSMT improves chrF scores for morphologically rich agglutinative languages.
arXiv Detail & Related papers (2023-05-11T17:44:29Z)
- Non-Parametric Domain Adaptation for End-to-End Speech Translation [72.37869362559212]
End-to-End Speech Translation (E2E-ST) has received increasing attention due to its potential for less error propagation, lower latency, and fewer parameters.
We propose a novel non-parametric method that leverages a domain-specific text translation corpus to achieve domain adaptation for the E2E-ST system.
arXiv Detail & Related papers (2022-05-23T11:41:02Z)
- LCP-dropout: Compression-based Multiple Subword Segmentation for Neural Machine Translation [5.505045114759599]
We propose a simple and effective preprocessing method for subword segmentation based on a data compression algorithm.
BPE/BPE-dropout is among the fastest and most effective methods compared to conventional approaches.
We propose LCP-dropout, a multiple subword segmentation method that improves on BPE/BPE-dropout, and show that it outperforms various baselines, especially when learning from small training data.
arXiv Detail & Related papers (2022-02-28T07:49:07Z)
- Syntax-aware Data Augmentation for Neural Machine Translation [76.99198797021454]
We propose a novel data augmentation strategy for neural machine translation.
We set a sentence-specific probability for word selection by considering each word's role in the sentence, as sketched below.
Our proposed method is evaluated on the WMT14 English-to-German and IWSLT14 German-to-English datasets.
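The summary does not spell out how the role-dependent probabilities are computed, so the sketch below is one plausible reading, not the paper's method: words deeper in a dependency parse are assumed to be less structurally central and are selected for perturbation (here, dropped) with higher probability. The `depths` input and the linear schedule are illustrative assumptions.

```python
import random

def role_aware_drop(words, depths, p_max=0.15):
    """One plausible reading of role-aware word selection (an
    assumption, not the paper's exact scheme): words deeper in the
    dependency tree are dropped with higher probability, up to p_max.
    depths[i] is the dependency-tree depth of words[i] (root = 0)."""
    max_d = max(depths) or 1    # avoid division by zero on a bare root
    kept = []
    for w, d in zip(words, depths):
        p = p_max * d / max_d   # sentence-specific, role-dependent
        if random.random() >= p:
            kept.append(w)
    return kept
```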
arXiv Detail & Related papers (2020-04-29T13:45:30Z)
- Byte Pair Encoding is Suboptimal for Language Model Pretraining [49.30780227162387]
We analyze differences between unigram LM tokenization and byte-pair encoding (BPE).
We find that the unigram LM tokenization method matches or outperforms BPE across downstream tasks and two languages.
We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
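Both schemes are available in the sentencepiece library, so the comparison is easy to reproduce; in the snippet below the corpus path and vocabulary size are illustrative, not taken from the paper.

```python
import sentencepiece as spm

# Train a unigram LM tokenizer and a BPE tokenizer on the same corpus
# (paths and vocab size are illustrative placeholders).
for model_type in ("unigram", "bpe"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",
        model_prefix=f"tok_{model_type}",
        vocab_size=8000,
        model_type=model_type,
    )

# Compare the two segmentations of the same text.
uni = spm.SentencePieceProcessor(model_file="tok_unigram.model")
bpe = spm.SentencePieceProcessor(model_file="tok_bpe.model")
text = "unrecognizable transformations"
print("unigram:", uni.encode(text, out_type=str))
print("bpe:    ", bpe.encode(text, out_type=str))
```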
arXiv Detail & Related papers (2020-04-07T21:21:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.