SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural
Machine Translation
- URL: http://arxiv.org/abs/2307.16400v1
- Date: Mon, 31 Jul 2023 04:38:47 GMT
- Title: SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural
Machine Translation
- Authors: Haiyue Song, Raj Dabre, Chenhui Chu, Sadao Kurohashi, and Eiichiro
Sumita
- Abstract summary: Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT).
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
- Score: 51.881877192924414
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sub-word segmentation is an essential pre-processing step for Neural Machine
Translation (NMT). Existing work has shown that neural sub-word segmenters are
better than Byte-Pair Encoding (BPE); however, they are inefficient, as they
require parallel corpora, days to train, and hours to decode. This paper
introduces SelfSeg, a self-supervised neural sub-word segmentation method that
is much faster to train/decode and requires only monolingual dictionaries
instead of parallel corpora. SelfSeg takes as input a word in the form of a
partially masked character sequence, optimizes the word generation probability
and generates the segmentation with the maximum posterior probability, which is
calculated using a dynamic programming algorithm. The training time of SelfSeg
depends on word frequencies, and we explore several word frequency
normalization strategies to accelerate the training phase. Additionally, we
propose a regularization mechanism that allows the segmenter to generate
various segmentations for one word. To show the effectiveness of our approach,
we conduct MT experiments in low-, middle- and high-resource scenarios, where
we compare the performance of using different segmentation methods. The
experimental results demonstrate that on the low-resource ALT dataset, our
method achieves an improvement of more than 1.2 BLEU over BPE and
SentencePiece, and an average improvement of 1.1 BLEU over Dynamic Programming
Encoding (DPE) and Vocabulary Learning via Optimal Transport (VOLT). The
regularization method achieves approximately a 4.3 BLEU score improvement over
BPE and a 1.2 BLEU score improvement over BPE-dropout, the regularized version
of BPE. We also observed significant improvements on IWSLT15 Vi->En, WMT16
Ro->En and WMT15 Fi->En datasets, and competitive results on the WMT14 De->En
and WMT14 Fr->En datasets.
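The decoding step described in the abstract, picking the segmentation with the maximum posterior probability via dynamic programming, can be illustrated with a small Viterbi-style sketch. The `log_prob` scorer below is a hypothetical stand-in for SelfSeg's neural model; the toy probability table and the fallback penalty are made up for illustration, not taken from the paper.

```python
import math

# Hypothetical per-subword scorer standing in for SelfSeg's neural model.
# Unknown subwords fall back to a weak per-character penalty so every word
# remains segmentable.
TOY_LOG_PROBS = {"un": -2.0, "happi": -4.0, "ness": -2.5}

def log_prob(subword: str) -> float:
    return TOY_LOG_PROBS.get(subword, -10.0 * len(subword))

def best_segmentation(word: str, max_len: int = 8):
    """Return (log-score, subwords) of the highest-scoring segmentation.

    best[i] holds the best log-score of segmenting word[:i]; back[i]
    remembers where the last subword started, so the optimal split is
    recovered by backtracking -- a standard Viterbi-style DP.
    """
    n = len(word)
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            score = best[j] + log_prob(word[j:i])
            if score > best[i]:
                best[i], back[i] = score, j
    # Backtrack to recover the subword sequence.
    pieces, i = [], n
    while i > 0:
        pieces.append(word[back[i]:i])
        i = back[i]
    return best[n], pieces[::-1]

print(best_segmentation("unhappiness"))  # (-54.5, ['un', 'happi', 'ness'])
```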
Related papers
- Subword Segmental Machine Translation: Unifying Segmentation and Target Sentence Generation [7.252933737829635]
Subword segmental machine translation (SSMT) learns to segment target sentence words while jointly learning to generate target sentences.
Experiments across 6 translation directions show that SSMT improves chrF scores for morphologically rich agglutinative languages.
arXiv Detail & Related papers (2023-05-11T17:44:29Z)
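The SSMT entry above reports chrF gains; chrF is a character n-gram F-score, which makes it less dependent on any particular word or subword segmentation than BLEU. As a hedged illustration of how such scores are usually computed, here is a minimal sacrebleu sketch with made-up sentences:

```python
import sacrebleu  # pip install sacrebleu

# Toy hypothesis/reference pair; real evaluations use full test sets.
hypotheses = ["the cat sits on the mat"]
references = [["the cat sat on the mat"]]  # one list per reference stream

chrf = sacrebleu.corpus_chrf(hypotheses, references)
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"chrF = {chrf.score:.2f}, BLEU = {bleu.score:.2f}")
```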
- Effects of sub-word segmentation on performance of transformer language models [0.628122931748758]
We compare GPT and BERT models trained with the statistical segmentation algorithm BPE vs. two unsupervised algorithms for morphological segmentation.
We show that training with morphological segmentation allows the LMs to: 1. achieve lower perplexity, 2. converge more efficiently in terms of training time, and 3. achieve equivalent or better evaluation scores on downstream tasks.
arXiv Detail & Related papers (2023-05-09T14:30:29Z)
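One caveat when reading perplexity comparisons like the one above: token-level perplexities from different segmentations are not directly comparable, because the models predict different numbers of tokens for the same text. A common remedy is to renormalize the summed negative log-likelihood by a segmentation-independent unit such as words. A minimal sketch of that bookkeeping, with placeholder numbers rather than results from the paper:

```python
import math

def normalized_perplexity(total_nll_nats: float, n_units: int) -> float:
    """Perplexity per reference unit (e.g. per word), from a summed NLL in nats."""
    return math.exp(total_nll_nats / n_units)

# Placeholder numbers: the same test set scored under two tokenizations.
# A BPE model and a morphological-segmentation model emit different token
# counts, so both are renormalized by the shared word count.
n_words = 20_000
print(normalized_perplexity(total_nll_nats=93_000.0, n_units=n_words))  # BPE run
print(normalized_perplexity(total_nll_nats=91_500.0, n_units=n_words))  # morph run
```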
- Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
arXiv Detail & Related papers (2021-06-10T10:18:23Z)
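To make the distinction concrete, here is a rough sketch of the two families of input corruption mentioned above: masking part of the input versus producing a full-looking but reordered sentence. The corruption probability and window size are illustrative choices, not the paper's settings.

```python
import random

random.seed(0)

def mask_tokens(tokens, mask_prob=0.35, mask_token="<mask>"):
    """MLM-style corruption: hide a fraction of tokens for the decoder to reconstruct."""
    return [mask_token if random.random() < mask_prob else t for t in tokens]

def shuffle_local(tokens, window=3):
    """Reordering-style corruption: permute tokens inside small local windows,
    so the input still looks like a (scrambled) full sentence."""
    out = []
    for i in range(0, len(tokens), window):
        chunk = tokens[i:i + window]
        random.shuffle(chunk)
        out.extend(chunk)
    return out

sentence = "the quick brown fox jumps over the lazy dog".split()
print(mask_tokens(sentence))
print(shuffle_local(sentence))
```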
- The Effectiveness of Morphology-aware Segmentation in Low-Resource Neural Machine Translation [0.6091702876917281]
This paper evaluates the performance of several modern subword segmentation methods in a low-resource neural machine translation setting.
We compare segmentations produced by applying BPE at the token or sentence level with morphologically-based segmentations from LMVR and MORSEL.
arXiv Detail & Related papers (2021-03-20T14:39:25Z)
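Sentence-level BPE of the kind compared above is commonly trained with the SentencePiece toolkit, which learns merges directly over raw, untokenized text. A minimal usage sketch, assuming a hypothetical corpus.txt and an illustrative vocabulary size:

```python
import sentencepiece as spm  # pip install sentencepiece

# Train a sentence-level BPE model on raw text (no pre-tokenization needed).
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # one sentence per line (illustrative path)
    model_prefix="bpe8k",      # writes bpe8k.model / bpe8k.vocab
    vocab_size=8000,           # illustrative size for a low-resource setting
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="bpe8k.model")
print(sp.encode("unhappiness is segmented into subwords", out_type=str))
```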
- BEDS: Bagging ensemble deep segmentation for nucleus segmentation with testing stage stain augmentation [6.80053986075991]
Bootstrap aggregating, also known as bagging, is a canonical ensemble algorithm for aggregating weak learners into a strong learner.
We propose a simple bagging ensemble deep segmentation (BEDs) method to train multiple U-Nets with partial training data to segment dense nuclei on pathological images.
The contributions of this study are three-fold: (1) developing a self-ensemble learning framework for nucleus segmentation; (2) aggregating testing stage augmentation with self-ensemble learning; and (3) elucidating the idea that self-ensemble and testing stage stain augmentation are complementary strategies for superior segmentation performance.
arXiv Detail & Related papers (2021-02-17T19:34:41Z)
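The core of the bagging idea used in BEDS is generic: train each ensemble member on a bootstrap resample of the training data and average the members' predictions at test time. A small NumPy sketch of those two steps, with random probability maps standing in for U-Net outputs (counts, shapes, and threshold are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: each ensemble member trains on a bootstrap resample of the training
# items (sampling indices with replacement); shown here only as index selection.
n_train = 100
member_train_sets = [rng.integers(0, n_train, size=n_train) for _ in range(5)]

# Step 2 (test time): average the members' predicted probability maps and
# threshold. Random 64x64 maps stand in for per-model segmenter outputs.
ensemble_probs = [rng.random((64, 64)) for _ in range(5)]
mean_prob = np.mean(ensemble_probs, axis=0)
mask = mean_prob > 0.5
print("members:", len(member_train_sets),
      "foreground fraction:", round(float(mask.mean()), 3))
```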
- Improving Text Generation with Student-Forcing Optimal Transport [122.11881937642401]
We propose using optimal transport (OT) to match the sequences generated in training and testing modes.
An extension is also proposed to improve the OT learning, based on the structural and contextual information of the text sequences.
The effectiveness of the proposed method is validated on machine translation, text summarization, and text generation tasks.
arXiv Detail & Related papers (2020-10-12T19:42:25Z)
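Optimal-transport matching of generated and reference sequences, as in the entry above, is usually computed with an entropic-regularized (Sinkhorn) solver over a cost matrix between token embeddings. A compact NumPy sketch of that solver; the random embeddings, cosine cost, and regularization strength are placeholders rather than the paper's configuration.

```python
import numpy as np

def sinkhorn_cost(cost, a, b, eps=0.1, n_iters=200):
    """Entropic-regularized OT cost between histograms a and b for a cost matrix."""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    plan = u[:, None] * K * v[None, :]   # approximate transport plan
    return float((plan * cost).sum())

rng = np.random.default_rng(0)
gen = rng.normal(size=(7, 16))   # generated-token embeddings (placeholder)
ref = rng.normal(size=(9, 16))   # reference-token embeddings (placeholder)

# Cosine-distance cost between the two token sequences.
gen_n = gen / np.linalg.norm(gen, axis=1, keepdims=True)
ref_n = ref / np.linalg.norm(ref, axis=1, keepdims=True)
cost = 1.0 - gen_n @ ref_n.T

a = np.full(len(gen), 1.0 / len(gen))  # uniform weights over generated tokens
b = np.full(len(ref), 1.0 / len(ref))  # uniform weights over reference tokens
print(sinkhorn_cost(cost, a, b))
```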
- Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation [80.38621085548013]
This paper introduces Dynamic Programming Encoding (DPE), a new segmentation algorithm for tokenizing sentences into subword units.
A mixed character-subword transformer is proposed, which enables exact log marginal likelihood estimation and exact MAP inference to find target segmentations.
arXiv Detail & Related papers (2020-05-03T05:00:50Z)
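The exact marginal likelihood in DPE comes from a forward-style dynamic program: the log-probability of a prefix is a log-sum-exp over every choice of its final subword. A minimal context-free sketch of that recursion; the real model scores subwords with a transformer conditioned on the source sentence and the target history, which is replaced here by a hypothetical table.

```python
import math

def log_prob(subword: str) -> float:
    """Hypothetical context-free subword scorer standing in for DPE's transformer."""
    table = {"un": -2.0, "happi": -4.0, "ness": -2.5}
    return table.get(subword, -10.0 * len(subword))

def log_marginal(word: str, max_len: int = 8) -> float:
    """Log of the sum, over all segmentations of 'word', of the product of subword probs."""
    n = len(word)
    alpha = [-math.inf] * (n + 1)  # alpha[i] = log P(word[:i], summed over segmentations)
    alpha[0] = 0.0
    for i in range(1, n + 1):
        terms = [alpha[j] + log_prob(word[j:i])
                 for j in range(max(0, i - max_len), i)]
        m = max(terms)
        alpha[i] = m + math.log(sum(math.exp(t - m) for t in terms))
    return alpha[n]

print(log_marginal("unhappiness"))
```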
- Byte Pair Encoding is Suboptimal for Language Model Pretraining [49.30780227162387]
We analyze differences between unigram LM tokenization and byte-pair encoding (BPE).
We find that the unigram LM tokenization method matches or outperforms BPE across downstream tasks and two languages.
We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
arXiv Detail & Related papers (2020-04-07T21:21:06Z)
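SentencePiece implements both tokenizers compared above, so switching model_type is enough to contrast unigram LM and BPE segmentations of the same corpus. A hedged sketch with illustrative paths and vocabulary size:

```python
import sentencepiece as spm  # pip install sentencepiece

# Train unigram-LM and BPE tokenizers on the same raw corpus (illustrative path/size).
for model_type in ("unigram", "bpe"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",
        model_prefix=f"{model_type}8k",
        vocab_size=8000,
        model_type=model_type,
    )

uni = spm.SentencePieceProcessor(model_file="unigram8k.model")
bpe = spm.SentencePieceProcessor(model_file="bpe8k.model")

sentence = "pretraining with subword segmentation"
print("unigram:", uni.encode(sentence, out_type=str))
print("bpe:    ", bpe.encode(sentence, out_type=str))
```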
- Multilingual Denoising Pre-training for Neural Machine Translation [132.66750663226287]
mBART is a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora.
mBART is one of the first methods for pre-training a complete sequence-to-sequence model.
arXiv Detail & Related papers (2020-01-22T18:59:17Z)
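mBART checkpoints are available through the Hugging Face transformers library; the public facebook/mbart-large-cc25 release is the denoising-pretrained model that is then fine-tuned on parallel data for translation. A minimal loading-and-generation sketch (language codes follow that release; generation settings are illustrative, and the un-fine-tuned checkpoint will not produce good translations):

```python
from transformers import MBartForConditionalGeneration, MBartTokenizer

# Load the multilingual denoising checkpoint; in practice this model is
# fine-tuned on parallel data before it produces useful translations.
name = "facebook/mbart-large-cc25"
tokenizer = MBartTokenizer.from_pretrained(name, src_lang="en_XX", tgt_lang="ro_RO")
model = MBartForConditionalGeneration.from_pretrained(name)

batch = tokenizer("The cat sat on the mat.", return_tensors="pt")
generated = model.generate(
    **batch,
    decoder_start_token_id=tokenizer.lang_code_to_id["ro_RO"],  # target language tag
    max_length=40,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```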