SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural
  Machine Translation
        - URL: http://arxiv.org/abs/2307.16400v1
 - Date: Mon, 31 Jul 2023 04:38:47 GMT
 - Authors: Haiyue Song, Raj Dabre, Chenhui Chu, Sadao Kurohashi, and Eiichiro
  Sumita
 - Abstract summary: Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT).
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
 - Score: 51.881877192924414
 - License: http://creativecommons.org/licenses/by/4.0/
 - Abstract:   Sub-word segmentation is an essential pre-processing step for Neural Machine
Translation (NMT). Existing work has shown that neural sub-word segmenters are
better than Byte-Pair Encoding (BPE); however, they are inefficient, as they
require parallel corpora, days to train, and hours to decode. This paper
introduces SelfSeg, a self-supervised neural sub-word segmentation method that
is much faster to train/decode and requires only monolingual dictionaries
instead of parallel corpora. SelfSeg takes as input a word in the form of a
partially masked character sequence, optimizes the word generation probability
and generates the segmentation with the maximum posterior probability, which is
calculated using a dynamic programming algorithm. The training time of SelfSeg
depends on word frequencies, and we explore several word frequency
normalization strategies to accelerate the training phase. Additionally, we
propose a regularization mechanism that allows the segmenter to generate
various segmentations for one word. To show the effectiveness of our approach,
we conduct MT experiments in low-, middle- and high-resource scenarios, where
we compare the performance of using different segmentation methods. The
experimental results demonstrate that on the low-resource ALT dataset, our
method achieves more than 1.2 BLEU score improvement compared with BPE and
SentencePiece, and a 1.1 score improvement over Dynamic Programming Encoding
(DPE) and Vocabulary Learning via Optimal Transport (VOLT) on average. The
regularization method achieves approximately a 4.3 BLEU score improvement over
BPE and a 1.2 BLEU score improvement over BPE-dropout, the regularized version
of BPE. We also observed significant improvements on IWSLT15 Vi->En, WMT16
Ro->En and WMT15 Fi->En datasets, and competitive results on the WMT14 De->En
and WMT14 Fr->En datasets.
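
The decoding step described in the abstract (choosing the segmentation with the maximum posterior probability via dynamic programming) follows the standard Viterbi-style recurrence over character spans: best(j) = max over i < j of best(i) + score(word[i:j]). The sketch below is a minimal illustration of that recurrence, not the authors' implementation; `subword_logprob` and `max_len` are hypothetical stand-ins for the scores produced by the trained masked-character word-generation model and for a cap on subword length.

```python
import math

def viterbi_segment(word, subword_logprob, max_len=8):
    """Return the maximum-posterior segmentation of `word`.

    `subword_logprob(piece)` is assumed to return the log-probability the
    trained segmenter assigns to the candidate subword `piece` (and -inf
    for pieces outside the vocabulary).  Recurrence:
        best[j] = max_{i < j} best[i] + subword_logprob(word[i:j])
    """
    n = len(word)
    best = [-math.inf] * (n + 1)   # best[j]: best score of word[:j]
    back = [0] * (n + 1)           # back[j]: split point achieving best[j]
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            score = best[i] + subword_logprob(word[i:j])
            if score > best[j]:
                best[j] = score
                back[j] = i
    # Follow the back-pointers to recover the segmentation.
    pieces, j = [], n
    while j > 0:
        i = back[j]
        pieces.append(word[i:j])
        j = i
    return list(reversed(pieces))
```

With a toy scorer that, say, assigns higher log-probability to "translat" and "ion" than to their individual characters, `viterbi_segment("translation", scorer)` would return `["translat", "ion"]`; in the paper the scores instead come from the neural word-generation model.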
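
The abstract mentions two further mechanisms only at a high level: frequency normalization to shorten training, and a regularizer that lets the segmenter produce different segmentations of the same word. One common way to realize the latter (as in subword-regularization approaches) is to sample each next piece from a softmax over candidates instead of taking the argmax, and the former can be approximated by capping per-word training counts. Both functions below are illustrative assumptions under those readings, not the paper's exact procedures; `subword_logprob`, `temperature`, and `cap` are hypothetical parameters.

```python
import math
import random

def sample_segmentation(word, subword_logprob, max_len=8, temperature=1.0):
    """Sample one segmentation left-to-right instead of taking the argmax,
    so repeated calls can yield different segmentations of the same word
    (the behavior a segmentation regularizer is after).  `subword_logprob`
    is assumed to return a finite log-probability for every candidate,
    e.g. via a character-level fallback."""
    pieces, i, n = [], 0, len(word)
    while i < n:
        # Candidate subwords starting at position i, up to max_len chars.
        cands = [word[i:j] for j in range(i + 1, min(n, i + max_len) + 1)]
        logits = [subword_logprob(c) / temperature for c in cands]
        m = max(logits)
        weights = [math.exp(x - m) for x in logits]
        # Draw one candidate proportionally to its softmax weight.
        r, acc, choice = random.random() * sum(weights), 0.0, cands[-1]
        for cand, w in zip(cands, weights):
            acc += w
            if r <= acc:
                choice = cand
                break
        pieces.append(choice)
        i += len(choice)
    return pieces

def normalized_counts(word_freqs, cap=1000):
    """Cap per-word training counts so that very frequent words do not
    dominate training time (one simple frequency-normalization scheme)."""
    return {w: min(f, cap) for w, f in word_freqs.items()}
```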
        Related papers
        - MorphTok: Morphologically Grounded Tokenization for Indian Languages [23.58043476541051]
Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs).
We propose morphology-aware segmentation as a pre-tokenization step prior to applying subword tokenization.
We also introduce Constrained BPE, an extension to the traditional BPE algorithm that incorporates script-specific constraints.
arXiv  Detail & Related papers  (2025-04-14T15:44:45Z)
- Subword Segmental Machine Translation: Unifying Segmentation and Target Sentence Generation [7.252933737829635]
Subword segmental machine translation (SSMT) learns to segment target sentence words while jointly learning to generate target sentences.
Experiments across 6 translation directions show that SSMT improves chrF scores for morphologically rich agglutinative languages.
arXiv  Detail & Related papers  (2023-05-11T17:44:29Z)
- Effects of sub-word segmentation on performance of transformer language models [0.628122931748758]
We compare GPT and BERT models trained with the statistical segmentation algorithm BPE vs. two unsupervised algorithms for morphological segmentation.
We show that training with morphological segmentation allows the LMs to: 1. achieve lower perplexity, 2. converge more efficiently in terms of training time, and 3. achieve equivalent or better evaluation scores on downstream tasks.
arXiv  Detail & Related papers  (2023-05-09T14:30:29Z)
- Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
arXiv  Detail & Related papers  (2021-06-10T10:18:23Z)
- The Effectiveness of Morphology-aware Segmentation in Low-Resource Neural Machine Translation [0.6091702876917281]
This paper evaluates the performance of several modern subword segmentation methods in a low-resource neural machine translation setting.
We compare segmentations produced by applying BPE at the token or sentence level with morphologically-based segmentations from LMVR and MORSEL.
arXiv  Detail & Related papers  (2021-03-20T14:39:25Z)
- BEDS: Bagging ensemble deep segmentation for nucleus segmentation with testing stage stain augmentation [6.80053986075991]
Bootstrap aggregating, also known as bagging, is a canonical ensemble algorithm for combining weak learners into a strong learner.
We propose a simple bagging ensemble deep segmentation (BEDs) method to train multiple U-Nets with partial training data to segment dense nuclei on pathological images.
The contributions of this study are three-fold: (1) developing a self-ensemble learning framework for nucleus segmentation; (2) aggregating testing stage augmentation with self-ensemble learning; and (3) elucidating the idea that self-ensemble and testing stage stain augmentation are complementary strategies for a superior segmentation performance.
arXiv  Detail & Related papers  (2021-02-17T19:34:41Z)
- Improving Text Generation with Student-Forcing Optimal Transport [122.11881937642401]
We propose using optimal transport (OT) to match the sequences generated in training and testing modes.
An extension is also proposed to improve the OT learning, based on the structural and contextual information of the text sequences.
The effectiveness of the proposed method is validated on machine translation, text summarization, and text generation tasks.
arXiv  Detail & Related papers  (2020-10-12T19:42:25Z)
- Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation [80.38621085548013]
This paper introduces Dynamic Programming Encoding (DPE), a new segmentation algorithm for tokenizing sentences into subword units.
A mixed character-subword transformer is proposed, which enables exact log marginal likelihood estimation and exact MAP inference to find target segmentations.
arXiv  Detail & Related papers  (2020-05-03T05:00:50Z)
- Byte Pair Encoding is Suboptimal for Language Model Pretraining [49.30780227162387]
We analyze differences between unigram LM tokenization and byte-pair encoding (BPE).
We find that the unigram LM tokenization method matches or outperforms BPE across downstream tasks and two languages.
We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
arXiv  Detail & Related papers  (2020-04-07T21:21:06Z)
- Multilingual Denoising Pre-training for Neural Machine Translation [132.66750663226287]
mBART is a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora.
mBART is one of the first methods for pre-training a complete sequence-to-sequence model.
arXiv  Detail & Related papers  (2020-01-22T18:59:17Z) 
        This list is automatically generated from the titles and abstracts of the papers on this site.