LCP-dropout: Compression-based Multiple Subword Segmentation for Neural Machine Translation
- URL: http://arxiv.org/abs/2202.13590v1
- Date: Mon, 28 Feb 2022 07:49:07 GMT
- Title: LCP-dropout: Compression-based Multiple Subword Segmentation for Neural Machine Translation
- Authors: Keita Nonaka, Kazutaka Yamanouchi, Tomohiro I, Tsuyoshi Okita, Kazutaka Shimada, Hiroshi Sakamoto
- Abstract summary: We propose a simple and effective preprocessing method for subword segmentation based on a data compression algorithm.
BPE/BPE-dropout is one of the fastest and most effective methods compared to conventional approaches.
We propose LCP-dropout, a multiple subword segmentation method that improves on BPE/BPE-dropout, and show that it outperforms various baselines, especially when learning from small training data.
- Score: 5.505045114759599
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this study, we propose a simple and effective preprocessing
method for subword segmentation based on a data compression algorithm.
Compression-based subword segmentation has recently attracted significant
attention as a preprocessing method for training data in Neural Machine
Translation. Among such methods, BPE/BPE-dropout is one of the fastest and
most effective compared to conventional approaches. However, the
compression-based approach has a drawback: because the algorithm is
deterministic, generating multiple segmentations is difficult. To overcome
this difficulty, we focus on a probabilistic string algorithm, called
locally-consistent parsing (LCP), that has been applied to achieve optimal
compression. Employing the probabilistic mechanism of LCP, we propose
LCP-dropout for multiple subword segmentation, which improves on
BPE/BPE-dropout, and show that it outperforms various baselines, especially
when learning from small training data.
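As a concrete illustration of why determinism is the obstacle and how a dropout mechanism removes it, below is a minimal BPE-dropout-style sketch. This is a toy illustration, not the paper's implementation; the `merges` table and `dropout_p` value are invented. With `dropout_p = 0` the merge loop is deterministic and always returns the same segmentation; with `dropout_p > 0`, each applicable merge can be randomly skipped, so repeated runs sample multiple segmentations of the same word.

```python
import random

def segment(word, merges, dropout_p=0.1, seed=None):
    """Greedy BPE merge loop with BPE-dropout-style randomness.

    `merges` maps a symbol pair to its merge priority (lower = applied first).
    With dropout_p = 0 this reduces to deterministic BPE; with dropout_p > 0,
    each applicable merge is independently skipped with probability dropout_p,
    so repeated calls can return different segmentations of the same word.
    """
    rng = random.Random(seed)
    symbols = list(word)
    while True:
        # Collect applicable merges, randomly dropping each one; dropped
        # merges are reconsidered (re-rolled) on later iterations.
        candidates = [
            (merges[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in merges and rng.random() >= dropout_p
        ]
        if not candidates:
            break
        _, i = min(candidates)  # apply the highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Toy merge table (invented); sample three segmentations of the same word.
merges = {("l", "o"): 0, ("lo", "w"): 1, ("w", "e"): 2}
for s in range(3):
    print(segment("lower", merges, dropout_p=0.3, seed=s))
```

LCP-dropout injects its randomness differently, through the probabilistic labeling used by locally-consistent parsing rather than by skipping merges, but the goal is the same: a compression-based segmenter that can emit multiple segmentations of the same training data.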
Related papers
- ECNR: Efficient Compressive Neural Representation of Time-Varying Volumetric Datasets [6.3492793442257085]
Compressive neural representation has emerged as a promising alternative to traditional compression methods for managing massive datasets.
This paper presents an Efficient Compressive Neural Representation (ECNR) solution for time-varying data compression.
We show the effectiveness of ECNR with multiple datasets and compare it with state-of-the-art compression methods.
arXiv Detail & Related papers (2023-10-02T06:06:32Z)
- SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation [51.881877192924414]
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT).
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
arXiv Detail & Related papers (2023-07-31T04:38:47Z)
- Single Model Ensemble for Subword Regularized Models in Low-Resource Machine Translation [25.04086897886412]
Subword regularizations use multiple subword segmentations during training to improve the robustness of neural machine translation models.
However, inference typically uses only a single segmentation, which creates a discrepancy between training and inference.
We propose an inference strategy to address this discrepancy.
Experimental results show that the proposed strategy improves the performance of models trained with subword regularization.
arXiv Detail & Related papers (2022-03-25T09:25:47Z)
- DenseCLIP: Extract Free Dense Labels from CLIP [130.3830819077699]
Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition.
DenseCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins.
Our finding suggests that DenseCLIP can serve as a new reliable source of supervision for dense prediction tasks.
arXiv Detail & Related papers (2021-12-02T09:23:01Z)
- Single-channel speech separation using Soft-minimum Permutation Invariant Training [60.99112031408449]
A long-standing problem in supervised speech separation is finding the correct label for each separated speech signal.
Permutation Invariant Training (PIT) has been shown to be a promising solution in handling the label ambiguity problem.
In this work, we propose a probabilistic optimization framework to address the inefficiency of PIT in finding the best output-label assignment.
arXiv Detail & Related papers (2021-11-16T17:25:05Z)
- Deep Gaussian Processes for Few-Shot Segmentation [66.08463078545306]
Few-shot segmentation is a challenging task, requiring the extraction of a generalizable representation from only a few annotated samples.
We propose a few-shot learner formulation based on Gaussian process (GP) regression.
Our approach sets a new state-of-the-art for 5-shot segmentation, with mIoU scores of 68.1 and 49.8 on PASCAL-5i and COCO-20i, respectively.
arXiv Detail & Related papers (2021-03-30T17:56:32Z)
- PowerGossip: Practical Low-Rank Communication Compression in Decentralized Deep Learning [62.440827696638664]
We introduce a simple algorithm that directly compresses the model differences between neighboring workers.
Inspired by PowerSGD for centralized deep learning, this algorithm uses power iteration steps to maximize the information transferred per bit.
arXiv Detail & Related papers (2020-08-04T09:14:52Z)
- Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation [80.38621085548013]
This paper introduces Dynamic Programming Encoding (DPE), a new segmentation algorithm for tokenizing sentences into subword units.
A mixed character-subword transformer is proposed, which enables exact log marginal likelihood estimation and exact MAP inference to find target segmentations.
arXiv Detail & Related papers (2020-05-03T05:00:50Z)
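To make the dynamic-programming view behind the DPE entry above concrete, here is a minimal sketch of exact log-marginal and MAP computation over every segmentation of a string on a segmentation lattice. This is only an illustration, not the paper's model; the `vocab_logp` scores are invented and stand in for a neural model's per-subword log-probabilities.

```python
import math

def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def dp_segment(text, vocab_logp, max_len=4):
    """Exact DP over every segmentation of `text` into pieces from `vocab_logp`.

    Returns (log_marginal, map_segmentation): the logsumexp over all
    segmentations and the single best (Viterbi/MAP) segmentation.
    Assumes every single character is in the vocabulary, so at least
    one segmentation always exists.
    """
    n = len(text)
    log_sum = [-math.inf] * (n + 1)  # logsumexp over segmentations of text[:i]
    log_max = [-math.inf] * (n + 1)  # best single-path score for text[:i]
    back = [0] * (n + 1)             # backpointer for MAP decoding
    log_sum[0] = log_max[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            if piece not in vocab_logp:
                continue
            s = vocab_logp[piece]
            log_sum[i] = logaddexp(log_sum[i], log_sum[j] + s)
            if log_max[j] + s > log_max[i]:
                log_max[i] = log_max[j] + s
                back[i] = j
    pieces, i = [], n
    while i > 0:                     # walk backpointers to recover MAP pieces
        pieces.append(text[back[i]:i])
        i = back[i]
    return log_sum[n], pieces[::-1]

# Toy vocabulary scored by piece length (invented numbers).
vocab_logp = {p: -0.5 * len(p) for p in ["a", "b", "ab", "ba", "aba"]}
print(dp_segment("abab", vocab_logp))
```

A mixed character-subword model plugs its own scores into the same lattice; the marginal and MAP recursions differ only in replacing logsumexp with max.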