LCP-dropout: Compression-based Multiple Subword Segmentation for Neural Machine Translation
- URL: http://arxiv.org/abs/2202.13590v1
- Date: Mon, 28 Feb 2022 07:49:07 GMT
- Title: LCP-dropout: Compression-based Multiple Subword Segmentation for Neural Machine Translation
- Authors: Keita Nonaka, Kazutaka Yamanouchi, Tomohiro I, Tsuyoshi Okita, Kazutaka Shimada, Hiroshi Sakamoto
- Abstract summary: We propose a simple and effective preprocessing method for subword segmentation based on a data compression algorithm.
BPE/BPE-dropout is one of the fastest and most effective methods compared to conventional approaches.
We propose LCP-dropout, a multiple subword segmentation method that improves on BPE/BPE-dropout, and show that it outperforms various baselines, especially when learning from small training data.
- Score: 5.505045114759599
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this study, we propose a simple and effective preprocessing
method for subword segmentation based on a data compression algorithm.
Compression-based subword segmentation has recently attracted significant
attention as a preprocessing method for training data in Neural Machine
Translation. Among such methods, BPE/BPE-dropout is one of the fastest and
most effective compared to conventional approaches. However, the
compression-based approach has a drawback: because the algorithm is
deterministic, generating multiple segmentations is difficult. To overcome
this difficulty, we focus on a probabilistic string algorithm, called
locally-consistent parsing (LCP), that has been applied to achieve optimal
compression. Employing the probabilistic mechanism of LCP, we propose
LCP-dropout for multiple subword segmentation, which improves on
BPE/BPE-dropout, and show that it outperforms various baselines, especially
when learning from small training data.
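As a concrete illustration of why determinism is the obstacle and how a dropout mechanism removes it, below is a minimal BPE-dropout-style sketch. This is a toy illustration, not the paper's implementation; the `merges` table and `dropout_p` value are invented. With `dropout_p = 0` the merge loop is deterministic and always returns the same segmentation; with `dropout_p > 0`, each applicable merge can be randomly skipped, so repeated runs sample multiple segmentations of the same word.

```python
import random

def segment(word, merges, dropout_p=0.1, seed=None):
    """Greedy BPE merge loop with BPE-dropout-style randomness.

    `merges` maps a symbol pair to its merge priority (lower = applied first).
    With dropout_p = 0 this reduces to deterministic BPE; with dropout_p > 0,
    each applicable merge is independently skipped with probability dropout_p,
    so repeated calls can return different segmentations of the same word.
    """
    rng = random.Random(seed)
    symbols = list(word)
    while True:
        # Collect applicable merges, randomly dropping each one; dropped
        # merges are reconsidered (re-rolled) on later iterations.
        candidates = [
            (merges[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in merges and rng.random() >= dropout_p
        ]
        if not candidates:
            break
        _, i = min(candidates)  # apply the highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Toy merge table (invented); sample three segmentations of the same word.
merges = {("l", "o"): 0, ("lo", "w"): 1, ("w", "e"): 2}
for s in range(3):
    print(segment("lower", merges, dropout_p=0.3, seed=s))
```

LCP-dropout injects its randomness differently, through the probabilistic labeling used by locally-consistent parsing rather than by skipping merges, but the goal is the same: a compression-based segmenter that can emit multiple segmentations of the same training data.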
Related papers
- ECNR: Efficient Compressive Neural Representation of Time-Varying Volumetric Datasets [6.3492793442257085]
Compressive neural representation has emerged as a promising alternative to traditional compression methods for managing massive datasets.
This paper presents an Efficient Compressive Neural Representation (ECNR) solution for time-varying data compression.
We show the effectiveness of ECNR with multiple datasets and compare it with state-of-the-art compression methods.
arXiv Detail & Related papers (2023-10-02T06:06:32Z)
- SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation [51.881877192924414]
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT).
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
arXiv Detail & Related papers (2023-07-31T04:38:47Z)
- Single Model Ensemble for Subword Regularized Models in Low-Resource Machine Translation [25.04086897886412]
Subword regularizations use multiple subword segmentations during training to improve the robustness of neural machine translation models.
However, inference typically uses only a single segmentation, which creates a discrepancy between training and inference.
We propose an inference strategy to address this discrepancy.
Experimental results show that the proposed strategy improves the performance of models trained with subword regularization.
arXiv Detail & Related papers (2022-03-25T09:25:47Z)
- DenseCLIP: Extract Free Dense Labels from CLIP [130.3830819077699]
Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition.
DenseCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins.
Our finding suggests that DenseCLIP can serve as a new reliable source of supervision for dense prediction tasks.
arXiv Detail & Related papers (2021-12-02T09:23:01Z)
- Single-channel speech separation using Soft-minimum Permutation Invariant Training [60.99112031408449]
A long-standing problem in supervised speech separation is finding the correct label for each separated speech signal.
Permutation Invariant Training (PIT) has been shown to be a promising solution in handling the label ambiguity problem.
In this work, we propose a probabilistic optimization framework to address the inefficiency of PIT in finding the best output-label assignment.
arXiv Detail & Related papers (2021-11-16T17:25:05Z)
- Deep Gaussian Processes for Few-Shot Segmentation [66.08463078545306]
Few-shot segmentation is a challenging task, requiring the extraction of a generalizable representation from only a few annotated samples.
We propose a few-shot learner formulation based on Gaussian process (GP) regression.
Our approach sets a new state-of-the-art for 5-shot segmentation, with mIoU scores of 68.1 and 49.8 on PASCAL-5i and COCO-20i, respectively.
arXiv Detail & Related papers (2021-03-30T17:56:32Z)
- PowerGossip: Practical Low-Rank Communication Compression in Decentralized Deep Learning [62.440827696638664]
We introduce a simple algorithm that directly compresses the model differences between neighboring workers.
Inspired by PowerSGD for centralized deep learning, this algorithm uses power iteration steps to maximize the information transferred per bit.
arXiv Detail & Related papers (2020-08-04T09:14:52Z)
- Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation [80.38621085548013]
This paper introduces Dynamic Programming Encoding (DPE), a new segmentation algorithm for tokenizing sentences into subword units.
A mixed character-subword transformer is proposed, which enables exact log marginal likelihood estimation and exact MAP inference to find target segmentations.
arXiv Detail & Related papers (2020-05-03T05:00:50Z)
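To make the dynamic-programming view behind the DPE entry above concrete, here is a minimal sketch of exact log-marginal and MAP computation over every segmentation of a string on a segmentation lattice. This is only an illustration, not the paper's model; the `vocab_logp` scores are invented and stand in for a neural model's per-subword log-probabilities.

```python
import math

def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def dp_segment(text, vocab_logp, max_len=4):
    """Exact DP over every segmentation of `text` into pieces from `vocab_logp`.

    Returns (log_marginal, map_segmentation): the logsumexp over all
    segmentations and the single best (Viterbi/MAP) segmentation.
    Assumes every single character is in the vocabulary, so at least
    one segmentation always exists.
    """
    n = len(text)
    log_sum = [-math.inf] * (n + 1)  # logsumexp over segmentations of text[:i]
    log_max = [-math.inf] * (n + 1)  # best single-path score for text[:i]
    back = [0] * (n + 1)             # backpointer for MAP decoding
    log_sum[0] = log_max[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            if piece not in vocab_logp:
                continue
            s = vocab_logp[piece]
            log_sum[i] = logaddexp(log_sum[i], log_sum[j] + s)
            if log_max[j] + s > log_max[i]:
                log_max[i] = log_max[j] + s
                back[i] = j
    pieces, i = [], n
    while i > 0:                     # walk backpointers to recover MAP pieces
        pieces.append(text[back[i]:i])
        i = back[i]
    return log_sum[n], pieces[::-1]

# Toy vocabulary scored by piece length (invented numbers).
vocab_logp = {p: -0.5 * len(p) for p in ["a", "b", "ab", "ba", "aba"]}
print(dp_segment("abab", vocab_logp))
```

A mixed character-subword model plugs its own scores into the same lattice; the marginal and MAP recursions differ only in replacing logsumexp with max.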