Deterministic Reversible Data Augmentation for Neural Machine Translation
- URL: http://arxiv.org/abs/2406.02517v1
- Date: Tue, 4 Jun 2024 17:39:23 GMT
- Title: Deterministic Reversible Data Augmentation for Neural Machine Translation
- Authors: Jiashu Yao, Heyan Huang, Zeming Liu, Yuhang Guo
- Abstract summary: We propose Deterministic Reversible Data Augmentation (DRDA), a simple but effective data augmentation method for neural machine translation.
With no extra corpora or model changes required, DRDA outperforms strong baselines on several translation tasks with a clear margin.
DRDA exhibits good robustness in noisy, low-resource, and cross-domain datasets.
- Score: 36.10695293724949
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data augmentation is an effective way to diversify corpora in machine translation, but previous methods may introduce semantic inconsistency between original and augmented data because of irreversible operations and random subword sampling procedures. To generate both symbolically diverse and semantically consistent augmentation data, we propose Deterministic Reversible Data Augmentation (DRDA), a simple but effective data augmentation method for neural machine translation. DRDA adopts deterministic segmentations and reversible operations to generate multi-granularity subword representations and pulls them closer together with multi-view techniques. With no extra corpora or model changes required, DRDA outperforms strong baselines on several translation tasks with a clear margin (up to 4.3 BLEU gain over Transformer) and exhibits good robustness in noisy, low-resource, and cross-domain datasets.
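To make the core idea concrete, below is a minimal sketch (not the authors' released implementation) of deterministic multi-granularity segmentation with the reversibility property: a single BPE-style merge list is truncated to different sizes, so each view of a word is fully deterministic, and joining the subwords of any view recovers the original surface form. The toy merge table and the greedy merge application are illustrative assumptions.

```python
# Illustrative sketch of DRDA-style deterministic, reversible
# multi-granularity segmentation. Assumption: a learned BPE merge
# list truncated to different prefix sizes yields one deterministic
# segmentation per granularity.

def segment(word, merges):
    """Greedily apply BPE-style merges, in order, to a character sequence."""
    symbols = list(word)
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]
            else:
                i += 1
    return symbols

# Toy merge table; a real one would be learned from the training corpus.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]

def multi_granularity_views(word, merges, sizes=(1, 2, 4)):
    """One deterministic segmentation per merge-table prefix size."""
    return {k: segment(word, merges[:k]) for k in sizes}

views = multi_granularity_views("lower", MERGES)
for k, subwords in views.items():
    # Reversibility: concatenating the subwords always recovers the word,
    # so every view is semantically consistent with the original.
    assert "".join(subwords) == "lower"
    print(k, subwords)
# 1 ['lo', 'w', 'e', 'r']
# 2 ['low', 'e', 'r']
# 4 ['lower']
```

Per the abstract, DRDA then feeds several such views of the same sentence to the model and pulls their representations closer together with multi-view techniques, which is what keeps the augmented data semantically consistent with the original.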
Related papers
- Optimizing Non-Autoregressive Transformers with Contrastive Learning [74.46714706658517]
Non-autoregressive Transformers (NATs) reduce the inference latency of Autoregressive Transformers (ATs) by predicting words all at once rather than in sequential order.
In this paper, we propose to ease the difficulty of modality learning via sampling from the model distribution instead of the data distribution.
arXiv Detail & Related papers (2023-05-23T04:20:13Z)
- Recurrence Boosts Diversity! Revisiting Recurrent Latent Variable in Transformer-Based Variational AutoEncoder for Diverse Text Generation [85.5379146125199]
Variational Auto-Encoder (VAE) has been widely adopted in text generation.
We propose TRACE, a Transformer-based recurrent VAE structure.
arXiv Detail & Related papers (2022-10-22T10:25:35Z)
- Semantically Consistent Data Augmentation for Neural Machine Translation via Conditional Masked Language Model [5.756426081817803]
This paper introduces a new data augmentation method for neural machine translation.
Our method is based on the Conditional Masked Language Model (CMLM).
We show that CMLM is capable of enforcing semantic consistency by conditioning on both source and target during substitution (a sketch of this substitution style follows the list below).
arXiv Detail & Related papers (2022-09-22T09:19:08Z)
- A Cognitive Study on Semantic Similarity Analysis of Large Corpora: A Transformer-based Approach [0.0]
We perform semantic similarity analysis and modeling on the U.S. Patent Phrase to Phrase Matching dataset using both traditional and transformer-based techniques.
The experimental results demonstrate our methodology's enhanced performance compared to traditional techniques, with an average Pearson correlation score of 0.79.
arXiv Detail & Related papers (2022-07-24T11:06:56Z)
- Learning to Generalize to More: Continuous Semantic Augmentation for Neural Machine Translation [50.54059385277964]
We present a novel data augmentation paradigm termed Continuous Semantic Augmentation (CsaNMT).
CsaNMT augments each training instance with an adjacency region that could cover adequate variants of literal expression under the same meaning.
arXiv Detail & Related papers (2022-04-14T08:16:28Z)
- Rethinking Data Augmentation for Low-Resource Neural Machine Translation: A Multi-Task Learning Approach [0.0]
Data augmentation (DA) techniques may be used for generating additional training samples when the available parallel data are scarce.
We present a multi-task DA approach in which we generate new sentence pairs with transformations.
We show consistent improvements over the baseline and over DA methods aiming at extending the support of the empirical data distribution.
arXiv Detail & Related papers (2021-09-08T13:39:30Z)
- Uncertainty-Aware Semantic Augmentation for Neural Machine Translation [37.555675157198145]
We propose uncertainty-aware semantic augmentation, which explicitly captures the universal semantic information among multiple semantically-equivalent source sentences.
Our approach significantly outperforms the strong baselines and the existing methods.
arXiv Detail & Related papers (2020-10-09T07:48:09Z)
- Learning Source Phrase Representations for Neural Machine Translation [65.94387047871648]
We propose an attentive phrase representation generation mechanism which is able to generate phrase representations from corresponding token representations.
In our experiments, we obtain significant improvements on the WMT 14 English-German and English-French tasks on top of the strong Transformer baseline.
arXiv Detail & Related papers (2020-06-25T13:43:11Z)
- Syntax-aware Data Augmentation for Neural Machine Translation [76.99198797021454]
We propose a novel data augmentation strategy for neural machine translation.
We set a sentence-specific probability for word selection by considering each word's role in the sentence.
Our proposed method is evaluated on WMT14 English-to-German dataset and IWSLT14 German-to-English dataset.
arXiv Detail & Related papers (2020-04-29T13:45:30Z)
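As a concrete illustration of the CMLM-style substitution described above (Semantically Consistent Data Augmentation via Conditional Masked Language Model), the sketch below masks one target word and lets a masked language model propose in-context replacements. The paper's CMLM conditions jointly on the source sentence and the partially masked target; here, as a loose stand-in, an off-the-shelf multilingual masked LM is used with the source prepended as context. The model choice and the concatenation trick are illustrative assumptions, not the authors' exact setup.

```python
# Hedged sketch of CMLM-style word substitution for data augmentation.
# Assumption: a generic multilingual masked LM approximates the paper's
# source-and-target-conditioned CMLM; prepending the source sentence
# lets mask predictions see both sides of the translation pair.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-multilingual-cased")

source = "Das Haus ist sehr groß."       # source side (German)
target = "The house is very [MASK]."     # target side with one word masked

# Top candidates serve as semantically consistent substitutes for the
# masked target word, conditioned on the source context.
for cand in unmasker(f"{source} {target}")[:3]:
    print(cand["token_str"], round(cand["score"], 3))
```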
This list is automatically generated from the titles and abstracts of the papers in this site.