Syntax-aware Data Augmentation for Neural Machine Translation
- URL: http://arxiv.org/abs/2004.14200v1
- Date: Wed, 29 Apr 2020 13:45:30 GMT
- Title: Syntax-aware Data Augmentation for Neural Machine Translation
- Authors: Sufeng Duan, Hai Zhao, Dongdong Zhang, Rui Wang
- Abstract summary: We propose a novel data augmentation strategy for neural machine translation.
We set sentence-specific probabilities for word selection by considering each word's role in the sentence.
Our proposed method is evaluated on the WMT14 English-to-German and IWSLT14 German-to-English datasets.
- Score: 76.99198797021454
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data augmentation is an effective way to enhance performance in neural
machine translation (NMT) by generating additional bilingual data. In this paper, we
propose a novel data augmentation strategy for neural machine translation. Unlike
existing data augmentation methods, which choose words for modification with the
same probability across different sentences, we set a sentence-specific probability
for word selection by considering each word's role in the sentence. We use the
dependency parse tree of the input sentence as an effective clue to determine the
selection probability of every word in each sentence. Our proposed method is
evaluated on the WMT14 English-to-German and IWSLT14 German-to-English datasets.
The results of extensive experiments show that our proposed syntax-aware data
augmentation method can effectively boost existing sentence-independent methods and
yield significant translation performance improvements.
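As a concrete illustration of the idea, the sketch below (not the authors' code) assigns each token a sentence-specific selection probability derived from its depth in the dependency parse, then uses those probabilities to drive a simple sentence-independent operation such as blanking out tokens. The depth-based weighting, the base rate, and the spaCy model are illustrative assumptions rather than the paper's exact formulation.

```python
# A minimal sketch of syntax-aware word selection for data augmentation.
# Assumption: deeper (less structurally central) tokens get higher selection
# probability, so words near the dependency root are modified less often.
import random
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this spaCy model is installed

def dependency_depths(doc):
    """Depth of each token in the dependency tree (root = 0)."""
    depths = []
    for token in doc:
        depth, node = 0, token
        while node.head is not node:  # spaCy roots point to themselves
            node = node.head
            depth += 1
        depths.append(depth)
    return depths

def syntax_aware_selection_probs(doc, base_rate=0.15):
    """Sentence-specific selection probabilities scaled by relative depth."""
    depths = dependency_depths(doc)
    max_depth = max(depths) or 1
    return [base_rate * d / max_depth for d in depths]

def augment_sentence(sentence, blank_token="<blank>"):
    """Blank out tokens sampled with syntax-aware probabilities."""
    doc = nlp(sentence)
    probs = syntax_aware_selection_probs(doc)
    out = [blank_token if random.random() < p else tok.text
           for tok, p in zip(doc, probs)]
    return " ".join(out)

print(augment_sentence("The quick brown fox jumps over the lazy dog."))
```

In this sketch, structurally central words near the root are rarely touched, which reflects the intuition of weighting words by their syntactic roles rather than modifying all words with a uniform probability.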
Related papers
- Deterministic Reversible Data Augmentation for Neural Machine Translation [36.10695293724949]
We propose Deterministic Reversible Data Augmentation (DRDA), a simple but effective data augmentation method for neural machine translation.
With no extra corpora or model changes required, DRDA outperforms strong baselines on several translation tasks by a clear margin.
DRDA exhibits good robustness in noisy, low-resource, and cross-domain datasets.
arXiv Detail & Related papers (2024-06-04T17:39:23Z)
- Cross-lingual Transfer or Machine Translation? On Data Augmentation for Monolingual Semantic Textual Similarity [2.422759879602353]
Cross-lingual transfer of Wikipedia data improves performance on monolingual STS.
We find that the Wikipedia domain is superior to the NLI domain for these languages, in contrast to prior studies that focused on NLI as training data.
arXiv Detail & Related papers (2024-03-08T12:28:15Z)
- Improving Domain-Specific Retrieval by NLI Fine-Tuning [64.79760042717822]
This article investigates the fine-tuning potential of natural language inference (NLI) data to improve information retrieval and ranking.
We employ both monolingual and multilingual sentence encoders fine-tuned by a supervised method utilizing contrastive loss and NLI data.
Our results show that NLI fine-tuning improves model performance on both tasks and in both languages, with the potential to improve mono- and multilingual models.
arXiv Detail & Related papers (2023-08-06T12:40:58Z)
- Optimal Transport Posterior Alignment for Cross-lingual Semantic Parsing [68.47787275021567]
Cross-lingual semantic parsing transfers parsing capability from a high-resource language (e.g., English) to low-resource languages with scarce training data.
We propose a new approach to cross-lingual semantic parsing by explicitly minimizing cross-lingual divergence between latent variables using Optimal Transport.
arXiv Detail & Related papers (2023-07-09T04:52:31Z)
- Investigating Lexical Replacements for Arabic-English Code-Switched Data Augmentation [32.885722714728765]
We investigate data augmentation techniques for code-switching (CS) NLP systems.
We perform lexical replacements using word-aligned parallel corpora.
We compare these approaches against dictionary-based replacements.
arXiv Detail & Related papers (2022-05-25T10:44:36Z)
- On the Language Coverage Bias for Neural Machine Translation [81.81456880770762]
Language coverage bias is important for neural machine translation (NMT) because the target-original training data is not well exploited in current practice.
By carefully designing experiments, we provide comprehensive analyses of the language coverage bias in the training data.
We propose two simple and effective approaches to alleviate the language coverage bias problem.
arXiv Detail & Related papers (2021-06-07T01:55:34Z)
- Self-Training Sampling with Monolingual Data Uncertainty for Neural Machine Translation [98.83925811122795]
We propose to improve the sampling procedure by selecting the most informative monolingual sentences to complement the parallel data.
We compute the uncertainty of monolingual sentences using a bilingual dictionary extracted from the parallel data (a toy sketch of such a scoring appears after this list).
Experimental results on large-scale WMT English⇒German and English⇒Chinese datasets demonstrate the effectiveness of the proposed approach.
arXiv Detail & Related papers (2021-06-02T05:01:36Z)
- Self-Supervised Representations Improve End-to-End Speech Translation [57.641761472372814]
We show that self-supervised pre-trained features can consistently improve the translation performance.
Cross-lingual transfer allows extending to a variety of languages with little or no tuning.
arXiv Detail & Related papers (2020-06-22T10:28:38Z)
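Referring back to the self-training sampling entry above, the following toy illustration (not the cited paper's code) shows one plausible way to score monolingual sentences by translation uncertainty: each word's uncertainty is the entropy of its translation distribution in a bilingual dictionary extracted from parallel data, and a sentence's score is the average over its words. All names, probabilities, and the tokenization are assumptions made for illustration.

```python
# Toy sketch: rank monolingual sentences by average translation entropy,
# assuming a bilingual dictionary with per-word translation probabilities.
import math

def word_entropy(translation_probs):
    """Entropy of a word's translation distribution {target_word: probability}."""
    return -sum(p * math.log(p) for p in translation_probs.values() if p > 0)

def sentence_uncertainty(sentence, dictionary):
    """Average translation entropy over a whitespace-tokenized sentence.
    Words missing from the dictionary are treated as maximally certain (entropy 0)."""
    words = sentence.split()
    if not words:
        return 0.0
    return sum(word_entropy(dictionary.get(w, {})) for w in words) / len(words)

# Tiny made-up dictionary, as if estimated from word-aligned parallel data.
bilingual_dict = {
    "bank": {"Bank": 0.5, "Ufer": 0.5},        # ambiguous -> high entropy
    "the":  {"der": 0.4, "die": 0.4, "das": 0.2},
    "dog":  {"Hund": 1.0},                      # unambiguous -> zero entropy
}

monolingual = ["the dog sat by the bank", "the dog barked"]
ranked = sorted(monolingual,
                key=lambda s: sentence_uncertainty(s, bilingual_dict),
                reverse=True)
print(ranked)  # most uncertain sentences first
```

Under this sketch, higher-scoring (more uncertain) sentences might be prioritized as more informative complements to the parallel data, though the cited paper's exact scoring and sampling procedure may differ.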