TreeSwap: Data Augmentation for Machine Translation via Dependency
Subtree Swapping
- URL: http://arxiv.org/abs/2311.02355v1
- Date: Sat, 4 Nov 2023 09:27:40 GMT
- Authors: Attila Nagy, Dorina Lakatos, Botond Barta, Judit Ács
- Abstract summary: We introduce a novel augmentation method, which generates new sentences by swapping objects and subjects across bisentences.
TreeSwap achieves consistent improvements over baseline models in 4 language pairs in both directions on resource-constrained datasets.
We also explore domain-specific corpora, but find that our method does not make significant improvements on law, medical and IT data.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data augmentation methods for neural machine translation are particularly
useful when a limited amount of training data is available, which is often the
case when dealing with low-resource languages. We introduce a novel
augmentation method, which generates new sentences by swapping objects and
subjects across bisentences. This is performed simultaneously based on the
dependency parse trees of the source and target sentences. We name this method
TreeSwap. Our results show that TreeSwap achieves consistent improvements over
baseline models in 4 language pairs in both directions on resource-constrained
datasets. We also explore domain-specific corpora, but find that our method
does not make significant improvements on law, medical and IT data. We report
the scores of similar augmentation methods and find that TreeSwap performs
comparably. We also analyze the generated sentences qualitatively and find that
the augmentation produces a correct translation in most cases. Our code is
available on Github.
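The swapping operation described in the abstract can be sketched in a few lines. The sketch below is an illustrative reading of the method, not the authors' released code: it operates on hand-annotated parses given as (token, head index, dependency relation) triples, whereas the paper derives trees with a dependency parser and applies the swap to source and target sentences simultaneously.

```python
# Toy sketch of dependency-subtree swapping (the TreeSwap idea) on
# hand-annotated parses. Token format: (form, head_index, deprel),
# with head_index = -1 marking the root.

def subtree_indices(tokens, root_idx):
    """Collect the indices of the subtree rooted at root_idx."""
    indices = {root_idx}
    changed = True
    while changed:  # fixed-point: keep absorbing dependents of the subtree
        changed = False
        for i, (_, head, _) in enumerate(tokens):
            if head in indices and i not in indices:
                indices.add(i)
                changed = True
    return sorted(indices)

def find_relation(tokens, deprel):
    """Index of the first token attached with the given relation, or None."""
    for i, (_, _, rel) in enumerate(tokens):
        if rel == deprel:
            return i
    return None

def swap_subtrees(tokens_a, tokens_b, deprel="obj"):
    """Swap the surface spans of the `deprel` subtrees of two sentences."""
    ia, ib = find_relation(tokens_a, deprel), find_relation(tokens_b, deprel)
    if ia is None or ib is None:
        return None  # augmentation not applicable to this pair
    span_a = [tokens_a[i][0] for i in subtree_indices(tokens_a, ia)]
    span_b = [tokens_b[i][0] for i in subtree_indices(tokens_b, ib)]

    def rebuild(tokens, span_idx, replacement):
        out, done, idx_set = [], False, set(span_idx)
        for i, (form, _, _) in enumerate(tokens):
            if i in idx_set:
                if not done:  # emit the replacement span once, in place
                    out.extend(replacement)
                    done = True
            else:
                out.append(form)
        return " ".join(out)

    return (rebuild(tokens_a, subtree_indices(tokens_a, ia), span_b),
            rebuild(tokens_b, subtree_indices(tokens_b, ib), span_a))

# Two toy sentences: "She reads old books ." and "He writes long letters ."
sent_a = [("She", 1, "nsubj"), ("reads", -1, "root"),
          ("old", 3, "amod"), ("books", 1, "obj"), (".", 1, "punct")]
sent_b = [("He", 1, "nsubj"), ("writes", -1, "root"),
          ("long", 3, "amod"), ("letters", 1, "obj"), (".", 1, "punct")]
new_a, new_b = swap_subtrees(sent_a, sent_b)
print(new_a)  # She reads long letters .
print(new_b)  # He writes old books .
```

Because the span indices come from the parse tree, modifiers travel with their heads ("old" follows "books"), which is what distinguishes subtree swapping from naive word-level substitution.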
Related papers
- Data Augmentation for Code Translation with Comparable Corpora and Multiple References [21.754147577489764]
We build and analyze multiple types of comparable corpora, including programs generated from natural language documentation.
To reduce overfitting to a single reference translation, we automatically generate additional translation references for available parallel data.
Experiments show that our data augmentation techniques significantly improve CodeT5 for translation between Java, Python, and C++ by an average of 7.5% Computational Accuracy.
arXiv Detail & Related papers (2023-11-01T06:01:22Z)
- Data Augmentation for Machine Translation via Dependency Subtree Swapping [0.0]
We present a generic framework for data augmentation via dependency subtree swapping.
We extract corresponding subtrees from the dependency parse trees of the source and target sentences and swap these across bisentences to create augmented samples.
We conduct resource-constrained experiments on 4 language pairs in both directions using the IWSLT text translation datasets and the Hunglish2 corpus.
arXiv Detail & Related papers (2023-07-13T19:00:26Z)
- Optimal Transport Posterior Alignment for Cross-lingual Semantic Parsing [68.47787275021567]
Cross-lingual semantic parsing transfers parsing capability from a high-resource language (e.g., English) to low-resource languages with scarce training data.
We propose a new approach to cross-lingual semantic parsing by explicitly minimizing cross-lingual divergence between latent variables using Optimal Transport.
arXiv Detail & Related papers (2023-07-09T04:52:31Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Syntax-driven Data Augmentation for Named Entity Recognition [3.0603554929274908]
In low resource settings, data augmentation strategies are commonly leveraged to improve performance.
We compare simple masked language model replacement and an augmentation method using constituency tree mutations to improve named entity recognition.
arXiv Detail & Related papers (2022-08-15T01:24:55Z)
- TreeMix: Compositional Constituency-based Data Augmentation for Natural Language Understanding [56.794981024301094]
We propose a compositional data augmentation approach for natural language understanding called TreeMix.
Specifically, TreeMix leverages constituency parse trees to decompose sentences into constituent sub-structures and the Mixup data augmentation technique to recombine them into new sentences.
Compared with previous approaches, TreeMix introduces greater diversity to the samples generated and encourages models to learn compositionality of NLP data.
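The constituency-based recombination described above can be sketched as follows. This is only an illustrative reading of the abstract: trees are hand-written (label, children) pairs rather than parser output, and the label-mixing weight (proportional to the tokens kept from each sentence) is an assumption about how the Mixup step works.

```python
# Toy sketch of the TreeMix recipe: splice a constituent sub-span from one
# sentence into another and mix the labels by token share. Trees here are
# hand-written (label, children) pairs; TreeMix itself uses a constituency
# parser, and this weighting is an illustrative assumption.

def leaves(tree):
    label, children = tree
    if isinstance(children, str):  # pre-terminal: children is the word
        return [children]
    out = []
    for child in children:
        out.extend(leaves(child))
    return out

def constituents(tree, labels=("NP", "VP")):
    """All subtrees whose label is in `labels`, in depth-first order."""
    label, children = tree
    found = [tree] if label in labels else []
    if not isinstance(children, str):
        for child in children:
            found.extend(constituents(child, labels))
    return found

def leaf_span(tree, target, start=0):
    """Leaf positions (start, end) covered by `target` inside `tree`."""
    if tree is target:
        return (start, start + len(leaves(tree)))
    label, children = tree
    if isinstance(children, str):
        return None
    for child in children:
        hit = leaf_span(child, target, start)
        if hit is not None:
            return hit
        start += len(leaves(child))
    return None

def treemix(tree_a, tree_b, y_a, y_b):
    """Replace the first NP/VP of sentence a with one from sentence b."""
    sub_a = constituents(tree_a)[0]
    sub_b = constituents(tree_b)[0]
    s, e = leaf_span(tree_a, sub_a)
    toks_a = leaves(tree_a)
    mixed = toks_a[:s] + leaves(sub_b) + toks_a[e:]
    lam = (len(toks_a) - (e - s)) / len(mixed)  # share of tokens kept from a
    return mixed, {y_a: lam, y_b: 1 - lam}

tree_a = ("S", [("NP", [("DT", "the"), ("NN", "cat")]),
                ("VP", [("VBD", "sat")])])
tree_b = ("S", [("NP", [("DT", "a"), ("NN", "dog")]),
                ("VP", [("VBD", "barked")])])
mixed, label = treemix(tree_a, tree_b, y_a=0, y_b=1)
print(mixed)  # ['a', 'dog', 'sat']
print(label)  # soft label over classes 0 and 1
```

Because whole constituents are exchanged, the generated sequence stays locally well-formed, which is the source of the diversity and compositionality the abstract mentions.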
arXiv Detail & Related papers (2022-05-12T15:25:12Z)
- LyS_ACoruña at SemEval-2022 Task 10: Repurposing Off-the-Shelf Tools for Sentiment Analysis as Semantic Dependency Parsing [10.355938901584567]
This paper addresses the problem of structured sentiment analysis using a bi-affine semantic dependency parser.
For the monolingual setup, we considered: (i) training on a single treebank, and (ii) relaxing the setup by training on treebanks coming from different languages.
For the zero-shot setup and a given target treebank, we relied on: (i) a word-level translation of available treebanks in other languages to get noisy, likely ungrammatical, but annotated data.
In the post-evaluation phase, we also trained cross-lingual models that simply merged all the English treebanks.
arXiv Detail & Related papers (2022-04-27T10:21:28Z)
- Incorporating Constituent Syntax for Coreference Resolution [50.71868417008133]
We propose a graph-based method to incorporate constituent syntactic structures.
We also explore utilising higher-order neighbourhood information to encode rich structures in constituent trees.
Experiments on the English and Chinese portions of OntoNotes 5.0 benchmark show that our proposed model either beats a strong baseline or achieves new state-of-the-art performance.
arXiv Detail & Related papers (2022-02-22T07:40:42Z)
- Cross-lingual Text Classification with Heterogeneous Graph Neural Network [2.6936806968297913]
Cross-lingual text classification aims at training a classifier on the source language and transferring the knowledge to target languages.
Recent multilingual pretrained language models (mPLM) achieve impressive results in cross-lingual classification tasks.
We propose a simple yet effective method to incorporate heterogeneous information within and across languages for cross-lingual text classification.
arXiv Detail & Related papers (2021-05-24T12:45:42Z)
- GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and Event Extraction [107.8262586956778]
We introduce graph convolutional networks (GCNs) with universal dependency parses to learn language-agnostic sentence representations.
GCNs struggle to model words with long-range dependencies or words that are not directly connected in the dependency tree.
We propose to utilize the self-attention mechanism to learn the dependencies between words with different syntactic distances.
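The core intuition, as far as the abstract states it, is that attention between words should depend on their distance in the dependency tree rather than only on linear word order. The toy below makes this concrete by masking attention beyond a syntactic-distance threshold; the actual model learns these interactions, so treat this as a sketch under stated assumptions, not GATE's formulation.

```python
# Toy: compute pairwise syntactic distances on a dependency tree and use
# them to restrict attention. The hard distance cutoff is an illustrative
# simplification of distance-aware self-attention.
from collections import deque
import math

def syntactic_distances(heads):
    """All-pairs hop counts on the undirected dependency tree (-1 = root)."""
    n = len(heads)
    adj = [[] for _ in range(n)]
    for i, h in enumerate(heads):
        if h >= 0:
            adj[i].append(h)
            adj[h].append(i)
    dist = [[0] * n for _ in range(n)]
    for s in range(n):  # BFS from every token
        seen, queue = {s}, deque([(s, 0)])
        while queue:
            u, d = queue.popleft()
            dist[s][u] = d
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append((v, d + 1))
    return dist

def syntax_masked_attention(scores, dist, max_dist):
    """Softmax over raw scores, restricted to tokens within max_dist hops."""
    out = []
    for i, row in enumerate(scores):
        exps = [math.exp(s) if dist[i][j] <= max_dist else 0.0
                for j, s in enumerate(row)]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

heads = [1, -1, 1]                 # "She reads books": both content words
dist = syntactic_distances(heads)  # attach to the root verb "reads"
scores = [[0.0] * 3 for _ in range(3)]
attn = syntax_masked_attention(scores, dist, max_dist=1)
print(attn[0])  # [0.5, 0.5, 0.0] -- "She" is 2 hops from "books"
```

Varying `max_dist` per attention head would give different heads different syntactic neighbourhoods, which is one way to read the abstract's "different syntactic distances".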
arXiv Detail & Related papers (2020-10-06T20:30:35Z)
- Syntax-aware Data Augmentation for Neural Machine Translation [76.99198797021454]
We propose a novel data augmentation strategy for neural machine translation.
We set a sentence-specific probability for word selection by considering each word's role in the sentence.
Our proposed method is evaluated on the WMT14 English-to-German and IWSLT14 German-to-English datasets.
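One plausible reading of "word selection by role" is that words deeper in the dependency tree, which tend to be modifiers rather than core arguments, receive a higher probability of being perturbed (dropped, blanked, or replaced). The linear depth weighting below is an illustrative assumption, not the paper's exact scheme.

```python
# Toy: assign each word a perturbation probability scaled by its depth in
# the dependency tree (-1 marks the root). Deeper words are treated as less
# essential and are perturbed more often.

def depths_from_root(heads):
    """Depth of each token in the dependency tree."""
    def depth(i):
        return 0 if heads[i] < 0 else 1 + depth(heads[i])
    return [depth(i) for i in range(len(heads))]

def selection_probs(heads, base=0.2):
    """Per-word perturbation probability, scaled by relative tree depth."""
    depths = depths_from_root(heads)
    max_depth = max(depths) or 1  # guard against single-word sentences
    return [base * d / max_depth for d in depths]

# "She reads very old books": modifiers sit deeper than the core arguments.
heads = [1, -1, 3, 4, 1]  # very -> old -> books -> reads
probs = selection_probs(heads)
print(probs)  # root verb gets 0.0; the deepest modifier gets the full 0.2
```

The sentence-specific part falls out naturally: the same word form gets different probabilities in different sentences because its tree depth changes with its syntactic role.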
arXiv Detail & Related papers (2020-04-29T13:45:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.