An Empirical Study of Automatic Post-Editing
- URL: http://arxiv.org/abs/2209.07759v1
- Date: Fri, 16 Sep 2022 07:38:27 GMT
- Title: An Empirical Study of Automatic Post-Editing
- Authors: Xu Zhang and Xiaojun Wan
- Abstract summary: APE aims to reduce manual post-editing efforts by automatically correcting errors in machine-translated output.
To alleviate the lack of genuine training data, most of the current APE systems employ data augmentation methods to generate large-scale artificial corpora.
We study the outputs of the state-of-the-art APE model on a difficult APE dataset to analyze the problems in existing APE systems.
- Score: 56.86393786396992
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic post-editing (APE) aims to reduce manual post-editing efforts by
automatically correcting errors in machine-translated output. Due to the
limited amount of human-annotated training data, data scarcity is one of the
main challenges faced by all APE systems. To alleviate the lack of genuine
training data, most of the current APE systems employ data augmentation methods
to generate large-scale artificial corpora. In view of the importance of data
augmentation in APE, we separately study the impact of the construction method
of artificial corpora and of the artificial data domain on the performance of APE
models. Moreover, the difficulty of APE varies between different machine
translation (MT) systems. We study the outputs of the state-of-the-art APE model on
a difficult APE dataset to analyze the problems in existing APE systems.
Primarily, we find that 1) Artificial corpora with high-quality source text and
machine-translated text more effectively improve the performance of APE models;
2) In-domain artificial training data can better improve the performance of APE
models, while irrelevant out-of-domain data actually interfere with the model;
3) The existing APE model struggles with cases containing long source text or
high-quality machine-translated text; 4) The state-of-the-art APE model works
well on grammatical and semantic addition problems, but its output is prone to
entity and semantic omission errors.
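Since the abstract centers on how the artificial corpus is constructed, a minimal sketch of the standard construction recipe may help: machine-translate the source side of a parallel corpus and treat the human reference as a pseudo post-edit, yielding (src, mt, pe) triplets. This is an illustrative outline under those assumptions, not the authors' exact pipeline; the corpus, the `translate` callable, and the dummy MT function below are placeholders.

```python
from typing import Callable, Iterable, Iterator, Tuple

Triplet = Tuple[str, str, str]  # (source, machine translation, pseudo post-edit)

def build_pseudo_triplets(
    parallel_corpus: Iterable[Tuple[str, str]],
    translate: Callable[[str], str],
) -> Iterator[Triplet]:
    """Turn (source, reference) pairs into artificial APE triplets.

    The human reference stands in for the post-edit; `translate` is whatever
    MT system produces the translations that the APE model will learn to fix.
    """
    for src, ref in parallel_corpus:
        mt = translate(src)   # pseudo "machine-translated output"
        yield src, mt, ref    # reference acts as the pseudo post-edit

if __name__ == "__main__":
    # Tiny made-up in-domain pair; a real pipeline would stream millions of
    # sentence pairs and filter them by quality and domain (findings 1 and 2).
    corpus = [("Das Haus ist klein.", "The house is small.")]
    dummy_mt = lambda s: "The house is little."  # stand-in for a real MT model
    for triplet in build_pseudo_triplets(corpus, dummy_mt):
        print(" | ".join(triplet))
```

Passing the MT system in as a callable makes it easy to rebuild the corpus from different MT outputs or different-domain parallel data when probing the effects described in findings 1 and 2.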
Related papers
- How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics [49.9329723199239]
We propose a method for the automated creation of a challenging test set without relying on the manual construction of artificial and unrealistic examples.
We categorize the test set of popular NLI datasets into three difficulty levels by leveraging methods that exploit training dynamics.
When our characterization method is applied to the training set, models trained with only a fraction of the data achieve comparable performance to those trained on the full dataset.
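The summary above does not spell out which training-dynamics statistics are used; one common recipe (in the style of dataset cartography) averages the model's gold-label probability across epochs and splits examples into three buckets by quantile. The sketch below shows only that generic idea with fabricated numbers; it is not the paper's exact procedure.

```python
import numpy as np

def difficulty_buckets(gold_probs: np.ndarray, hard_q: float = 1 / 3, easy_q: float = 2 / 3):
    """Assign easy / ambiguous / hard labels from training dynamics.

    gold_probs has shape (num_epochs, num_examples): the model's probability
    of the gold label for each example, recorded after every training epoch.
    """
    confidence = gold_probs.mean(axis=0)               # mean gold-label probability
    lo, hi = np.quantile(confidence, [hard_q, easy_q])
    labels = np.where(confidence >= hi, "easy",
                      np.where(confidence <= lo, "hard", "ambiguous"))
    return labels, confidence

if __name__ == "__main__":
    # Fabricated dynamics for five examples over three epochs, purely illustrative.
    probs = np.array([[0.90, 0.20, 0.50, 0.80, 0.30],
                      [0.95, 0.25, 0.60, 0.85, 0.20],
                      [0.97, 0.30, 0.40, 0.90, 0.25]])
    labels, confidence = difficulty_buckets(probs)
    print(list(zip(labels.tolist(), np.round(confidence, 2).tolist())))
```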
arXiv Detail & Related papers (2024-10-04T13:39:21Z)
- FairFlow: An Automated Approach to Model-based Counterfactual Data Augmentation For NLP [7.41244589428771]
This paper proposes FairFlow, an automated approach to generating parallel data for training counterfactual text generator models.
We show that FairFlow significantly overcomes the limitations of dictionary-based word-substitution approaches whilst maintaining good performance.
arXiv Detail & Related papers (2024-07-23T12:29:37Z)
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to the scarcity of high-quality data for training large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
- The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements.
LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information.
Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z)
- STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances.
We design fine-grained step-by-step instructions to obtain the initial data instances.
Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks.
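As a rough picture of what structure-to-text data generation looks like in practice, the sketch below verbalizes a target event structure into a passage via step-by-step instructions and keeps the (text, structure) pair as a synthetic instance. The prompt wording, the event schema, and the `call_llm` hook are all illustrative placeholders rather than STAR's actual prompts or models.

```python
from typing import Callable, Dict, List

def make_instruction(event: Dict[str, str]) -> str:
    """Compose step-by-step instructions asking an LLM to verbalize one event record."""
    return (
        "Write a short news-style passage by following these steps:\n"
        f"1. Describe an event of type '{event['event_type']}' triggered by the word '{event['trigger']}'.\n"
        f"2. Mention '{event['agent']}' as the agent and '{event['place']}' as the place.\n"
        "3. Use at most two sentences and introduce no other named entities."
    )

def generate_instances(
    events: List[Dict[str, str]],
    call_llm: Callable[[str], str],
) -> List[Dict[str, object]]:
    """Pair each target structure with an LLM-generated passage."""
    return [{"text": call_llm(make_instruction(e)), "structure": e} for e in events]

if __name__ == "__main__":
    # Toy structure plus a dummy LLM so the sketch runs end to end without an API.
    events = [{"event_type": "Attack", "trigger": "bombing",
               "agent": "the militia", "place": "the capital"}]
    dummy_llm = lambda prompt: ("A bombing carried out by the militia struck the capital. "
                                "Officials confirmed the attack.")
    print(generate_instances(events, dummy_llm))
```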
arXiv Detail & Related papers (2023-05-24T12:15:19Z)
- Bring More Attention to Syntactic Symmetry for Automatic Postediting of High-Quality Machine Translations [4.217162744375792]
We propose a linguistically motivated method of regularization that is expected to enhance APE models' understanding of the target language.
Our analysis of experimental results demonstrates that the proposed method helps improve the state-of-the-art architecture's APE quality for high-quality MTs.
arXiv Detail & Related papers (2023-05-17T20:25:19Z)
- Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
- Can Automatic Post-Editing Improve NMT? [9.233407096706744]
Automatic post-editing (APE) aims to improve machine translations, thereby reducing human post-editing effort.
APE has had notable success when used with statistical machine translation (SMT) systems but has not been as successful with neural machine translation (NMT) systems.
arXiv Detail & Related papers (2020-09-30T02:34:19Z)