Original or Translated? On the Use of Parallel Data for Translation Quality Estimation
- URL: http://arxiv.org/abs/2212.10257v1
- Date: Tue, 20 Dec 2022 14:06:45 GMT
- Title: Original or Translated? On the Use of Parallel Data for Translation Quality Estimation
- Authors: Baopu Qiu, Liang Ding, Di Wu, Lin Shang, Yibing Zhan, Dacheng Tao
- Abstract summary: We demonstrate a significant gap between parallel data and real QE data.
Parallel data is indiscriminate: translationese may occur on either the source or the target side.
We find that using the source-original part of a parallel corpus consistently outperforms its target-original counterpart.
- Score: 81.27850245734015
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Machine Translation Quality Estimation (QE) is the task of evaluating
translation output in the absence of human-written references. Due to the
scarcity of human-labeled QE data, previous works attempted to utilize the
abundant unlabeled parallel corpora to produce additional training data with
pseudo labels. In this paper, we demonstrate a significant gap between parallel
data and real QE data: for QE data, it is strictly guaranteed that the source
side is original texts and the target side is translated (namely
translationese). However, parallel data is indiscriminate: translationese may
occur on either the source or the target side. We compare the impact
of parallel data with different translation directions in QE data augmentation,
and find that using the source-original part of parallel corpus consistently
outperforms its target-original counterpart. Moreover, since the WMT corpus
lacks direction information for each parallel sentence, we train a classifier
to distinguish source- and target-original bitext, and carry out an analysis of
their difference in both style and domain. Together, these findings suggest
using source-original parallel data for QE data augmentation, which brings a
relative improvement of up to 4.0% and 6.4% compared to undifferentiated data
on sentence- and word-level QE tasks respectively.
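To make the suggested recipe concrete, here is a minimal Python sketch of the two steps the abstract describes: a classifier separating source-original from target-original bitext, and pseudo-label generation restricted to the source-original pairs. The TF-IDF classifier, the `translate` hook, and the TER-based pseudo-HTER label are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch, assuming a TF-IDF direction classifier and an external MT
# system; neither reflects the paper's actual models, which are not given here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sacrebleu.metrics import TER


def train_direction_classifier(src_texts, labels):
    """labels: 1 = source-original, 0 = target-original (translationese source)."""
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(src_texts), labels)
    return vec, clf


def pseudo_qe_examples(pairs, vec, clf, translate):
    """pairs: (src, ref) tuples; translate: hypothetical MT hook, src -> hypothesis."""
    ter = TER()
    examples = []
    for src, ref in pairs:
        # Keep only pairs whose source side looks original, per the paper's finding.
        if clf.predict(vec.transform([src]))[0] != 1:
            continue
        hyp = translate(src)
        # TER of the MT output against the reference acts as a pseudo HTER label.
        hter = min(ter.sentence_score(hyp, [ref]).score / 100.0, 1.0)
        examples.append({"src": src, "mt": hyp, "pseudo_hter": hter})
    return examples
```

The resulting (source, MT output, pseudo-HTER) triples can then train a sentence-level QE model; word-level variants typically derive OK/BAD tags from the TER edit alignment under the same source-original filtering.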
Related papers
- Enhancing Translation Accuracy of Large Language Models through Continual Pre-Training on Parallel Data [13.587157318352869]
We propose a two-phase training approach where pre-trained large language models are continually pre-trained on parallel data.
We evaluate these methods on thirteen test sets for Japanese-to-English and English-to-Japanese translation.
arXiv Detail & Related papers (2024-07-03T14:23:36Z)
- APE-then-QE: Correcting then Filtering Pseudo Parallel Corpora for MT Training Data Creation [48.47548479232714]
We propose a repair-filter-use methodology that uses an APE system to correct errors on the target side of the Machine Translation training data.
We select the sentence pairs from the original and corrected sentence pairs based on the quality scores computed using a Quality Estimation (QE) model.
We observe an improvement in the Machine Translation system's performance of 5.64 and 9.91 BLEU points for English-Marathi and Marathi-English respectively, over the baseline model. A minimal sketch of this QE-based filtering step appears after this list.
arXiv Detail & Related papers (2023-12-18T16:06:18Z)
- Data Augmentation for Code Translation with Comparable Corpora and Multiple References [21.754147577489764]
We build and analyze multiple types of comparable corpora, including programs generated from natural language documentation.
To reduce overfitting to a single reference translation, we automatically generate additional translation references for available parallel data.
Experiments show that our data augmentation techniques significantly improve CodeT5 for translation between Java, Python, and C++ by an average of 7.5% Computational Accuracy.
arXiv Detail & Related papers (2023-11-01T06:01:22Z)
- Translating away Translationese without Parallel Data [14.423809260672877]
Translated texts exhibit systematic linguistic differences compared to original texts in the same language.
In this paper, we explore a novel approach to reduce translationese in translated texts: translation-based style transfer.
We show how we can eliminate the need for parallel validation data by combining the self-supervised loss with an unsupervised loss.
arXiv Detail & Related papers (2023-10-28T22:11:25Z)
- Rethink about the Word-level Quality Estimation for Machine Translation from Human Judgement [57.72846454929923]
We create a benchmark dataset, HJQE, in which expert translators directly annotate poorly translated words.
We propose two tag-correcting strategies, namely a tag refinement strategy and a tree-based annotation strategy, to make the TER-based artificial QE corpus closer to HJQE.
The results show our proposed dataset is more consistent with human judgement and also confirm the effectiveness of the proposed tag correcting strategies.
arXiv Detail & Related papers (2022-09-13T02:37:12Z)
- Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation [49.916963624249355]
A UNMT model is trained on pseudo parallel data with translated source sentences, yet it receives natural source sentences at inference.
This source discrepancy between training and inference hinders the translation performance of UNMT models.
We propose an online self-training approach, which simultaneously uses pseudo parallel data (natural source, translated target) to mimic the inference scenario; a sketch of this step also appears after this list.
arXiv Detail & Related papers (2022-03-16T04:50:27Z)
- On the Language Coverage Bias for Neural Machine Translation [81.81456880770762]
Language coverage bias is important for neural machine translation (NMT) because the target-original training data is not well exploited in current practice.
By carefully designing experiments, we provide comprehensive analyses of the language coverage bias in the training data.
We propose two simple and effective approaches to alleviate the language coverage bias problem.
arXiv Detail & Related papers (2021-06-07T01:55:34Z)
- Meta Back-translation [111.87397401837286]
We propose a novel method to generate pseudo-parallel data from a pre-trained back-translation model.
Our method is a meta-learning algorithm which adapts a pre-trained back-translation model so that the pseudo-parallel data it generates would train a forward-translation model to do well on a validation set.
arXiv Detail & Related papers (2021-02-15T20:58:32Z)
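As referenced above, a hedged sketch of the APE-then-QE repair-filter-use step. The `ape_correct` and `qe_score` interfaces are hypothetical stand-ins for the paper's APE and QE models, not names from the paper.

```python
# Hedged sketch of repair-filter-use: repair targets with APE, then keep the
# QE-preferred variant of each pair. `ape_correct` and `qe_score` are assumed
# interfaces, not names from the paper.
def repair_then_filter(pairs, ape_correct, qe_score, threshold=0.5):
    """pairs: (src, tgt) from a pseudo parallel corpus; qe_score: higher is better."""
    kept = []
    for src, tgt in pairs:
        candidates = [tgt, ape_correct(src, tgt)]  # original and repaired target
        score, best = max((qe_score(src, t), t) for t in candidates)
        if score >= threshold:                     # filter out low-quality pairs
            kept.append((src, best))
    return kept
```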
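Likewise, a minimal sketch of the online self-training idea from the UNMT paper above; `model.translate` and `model.train_step` are illustrative interfaces, not the authors' API.

```python
# Sketch of one online self-training step for UNMT. Back-translation yields
# (translated source, natural target) pairs; self-training adds
# (natural source, translated target) pairs to mimic inference, where the
# source side is always natural text. All interfaces here are assumptions.
def online_self_training_step(model, natural_src_batch, natural_tgt_batch):
    bt_src = [model.translate(t, direction="tgt2src") for t in natural_tgt_batch]
    st_tgt = [model.translate(s, direction="src2tgt") for s in natural_src_batch]
    model.train_step(src_batch=bt_src + natural_src_batch,
                     tgt_batch=natural_tgt_batch + st_tgt)
```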