There's no Data Like Better Data: Using QE Metrics for MT Data Filtering
- URL: http://arxiv.org/abs/2311.05350v1
- Date: Thu, 9 Nov 2023 13:21:34 GMT
- Title: There's no Data Like Better Data: Using QE Metrics for MT Data Filtering
- Authors: Jan-Thorsten Peter, David Vilar, Daniel Deutsch, Mara Finkelstein,
Juraj Juraska, Markus Freitag
- Abstract summary: We analyze the viability of using QE metrics to filter out low-quality sentence pairs from the training data of neural machine translation (NMT) systems.
We show that by selecting the highest-quality sentence pairs in the training data, we can improve translation quality while reducing the training data size by half.
- Score: 25.17221095970304
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Quality Estimation (QE), the evaluation of machine translation output
without the need for explicit references, has seen substantial improvements in
recent years with the use of neural metrics. In this paper we analyze the
viability of using QE metrics for filtering out low-quality sentence pairs in
the training data of neural machine translation (NMT) systems. While most corpus filtering methods
are focused on detecting noisy examples in collections of texts, usually huge
amounts of web crawled data, QE models are trained to discriminate more
fine-grained quality differences. We show that by selecting the highest quality
sentence pairs in the training data, we can improve translation quality while
reducing the training size by half. We also provide a detailed analysis of the
filtering results, which highlights the differences between the two approaches.
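The selection step the abstract describes can be sketched in a few lines: score every sentence pair with a reference-free QE metric and keep only the top-scoring half. The `qe_score` below is a hypothetical stand-in (a toy length-ratio heuristic) so the sketch runs self-contained; it is not the neural QE model used in the paper.

```python
def qe_score(src: str, tgt: str) -> float:
    """Toy stand-in for a reference-free QE model: penalize length mismatch."""
    ls, lt = len(src.split()), len(tgt.split())
    return 1.0 - abs(ls - lt) / max(ls, lt, 1)

def filter_top_half(pairs):
    """Keep the highest-scoring 50% of sentence pairs."""
    scored = sorted(pairs, key=lambda p: qe_score(*p), reverse=True)
    return scored[: max(1, len(scored) // 2)]

corpus = [
    ("the cat sat on the mat", "le chat est assis sur le tapis"),
    ("hello world", "bonjour le monde entier et tous"),
    ("good morning", "bonjour"),
    ("see you tomorrow", "a demain"),
]
kept = filter_top_half(corpus)  # half the corpus, highest QE scores first
```

In practice the heuristic scorer would be replaced by a trained neural QE model, and the 50% cutoff is the operating point the abstract reports rather than a tuned hyperparameter.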
Related papers
- ScalingFilter: Assessing Data Quality through Inverse Utilization of Scaling Laws [67.59263833387536]
ScalingFilter is a novel approach that evaluates text quality based on the perplexity difference between two language models trained on the same data.
To assess the bias introduced by quality filtering, we introduce semantic diversity, a metric that uses text embedding models to capture semantic representations.
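ScalingFilter's perplexity-difference idea can be illustrated with a toy model. Here two unigram models with different smoothing stand in for the smaller and larger language models (the paper trains two real LMs on the same data); text that the sharper "large" model handles comparatively better gets a positive perplexity gap, which serves as the quality score. The reference corpus, function names, and smoothing-as-capacity proxy are all illustrative assumptions.

```python
import math

# Tiny stand-in for the shared training data of the two language models.
REFERENCE = "the quick brown fox jumps over the lazy dog the fox".split()

def _unigram_ppl(text: str, smoothing: float) -> float:
    """Toy unigram perplexity; heavier smoothing mimics a lower-capacity model."""
    counts = {}
    for w in REFERENCE:
        counts[w] = counts.get(w, 0) + 1
    total, vocab = len(REFERENCE), len(counts) + 1
    log_p = 0.0
    words = text.split()
    for w in words:
        p = (counts.get(w, 0) + smoothing) / (total + smoothing * vocab)
        log_p += math.log(p)
    return math.exp(-log_p / max(len(words), 1))

def quality_score(text: str) -> float:
    """ScalingFilter-style score: perplexity gap between the two models."""
    ppl_small = _unigram_ppl(text, smoothing=5.0)   # blunter "small" model
    ppl_large = _unigram_ppl(text, smoothing=0.5)   # sharper "large" model
    return ppl_small - ppl_large  # larger gap suggests higher-quality text
```

In-domain text yields a positive gap (the sharper model benefits more), while gibberish yields a negative one, mirroring the intuition the summary describes.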
arXiv Detail & Related papers (2024-08-15T17:59:30Z)
- Evaluating Automatic Metrics with Incremental Machine Translation Systems [55.78547133890403]
We introduce a dataset comprising commercial machine translations, gathered weekly over six years across 12 translation directions.
We assume commercial systems improve over time, which enables us to evaluate machine translation (MT) metrics based on their preference for more recent translations.
arXiv Detail & Related papers (2024-07-03T17:04:17Z)
- APE-then-QE: Correcting then Filtering Pseudo Parallel Corpora for MT Training Data Creation [48.47548479232714]
We propose a repair-filter-use methodology that uses an APE system to correct errors on the target side of the Machine Translation training data.
We select the sentence pairs from the original and corrected sentence pairs based on the quality scores computed using a Quality Estimation (QE) model.
We observe an improvement in the Machine Translation system's performance by 5.64 and 9.91 BLEU points, for English-Marathi and Marathi-English, over the baseline model.
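The repair-filter-use procedure this entry summarizes can be sketched as: an APE step proposes a corrected target for each pair, then a QE score picks whichever version is better. Both `ape_correct` and `qe_score` below are hypothetical stand-ins (a toy typo fixer and a length/typo heuristic), not the paper's actual APE or QE models.

```python
TYPO_FIXES = {"teh": "the", "recieve": "receive"}

def ape_correct(tgt: str) -> str:
    """Toy APE stand-in: fix a few known typos on the target side."""
    return " ".join(TYPO_FIXES.get(w, w) for w in tgt.split())

def qe_score(src: str, tgt: str) -> float:
    """Toy QE stand-in: penalize length mismatch and residual typos."""
    ls, lt = len(src.split()), len(tgt.split())
    typos = sum(w in TYPO_FIXES for w in tgt.split())
    return 1.0 - abs(ls - lt) / max(ls, lt, 1) - 0.5 * typos

def repair_filter(pairs):
    """Keep whichever of (original, APE-corrected) target scores higher."""
    out = []
    for src, tgt in pairs:
        fixed = ape_correct(tgt)
        best = fixed if qe_score(src, fixed) >= qe_score(src, tgt) else tgt
        out.append((src, best))
    return out
```

The real pipeline additionally trains the downstream MT system on the selected pairs, which is where the reported BLEU gains come from.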
arXiv Detail & Related papers (2023-12-18T16:06:18Z)
- The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
- Machine Translation Impact in E-commerce Multilingual Search [0.0]
Cross-lingual information retrieval correlates highly with the quality of Machine Translation.
There may be a threshold beyond which improving query translation quality yields little or no further gain in retrieval performance.
arXiv Detail & Related papers (2023-01-31T21:59:35Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- Separating Grains from the Chaff: Using Data Filtering to Improve Multilingual Translation for Low-Resourced African Languages [0.6947064688250465]
This work describes our approach, which is based on filtering the given noisy data using a sentence-pair classifier.
We empirically validate our approach by evaluating on two common datasets and show that data filtering generally improves overall translation quality.
arXiv Detail & Related papers (2022-10-19T16:12:27Z)
- ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and the endangered language Cherokee.
It supports both statistical and neural translation models as well as provides quality estimation to inform users of reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z)
- Selecting Backtranslated Data from Multiple Sources for Improved Neural Machine Translation [8.554761233491236]
We analyse the impact that data translated with rule-based, phrase-based statistical and neural MT systems has on new MT systems.
We exploit different data selection strategies in order to reduce the amount of data used, while at the same time maintaining high-quality MT systems.
arXiv Detail & Related papers (2020-05-01T10:50:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.