Word Closure-Based Metamorphic Testing for Machine Translation
- URL: http://arxiv.org/abs/2312.12056v2
- Date: Mon, 22 Jul 2024 14:17:09 GMT
- Title: Word Closure-Based Metamorphic Testing for Machine Translation
- Authors: Xiaoyuan Xie, Shuo Jin, Songqiang Chen, Shing-Chi Cheung
- Abstract summary: We propose a word closure-based output comparison method to address the limitations of existing Metamorphic Testing (MT) methods for Machine Translation Systems (MTSs).
Our method significantly outperforms existing works in violation identification, improving both precision and recall.
It also increases the F1 score of translation error localization by 35.9%.
- Score: 8.009584342926646
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the wide application of machine translation, the testing of Machine Translation Systems (MTSs) has attracted much attention. Recent works apply Metamorphic Testing (MT) to address the oracle problem in MTS testing. Existing MT methods for MTSs generally follow a workflow of input transformation and output relation comparison: they generate a follow-up input sentence by mutating the source input, then compare the source and follow-up output translations to detect translation errors. These methods use various input transformations to generate test case pairs and have successfully triggered numerous translation errors. However, they have limitations in performing fine-grained and rigorous output relation comparison, and thus may report many false alarms and miss many true errors. In this paper, we propose a word closure-based output comparison method to address the limitations of existing MT methods for MTSs. We first propose the word closure as a new comparison unit, where each closure includes a group of correlated input and output words in the test case pair. Word closures suggest the linkages between the appropriate fragment in the source output translation and its counterpart in the follow-up output for comparison. Next, we compare semantics at the level of word closures to identify translation errors. In this way, we perform a fine-grained and rigorous semantic comparison of the outputs and thus realize more effective violation identification. We evaluate our method on test cases generated by five existing input transformations and translation outputs from three popular MTSs. Results show that our method significantly outperforms existing works in violation identification, improving precision and recall and achieving an average increase of 29.9% in F1 score. It also helps to increase the F1 score of translation error localization by 35.9%.
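To make the comparison pipeline concrete, below is a minimal Python sketch under stated assumptions: word correlations across the test-case pair are given as alignment edges between tagged tokens (however obtained), and embed() is any phrase encoder returning a vector. The tagging scheme, threshold, and all names are illustrative, not the authors' implementation.

```python
import numpy as np

def build_closures(alignment_edges):
    """Union-find over tokens. Tokens are (side, index, word) tuples, with
    side in {"si", "so", "fi", "fo"} (source/follow-up x input/output);
    tokens linked by any alignment edge end up in the same closure."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for a, b in alignment_edges:
        parent[find(a)] = find(b)
    groups = {}
    for tok in list(parent):
        groups.setdefault(find(tok), []).append(tok)
    return list(groups.values())

def flag_violations(closures, embed, threshold=0.8):
    """Compare, closure by closure, the source-output fragment against the
    follow-up-output fragment; low similarity flags a candidate error."""
    flagged = []
    for closure in closures:
        src_out = " ".join(w for side, _, w in sorted(closure) if side == "so")
        fol_out = " ".join(w for side, _, w in sorted(closure) if side == "fo")
        if not src_out or not fol_out:
            flagged.append(closure)   # one side unmatched: possible omission
            continue
        u, v = embed(src_out), embed(fol_out)
        sim = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))
        if sim < threshold:
            flagged.append(closure)   # semantic mismatch inside the closure
    return flagged
```

A closure whose two output fragments disagree semantically, or whose counterpart is missing, is reported as a violation; this is the fine-grained, closure-level comparison the abstract describes.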
Related papers
- Multilingual Contrastive Decoding via Language-Agnostic Layers Skipping [60.458273797431836]
Decoding by contrasting layers (DoLa) is designed to improve the generation quality of large language models.
We find that this approach does not work well on non-English tasks.
Inspired by previous interpretability work on language transition during the model's forward pass, we propose an improved contrastive decoding algorithm.
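For context, here is a minimal sketch of baseline layer-contrastive (DoLa-style) decoding, not the paper's improved multilingual variant; the gpt2 checkpoint, premature-layer index, and alpha are illustrative choices.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def contrastive_next_token(prompt, premature_layer=6, alpha=0.1):
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    head = model.get_output_embeddings()
    ln_f = model.transformer.ln_f  # apply the final layer norm to early states
    final = torch.log_softmax(head(out.hidden_states[-1][:, -1]), dim=-1)
    early = torch.log_softmax(head(ln_f(out.hidden_states[premature_layer][:, -1])), dim=-1)
    # Adaptive plausibility constraint: contrast only tokens the mature
    # (final) layer already deems plausible.
    keep = final >= final.max(dim=-1, keepdim=True).values + torch.log(torch.tensor(alpha))
    scores = torch.where(keep, final - early, torch.full_like(final, float("-inf")))
    return tok.decode(int(scores.argmax(dim=-1)[0]))
```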
arXiv Detail & Related papers (2024-07-15T15:14:01Z)
- TasTe: Teaching Large Language Models to Translate through Self-Reflection [82.83958470745381]
Large language models (LLMs) have exhibited remarkable performance in various natural language processing tasks.
We propose the TasTe framework, which stands for translating through self-reflection.
The evaluation results in four language directions on the WMT22 benchmark reveal the effectiveness of our approach compared to existing methods.
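A generic two-stage self-reflection loop in the spirit of TasTe; llm() stands for any chat-model call, and the prompts are invented for illustration rather than taken from the paper.

```python
def taste_style_translate(llm, src, src_lang="German", tgt_lang="English"):
    # Stage 1: draft a translation and self-assess its quality.
    draft = llm(
        f"Translate this {src_lang} sentence into {tgt_lang}, then label your "
        f"draft's quality as good, medium, or bad.\n{src}"
    )
    # Stage 2: refine the draft, guided by the self-assessment.
    return llm(
        f"Source ({src_lang}): {src}\n"
        f"Your draft and self-assessment:\n{draft}\n"
        f"Refine the draft and output only the final {tgt_lang} translation."
    )
```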
arXiv Detail & Related papers (2024-06-12T17:21:21Z)
- OTTAWA: Optimal TransporT Adaptive Word Aligner for Hallucination and Omission Translation Errors Detection [36.59354124910338]
OTTAWA is a word aligner specifically designed to enhance the detection of hallucinations and omissions in Machine Translation systems.
Our approach yields competitive results compared to state-of-the-art methods across 18 language pairs on the HalOmi benchmark.
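As a rough illustration of the optimal-transport idea behind such aligners (OTTAWA's actual formulation, e.g. its handling of null alignments for hallucinated or omitted words, is more involved), here is a plain Sinkhorn alignment over word embeddings:

```python
import numpy as np

def sinkhorn_align(src_emb, tgt_emb, reg=0.1, iters=200):
    """src_emb, tgt_emb: (n_words, dim) arrays of word embeddings."""
    s = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    t = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    C = 1.0 - s @ t.T                      # cosine-distance cost matrix
    K = np.exp(-C / reg)
    a = np.full(len(s), 1.0 / len(s))      # uniform source word mass
    b = np.full(len(t), 1.0 / len(t))      # uniform target word mass
    u = np.ones_like(a)
    for _ in range(iters):                 # Sinkhorn iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]        # transport plan = soft alignment
    return P.argmax(axis=1)                # hard alignment per source word
```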
arXiv Detail & Related papers (2024-06-04T03:00:55Z)
- Towards Effective Disambiguation for Machine Translation with Large Language Models [65.80775710657672]
We study the capabilities of large language models to translate "ambiguous sentences".
Experiments show that our methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions.
arXiv Detail & Related papers (2023-09-20T22:22:52Z)
- MuLER: Detailed and Scalable Reference-based Evaluation [24.80921931416632]
We propose a novel methodology that transforms any reference-based evaluation metric for text generation into a fine-grained analysis tool.
Given a system and a metric, MuLER quantifies how much the chosen metric penalizes specific error types.
We perform experiments in both synthetic and naturalistic settings to support MuLER's validity and showcase its usability.
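One hedged way to picture this idea (the paper's actual decomposition differs in detail): corrupt only tokens of a given linguistic type and measure the metric's average score drop. metric(), corrupt(), and token_type here are placeholders.

```python
from statistics import mean

def per_type_penalty(metric, pairs, token_type, corrupt):
    """pairs: (hypothesis, reference) tuples; corrupt(hyp, token_type)
    perturbs only tokens of that type (e.g. named entities, verbs)."""
    drops = []
    for hyp, ref in pairs:
        drops.append(metric(hyp, ref) - metric(corrupt(hyp, token_type), ref))
    return mean(drops)  # larger drop = metric penalizes this error type more
```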
arXiv Detail & Related papers (2023-05-24T10:26:13Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
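A minimal sketch of such a segment-level extrinsic check, assuming per-segment task outcomes are available; the metric choice and correlation statistic are illustrative, not the paper's exact setup.

```python
from sacrebleu.metrics import CHRF
from scipy.stats import kendalltau

chrf = CHRF()

def metric_vs_outcome(hyps, refs, task_success):
    """task_success[i] is 1 if the downstream task (e.g. cross-lingual QA)
    succeeded on segment i's translation, else 0."""
    scores = [chrf.sentence_score(h, [r]).score for h, r in zip(hyps, refs)]
    tau, pvalue = kendalltau(scores, task_success)
    return tau, pvalue  # near-zero tau = negligible correlation
```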
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- Rethink about the Word-level Quality Estimation for Machine Translation from Human Judgement [57.72846454929923]
We create a benchmark dataset, HJQE, in which expert translators directly annotate poorly translated words.
We propose two tag correcting strategies, namely a tag refinement strategy and a tree-based annotation strategy, to make the TER-based artificial QE corpus closer to HJQE.
The results show our proposed dataset is more consistent with human judgement and also confirm the effectiveness of the proposed tag correcting strategies.
arXiv Detail & Related papers (2022-09-13T02:37:12Z)
- Mismatching-Aware Unsupervised Translation Quality Estimation For Low-Resource Languages [6.049660810617423]
XLMRScore is a cross-lingual counterpart of BERTScore computed via the XLM-RoBERTa (XLMR) model.
We evaluate the proposed method on four low-resource language pairs of the WMT21 QE shared task.
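A minimal sketch of an XLMRScore-style computation: BERTScore's greedy token matching applied cross-lingually between source and translation using XLM-R embeddings. The paper's mismatching-aware adjustments are omitted, and the checkpoint is an illustrative choice.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base").eval()

def embed_tokens(text):
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        h = model(**enc).last_hidden_state[0, 1:-1]  # drop <s> and </s>
    return torch.nn.functional.normalize(h, dim=-1)

def xlmr_score(source, translation):
    s, t = embed_tokens(source), embed_tokens(translation)
    sim = s @ t.T                               # pairwise cosine similarity
    precision = sim.max(dim=0).values.mean()    # best match per translation token
    recall = sim.max(dim=1).values.mean()       # best match per source token
    return (2 * precision * recall / (precision + recall)).item()
```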
arXiv Detail & Related papers (2022-07-31T16:23:23Z)
- Principled Paraphrase Generation with Parallel Corpora [52.78059089341062]
We formalize the implicit similarity function induced by round-trip Machine Translation.
We show that it is susceptible to non-paraphrase pairs sharing a single ambiguous translation.
We design an alternative similarity metric that mitigates this issue.
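A toy numeric illustration of that failure mode, with invented probabilities: the German word "Bank" translates both "bench" and "bank", so the round-trip-induced similarity scores this non-paraphrase pair well above zero.

```python
# Round-trip-induced similarity: sim(x, y) = sum_z P(z | x) * P(y | z),
# where z ranges over pivot-language translations. Probabilities invented.
P_fwd = {"bench": {"Bank": 1.0}, "bank": {"Bank": 1.0}}   # English -> German
P_bwd = {"Bank": {"bench": 0.5, "bank": 0.5}}             # German -> English

def roundtrip_sim(x, y):
    return sum(p * P_bwd[z].get(y, 0.0) for z, p in P_fwd[x].items())

print(roundtrip_sim("bench", "bank"))  # 0.5, although the pair is no paraphrase
```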
arXiv Detail & Related papers (2022-05-24T17:22:42Z)
- SemMT: A Semantic-based Testing Approach for Machine Translation Systems [11.166336490280749]
We propose SemMT, an automatic testing approach for machine translation systems based on semantic similarity checking.
SemMT applies round-trip translation and measures the semantic similarity between the original and translated sentences.
We show that SemMT achieves higher effectiveness than state-of-the-art works.
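A minimal sketch of the round-trip check, where translate() stands for any MT API and the similarity model, language pair, and threshold are illustrative choices rather than SemMT's exact setup:

```python
from sentence_transformers import SentenceTransformer, util

sim_model = SentenceTransformer("all-MiniLM-L6-v2")

def semmt_check(sentence, translate, threshold=0.8):
    # Round trip: English -> Chinese -> English, then compare semantics.
    back = translate(translate(sentence, src="en", tgt="zh"), src="zh", tgt="en")
    score = util.cos_sim(sim_model.encode(sentence), sim_model.encode(back)).item()
    return score < threshold  # True -> report a suspicious translation
```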
arXiv Detail & Related papers (2020-12-03T10:42:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.