SemMT: A Semantic-based Testing Approach for Machine Translation Systems
- URL: http://arxiv.org/abs/2012.01815v1
- Date: Thu, 3 Dec 2020 10:42:56 GMT
- Title: SemMT: A Semantic-based Testing Approach for Machine Translation Systems
- Authors: Jialun Cao and Meiziniu Li and Yeting Li and Ming Wen and Shing-Chi Cheung
- Abstract summary: We propose SemMT, an automatic testing approach for machine translation systems based on semantic similarity checking.
SemMT applies round-trip translation and measures the semantic similarity between the original and translated sentences.
We show that SemMT achieves higher effectiveness than state-of-the-art approaches.
- Score: 11.166336490280749
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine translation has wide applications in daily life. In mission-critical
applications such as translating official documents, incorrect translation can
have unpleasant or sometimes catastrophic consequences. This motivates recent
research on testing methodologies for machine translation systems. Existing
methodologies mostly rely on metamorphic relations designed at the textual
level (e.g., Levenshtein distance) or syntactic level (e.g., the distance
between grammar structures) to determine the correctness of translation
results. However, these metamorphic relations do not consider whether the
original and translated sentences have the same meaning (i.e., semantic
similarity). Therefore, in this paper, we propose SemMT, an automatic testing
approach for machine translation systems based on semantic similarity checking.
SemMT applies round-trip translation and measures the semantic similarity
between the original and translated sentences. Our insight is that the
semantics expressed by logical and numeric constraints in sentences can be
captured using regular expressions (or deterministic finite automata), for
which efficient equivalence and similarity checking algorithms are available.
Leveraging this insight, we propose three semantic similarity metrics and implement them in
SemMT. Experimental results show that SemMT achieves higher effectiveness
than state-of-the-art approaches, with increases of 21% and 23% in accuracy
and F-score, respectively. We also explore potential improvements that
can be achieved when proper combinations of metrics are adopted. Finally, we
discuss a solution to locate the suspicious trip in round-trip translation,
which may shed light on further exploration.
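To make the round-trip idea concrete, here is a minimal Python sketch, assuming hypothetical `translate` and `to_regex` plug-ins (an MT API and a sentence-to-regex extractor); the bounded-language Jaccard similarity stands in for the paper's three metrics and for exact automata-based checking.

```python
# A minimal sketch of SemMT-style checking, NOT the authors' implementation.
# `translate(text, src, tgt)` and `to_regex(sentence)` are hypothetical
# plug-ins; the alphabet, length bound, and threshold are assumptions.
import itertools
import re

def bounded_language(pattern, alphabet="ab01", max_len=6):
    """All strings over `alphabet` up to `max_len` accepted by the regex."""
    compiled = re.compile(pattern)
    return {
        "".join(chars)
        for n in range(max_len + 1)
        for chars in itertools.product(alphabet, repeat=n)
        if compiled.fullmatch("".join(chars))
    }

def regex_similarity(p1, p2):
    """Jaccard overlap of bounded languages: a cheap stand-in for
    DFA-based equivalence/similarity checking."""
    l1, l2 = bounded_language(p1), bounded_language(p2)
    return 1.0 if not (l1 | l2) else len(l1 & l2) / len(l1 | l2)

def round_trip_check(sentence, translate, to_regex, threshold=0.8):
    """Flag a suspicious translation when the regex-level semantics of the
    original and the round-tripped sentence diverge."""
    back = translate(translate(sentence, "en", "zh"), "zh", "en")
    score = regex_similarity(to_regex(sentence), to_regex(back))
    return score >= threshold, score
```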
Related papers
- An approach for mistranslation removal from popular dataset for Indic MT Task [5.4755933832880865]
We propose an algorithm to remove mistranslations from the training corpus and evaluate its performance and efficiency.
Two Indic languages (ILs), namely, Hindi (HIN) and Odia (ODI) are chosen for the experiment.
The quality of the translations in the experiment is evaluated using standard metrics such as BLEU, METEOR, and RIBES.
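The filter-then-evaluate loop might look like the following sketch; `quality_score` is a hypothetical plug-in (e.g., a quality-estimation model), and only BLEU via sacrebleu is shown (METEOR and RIBES would be computed analogously).

```python
# A hedged sketch of mistranslation filtering: drop parallel pairs whose
# sentence-level quality score falls below a threshold, then evaluate a
# model's outputs with corpus-level BLEU. The scorer and threshold are
# illustrative, not the paper's algorithm.
import sacrebleu

def filter_corpus(pairs, quality_score, threshold=0.5):
    """Keep only (source, target) pairs the scorer considers faithful."""
    return [(s, t) for s, t in pairs if quality_score(s, t) >= threshold]

def evaluate_bleu(hypotheses, references):
    """Corpus BLEU over the evaluation set."""
    return sacrebleu.corpus_bleu(hypotheses, [references]).score
```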
arXiv Detail & Related papers (2024-01-12T06:37:19Z)
- Crossing the Threshold: Idiomatic Machine Translation through Retrieval Augmentation and Loss Weighting [66.02718577386426]
We provide a simple characterization of idiomatic translation and related issues.
We conduct a synthetic experiment revealing a tipping point at which transformer-based machine translation models correctly default to idiomatic translations.
To improve translation of natural idioms, we introduce two straightforward yet effective techniques.
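The loss-weighting technique could, for instance, take the following shape in PyTorch; the idiom mask and the weight value are illustrative assumptions rather than the paper's exact recipe.

```python
# A minimal PyTorch sketch of loss weighting: upweight target tokens that
# belong to idioms so the model does not default to literal translations.
# The idiom detector producing `idiom_mask` and the weight are assumptions.
import torch
import torch.nn.functional as F

def weighted_nll(logits, targets, idiom_mask, idiom_weight=2.0):
    """Token-level cross-entropy where idiom tokens count extra.

    logits:     (batch, seq, vocab) decoder outputs
    targets:    (batch, seq) gold token ids
    idiom_mask: (batch, seq) 1.0 where the target token belongs to an idiom
    """
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none")  # (batch, seq)
    weights = 1.0 + (idiom_weight - 1.0) * idiom_mask
    return (per_token * weights).mean()
```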
arXiv Detail & Related papers (2023-10-10T23:47:25Z)
- Towards Effective Disambiguation for Machine Translation with Large Language Models [65.80775710657672]
We study the capabilities of large language models to translate "ambiguous sentences".
Experiments show that our methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions.
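One generic way to steer an LLM on ambiguous inputs is few-shot prompting with disambiguating context, sketched below; the example sentences, prompt format, and the notion of a completion client are illustrative assumptions and not necessarily the paper's method.

```python
# A hedged sketch of a disambiguating few-shot prompt. Feed the resulting
# string to any chat/completions client; none is assumed here.
FEW_SHOT = """\
Translate the English sentence to German, resolving ambiguity from context.

English: The fisherman went to the bank. (context: river)
German: Der Fischer ging zum Ufer.

English: She deposited money at the bank. (context: finance)
German: Sie zahlte Geld bei der Bank ein.

English: {sentence} (context: {context})
German:"""

def disambiguating_prompt(sentence: str, context: str) -> str:
    """Build a prompt whose examples demonstrate context-driven sense choice."""
    return FEW_SHOT.format(sentence=sentence, context=context)
```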
arXiv Detail & Related papers (2023-09-20T22:22:52Z)
- MuLER: Detailed and Scalable Reference-based Evaluation [24.80921931416632]
We propose a novel methodology that transforms any reference-based evaluation metric for text generation into a fine-grained analysis tool.
Given a system and a metric, MuLER quantifies how much the chosen metric penalizes specific error types.
We perform experiments in both synthetic and naturalistic settings to support MuLER's validity and showcase its usability.
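A rough way to approximate this per-error-type analysis is to inject a single error type into otherwise-perfect outputs and measure the metric's score drop, as in the sketch below; this illustrates the idea but is not MuLER's actual decomposition.

```python
# A hedged sketch of quantifying how harshly a reference-based metric
# penalizes one error type. The corruption function is illustrative.
import sacrebleu

def penalty_for_error_type(references, corrupt, metric=None):
    """Score drop caused by applying `corrupt` (one error type) to refs."""
    if metric is None:
        metric = lambda hyps, refs: sacrebleu.corpus_bleu(hyps, [refs]).score
    perfect = metric(references, references)          # upper bound
    corrupted = metric([corrupt(r) for r in references], references)
    return perfect - corrupted

# Example error type: drop sentence-final punctuation.
drop_punct = lambda s: s.rstrip(".!?")
```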
arXiv Detail & Related papers (2023-05-24T10:26:13Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
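A segment-level check in this spirit can be set up in a few lines: score each hypothesis with chrF and correlate the scores with downstream outcomes; the task-outcome vector here is a hypothetical stand-in for the paper's cross-lingual tasks.

```python
# A minimal sketch of correlating a segment-level MT metric with
# downstream task success (e.g., a list of 0/1 outcomes per segment).
from sacrebleu.metrics import CHRF
from scipy.stats import pearsonr

def segment_metric_vs_task(hyps, refs, task_outcomes):
    """Pearson correlation between per-segment chrF and task outcomes."""
    chrf = CHRF()
    seg_scores = [chrf.sentence_score(h, [r]).score
                  for h, r in zip(hyps, refs)]
    r, p = pearsonr(seg_scores, task_outcomes)
    return r, p
```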
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- NMTScore: A Multilingual Analysis of Translation-based Text Similarity Measures [42.46681912294797]
We analyze translation-based similarity measures in the common framework of multilingual NMT.
Compared to baselines such as sentence embeddings, translation-based measures prove competitive in paraphrase identification.
Measures show a relatively high correlation to human judgments.
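The core quantity behind such measures can be sketched as a symmetrized translation likelihood; `nmt_logprob` is a hypothetical plug-in returning the mean per-token log-probability of the target under any translation model, and the paper defines several more refined variants.

```python
# A hedged sketch of direct-translation similarity: how plausibly does an
# NMT model translate a into b, and b into a? `nmt_logprob(src, tgt)` is
# a hypothetical plug-in (any seq2seq model's length-normalized score).
import math

def translation_similarity(a: str, b: str, nmt_logprob) -> float:
    """Symmetrized, exponentiated cross-translation likelihood in (0, 1]."""
    return 0.5 * (math.exp(nmt_logprob(a, b)) + math.exp(nmt_logprob(b, a)))
```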
arXiv Detail & Related papers (2022-04-28T17:57:17Z)
- Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlap frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions at their positions; the divergence between these distributions serves as the distance, called neighboring distribution divergence (NDD).
Experiments on Semantic Textual Similarity show NDD to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
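The mask-and-predict strategy can be illustrated as follows, simplifying the longest-common-sequence alignment to a single shared word; the model choice and the symmetric KL distance are assumptions of this sketch.

```python
# A hedged sketch of mask-and-predict: mask the same shared word in each
# text, let an MLM predict a distribution for the slot, and use the
# divergence between the two distributions as a semantic distance.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def masked_distribution(text: str, word: str) -> torch.Tensor:
    """MLM distribution at the position of `word`, replaced by [MASK]."""
    masked = text.replace(word, tok.mask_token, 1)
    enc = tok(masked, return_tensors="pt")
    pos = (enc.input_ids[0] == tok.mask_token_id).nonzero()[0, 0]
    with torch.no_grad():
        logits = mlm(**enc).logits[0, pos]
    return torch.softmax(logits, dim=-1)

def neighbor_distance(text_a: str, text_b: str, shared_word: str) -> float:
    """Symmetric KL divergence between the two predicted distributions."""
    p = masked_distribution(text_a, shared_word)
    q = masked_distribution(text_b, shared_word)
    kl = lambda x, y: torch.sum(x * (x.log() - y.log()))
    return (0.5 * (kl(p, q) + kl(q, p))).item()
```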
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
- When Does Translation Require Context? A Data-driven, Multilingual Exploration [71.43817945875433]
Proper handling of discourse significantly contributes to the quality of machine translation (MT).
Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation.
We develop the Multilingual Discourse-Aware benchmark, a series of taggers that identify and evaluate model performance on discourse phenomena.
arXiv Detail & Related papers (2021-09-15T17:29:30Z)
- On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity.
This paper concerns itself with reference-free machine translation (MT) evaluation, where source texts are directly compared to (sometimes low-quality) system translations.
We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER.
We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
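The reference-free setup under study reduces to comparing source and translation in a shared multilingual space; the sketch below uses LaBSE from sentence-transformers as a stand-in for the M-BERT and LASER representations examined in the paper.

```python
# A minimal sketch of reference-free scoring: cosine similarity between a
# source sentence and a system translation in a multilingual embedding
# space. LaBSE is a stand-in choice, not the paper's exact encoders.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/LaBSE")

def reference_free_score(source: str, translation: str) -> float:
    """Higher cosine similarity suggests a more faithful translation."""
    src_vec, hyp_vec = encoder.encode([source, translation],
                                      convert_to_tensor=True)
    return util.cos_sim(src_vec, hyp_vec).item()
```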
arXiv Detail & Related papers (2020-05-03T22:10:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.