Towards General Error Diagnosis via Behavioral Testing in Machine Translation
- URL: http://arxiv.org/abs/2310.13362v1
- Date: Fri, 20 Oct 2023 09:06:41 GMT
- Title: Towards General Error Diagnosis via Behavioral Testing in Machine Translation
- Authors: Junjie Wu, Lemao Liu, Dit-Yan Yeung
- Abstract summary: This paper proposes BTPGBT, a new framework for conducting behavioral testing of machine translation (MT) systems.
The core idea of BTPGBT is to employ a novel bilingual translation pair generation approach.
Experimental results on various MT systems demonstrate that BTPGBT can provide comprehensive and accurate behavioral testing results.
- Score: 48.108393938462974
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Behavioral testing offers a crucial means of diagnosing linguistic errors and
assessing capabilities of NLP models. However, applying behavioral testing to
machine translation (MT) systems is challenging as it generally requires human
efforts to craft references for evaluating the translation quality of such
systems on newly generated test cases. Existing works in behavioral testing of
MT systems circumvent this by evaluating translation quality without
references, but this restricts diagnosis to specific types of errors, such as
incorrect translation of single numeric or currency words. In order to diagnose
general errors, this paper proposes a new Bilingual Translation Pair Generation
based Behavior Testing (BTPGBT) framework for conducting behavioral testing of
MT systems. The core idea of BTPGBT is to employ a novel bilingual translation
pair generation (BTPG) approach that automates the construction of high-quality
test cases and their pseudoreferences. Experimental results on various MT
systems demonstrate that BTPGBT could provide comprehensive and accurate
behavioral testing results for general error diagnosis, which further leads to
several insightful findings. Our code and data are available at
https://github.com/wujunjie1998/BTPGBT.
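
The abstract describes the pipeline only at a high level; the sketch below shows one minimal way such a reference-based behavioral test loop could be wired together. It is an illustration under stated assumptions, not the authors' implementation: perturb, pseudo_reference, translate, and score are hypothetical stand-ins for BTPG's test-case generator, its pseudo-reference construction, the MT system under test, and a sentence-level quality metric.

```python
# Minimal sketch of reference-based behavioral testing of an MT system.
# All callables are hypothetical stand-ins, not the BTPGBT code.

from typing import Callable, List, Tuple

def behavioral_test(
    sources: List[str],
    perturb: Callable[[str], str],           # builds a new test case from a seed sentence
    pseudo_reference: Callable[[str], str],  # e.g. a strong model producing a reference
    translate: Callable[[str], str],         # the MT system under test
    score: Callable[[str, str], float],      # quality metric against the reference
    threshold: float = 0.5,
) -> List[Tuple[str, str, float]]:
    """Return test cases whose translation quality falls below `threshold`."""
    failures = []
    for src in sources:
        case = perturb(src)              # newly generated test case
        ref = pseudo_reference(case)     # automatic pseudo-reference
        hyp = translate(case)            # system output to diagnose
        s = score(hyp, ref)
        if s < threshold:
            failures.append((case, hyp, s))
    return failures
```

The piece BTPGBT automates is pseudo_reference: once references are generated rather than hand-written, diagnosis is no longer restricted to error types that can be checked reference-free.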
Related papers
- Context-Aware Testing: A New Paradigm for Model Testing with Large Language Models [49.06068319380296]
We introduce context-aware testing (CAT) which uses context as an inductive bias to guide the search for meaningful model failures.
We instantiate the first CAT system, SMART Testing, which employs large language models to hypothesize relevant and likely failures.
arXiv Detail & Related papers (2024-10-31T15:06:16Z)
- Test Generation Strategies for Building Failure Models and Explaining Spurious Failures [4.995172162560306]
Test inputs fail not only when the system under test is faulty but also when the inputs are invalid or unrealistic.
We propose to build failure models for inferring interpretable rules on test inputs that cause spurious failures.
We show that our proposed surrogate-assisted approach generates failure models with an average accuracy of 83%.
arXiv Detail & Related papers (2023-12-09T18:36:15Z)
- Automating Behavioral Testing in Machine Translation [9.151054827967933]
We propose to use Large Language Models to generate source sentences tailored to test the behavior of Machine Translation models.
We can then verify whether the MT model exhibits the expected behavior through matching candidate sets.
Our approach aims to make behavioral testing of MT systems practical while requiring only minimal human effort.
arXiv Detail & Related papers (2023-09-05T19:40:45Z)
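
The candidate-set matching step can be pictured with a toy check like the one below; the candidate set and the substring-matching rule are illustrative assumptions, not the paper's actual procedure.

```python
# Sketch of candidate-set matching: the MT output passes the behavioral
# test if it contains at least one acceptable rendering of the tested
# phrase. The candidate set here is invented for illustration.

def passes_behavioral_check(output: str, candidates: set[str]) -> bool:
    out = output.lower()
    return any(c.lower() in out for c in candidates)

# e.g. testing the German->English translation of "Erdferkel"
candidates = {"aardvark", "ant bear"}
print(passes_behavioral_check("The aardvark sleeps by day.", candidates))  # True
```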
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear whether automatic metrics can reliably distinguish good translations from bad ones at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
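
Segment-level extrinsic evaluation of this kind amounts to correlating per-segment metric scores with downstream outcomes. A hypothetical example (synthetic numbers; scipy's Kendall tau as one common choice of rank correlation):

```python
# Correlate per-segment metric scores with binary downstream outcomes.
# The numbers are synthetic, chosen so the correlation comes out at zero.

from scipy.stats import kendalltau

metric_scores = [0.91, 0.42, 0.77, 0.30, 0.85]  # e.g. COMET per segment
task_success  = [1,    0,    1,    1,    0]     # downstream task outcome

tau, _ = kendalltau(metric_scores, task_success)
print(f"Kendall tau = {tau:.2f}")  # 0.00: metric rank says nothing about success
```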
- SALTED: A Framework for SAlient Long-Tail Translation Error Detection [17.914521288548844]
We introduce SALTED, a specifications-based framework for behavioral testing of machine translation models.
At the core of our approach is the development of high-precision detectors that flag errors between a source sentence and a system output.
We demonstrate that such detectors could be used not just to identify salient long-tail errors in MT systems, but also for higher-recall filtering of the training data.
arXiv Detail & Related papers (2022-05-20T06:45:07Z)
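
What such a high-precision detector might look like can be sketched as follows; this untranslated-span check is an invented example in the spirit of SALTED, not one of the paper's detectors.

```python
# Sketch of a SALTED-style high-precision detector: flag outputs that
# copy a long span of the source verbatim, suggesting untranslated text.
# The span-length threshold is an illustrative choice.

def untranslated_span_detector(source: str, output: str, min_len: int = 4) -> bool:
    src_tokens = source.lower().split()
    out = f" {output.lower()} "
    for i in range(len(src_tokens) - min_len + 1):
        span = " ".join(src_tokens[i : i + min_len])
        if f" {span} " in out:
            return True  # error flagged
    return False

print(untranslated_span_detector(
    "der schnelle braune fuchs springt",
    "the der schnelle braune fuchs jumps"))  # True: 4-token source span copied
```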
- Variance-Aware Machine Translation Test Sets [19.973201669851626]
We release 70 small and discriminative test sets for machine translation (MT) evaluation, called variance-aware test sets (VAT).
VAT is automatically created by a novel variance-aware filtering method that filters the indiscriminative test instances of the current MT test sets without any human labor.
arXiv Detail & Related papers (2021-11-07T13:18:59Z)
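
The filtering idea can be illustrated with a small sketch: score every test instance with several MT systems and keep only the instances whose scores vary across systems, since near-identical scores cannot discriminate between systems. The data and keep ratio below are illustrative assumptions.

```python
# Variance-aware filtering sketch: rank test instances by the variance of
# their per-system scores and keep the most discriminative half.

import statistics

# rows: test instances; columns: sentence-level scores from different MT systems
scores = [
    [0.80, 0.81, 0.79, 0.80],  # indiscriminative: systems score alike
    [0.95, 0.40, 0.70, 0.55],  # discriminative: systems disagree
    [0.60, 0.62, 0.61, 0.60],
    [0.30, 0.90, 0.50, 0.75],
]

keep_ratio = 0.5
ranked = sorted(range(len(scores)),
                key=lambda i: statistics.variance(scores[i]), reverse=True)
kept = sorted(ranked[: int(len(scores) * keep_ratio)])
print(kept)  # [1, 3]: only the discriminative instances survive
```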
- As Easy as 1, 2, 3: Behavioural Testing of NMT Systems for Numerical Translation [51.20569527047729]
Mistranslated numbers have the potential to cause serious effects, such as financial loss or medical misinformation.
We develop comprehensive assessments of the robustness of neural machine translation systems to numerical text via behavioural testing.
arXiv Detail & Related papers (2021-07-18T04:09:47Z)
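
A behavioural test for numbers can be as simple as checking that every number in the source survives into the translation, as in this simplified sketch. The normalisation and examples are assumptions; the paper's tests cover a much wider range of numerical formats.

```python
# Simplified numerical behavioural test: every source number should
# reappear (possibly reformatted) in the translation.

import re

def numbers(text: str) -> set[str]:
    # strip thousands separators so "1,250.50" and "1250.50" compare equal
    return {n.replace(",", "") for n in re.findall(r"\d[\d,]*(?:\.\d+)?", text)}

def numbers_preserved(source: str, translation: str) -> bool:
    return numbers(source) <= numbers(translation)

print(numbers_preserved("The invoice totals 1,250.50 dollars.",
                        "La facture s'élève à 1250.50 dollars."))  # True
print(numbers_preserved("He ran 42 km.", "Il a couru 24 km."))     # False
```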
- Zero-Shot Cross-lingual Semantic Parsing [56.95036511882921]
We study cross-lingual semantic parsing as a zero-shot problem without parallel data for 7 test languages.
We propose a multi-task encoder-decoder model to transfer parsing knowledge to additional languages using only English-Logical form paired data.
Our system frames zero-shot parsing as a latent-space alignment problem and finds that pre-trained models can be improved to generate logical forms with minimal cross-lingual transfer penalty.
arXiv Detail & Related papers (2021-04-15T16:08:43Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
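
One way to simulate such errors on clean text is to inject a single well-attested error type, for example article deletion; the sketch below is an illustrative stand-in for the paper's attack procedure, which is driven by errors collected from non-native speakers.

```python
# Inject one common non-native grammatical error (article deletion) into
# clean text. Real attacks sample error types and rates from learner data.

import random

def drop_articles(sentence: str, p: float = 1.0, seed: int = 0) -> str:
    rng = random.Random(seed)
    kept = [w for w in sentence.split()
            if w.lower() not in {"a", "an", "the"} or rng.random() > p]
    return " ".join(kept)

print(drop_articles("The model reads a sentence from the corpus."))
# -> "model reads sentence from corpus."
```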