Automating Behavioral Testing in Machine Translation
- URL: http://arxiv.org/abs/2309.02553v3
- Date: Fri, 3 Nov 2023 00:13:25 GMT
- Title: Automating Behavioral Testing in Machine Translation
- Authors: Javier Ferrando, Matthias Sperber, Hendra Setiawan, Dominic Telaar, Saša Hasan
- Abstract summary: We propose to use Large Language Models to generate source sentences tailored to test the behavior of Machine Translation models.
We can then verify whether the MT model exhibits the expected behavior through matching candidate sets.
Our approach aims to make behavioral testing of MT systems practical while requiring only minimal human effort.
- Score: 9.151054827967933
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Behavioral testing in NLP allows fine-grained evaluation of systems by
examining their linguistic capabilities through the analysis of input-output
behavior. Unfortunately, existing work on behavioral testing in Machine
Translation (MT) is currently restricted to largely handcrafted tests covering
a limited range of capabilities and languages. To address this limitation, we
propose to use Large Language Models (LLMs) to generate a diverse set of source
sentences tailored to test the behavior of MT models in a range of situations.
We can then verify whether the MT model exhibits the expected behavior through
matching candidate sets that are also generated using LLMs. Our approach aims
to make behavioral testing of MT systems practical while requiring only minimal
human effort. In our experiments, we apply our proposed evaluation framework to
assess multiple available MT systems, revealing that while in general
pass-rates follow the trends observable from traditional accuracy-based
metrics, our method was able to uncover several important differences and
potential bugs that go unnoticed when relying only on accuracy.
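As a rough illustration of the approach described above, the Python sketch below (not from the paper; the function names, the substring-based matching rule, and the example test cases are all illustrative assumptions) shows how a pass-rate could be computed by checking each MT output against a set of acceptable candidates:

```python
# Minimal, hypothetical sketch of the candidate-matching step: each behavioral
# test case pairs an LLM-generated source sentence with a set of acceptable
# target-side candidates, and a test passes if the MT output matches one of them.
# Names and the matching rule are illustrative, not the paper's actual interface.

from typing import Callable, Dict, List


def passes(translation: str, candidates: List[str]) -> bool:
    """Return True if the translation matches any acceptable candidate.

    Matching is simplified here to case-insensitive substring lookup.
    """
    t = translation.lower()
    return any(c.lower() in t for c in candidates)


def pass_rate(test_cases: List[Dict], mt_translate: Callable[[str], str]) -> float:
    """Fraction of behavioral test cases for which the MT system behaves as expected."""
    results = [
        passes(mt_translate(case["source"]), case["candidates"])
        for case in test_cases
    ]
    return sum(results) / len(results) if results else 0.0


# Hypothetical usage: checking that numbers survive translation.
if __name__ == "__main__":
    cases = [
        {"source": "The meeting starts at 7:30 pm.", "candidates": ["7:30"]},
        {"source": "She paid 1,250 dollars in cash.", "candidates": ["1,250", "1250"]},
    ]
    identity_mt = lambda s: s  # stand-in for a real MT system call
    print(f"pass-rate: {pass_rate(cases, identity_mt):.2f}")
```

In the paper's setting, both the source sentences and the candidate sets would be produced by an LLM prompted for a specific capability and language pair; the sketch only covers the final matching and scoring step.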
Related papers
- Context-Aware Testing: A New Paradigm for Model Testing with Large Language Models [49.06068319380296]
We introduce context-aware testing (CAT) which uses context as an inductive bias to guide the search for meaningful model failures.
We instantiate the first CAT system, SMART Testing, which employs large language models to hypothesize relevant and likely failures.
arXiv Detail & Related papers (2024-10-31T15:06:16Z) - Towards General Error Diagnosis via Behavioral Testing in Machine
Translation [48.108393938462974]
This paper proposes BTPGBT, a new framework for conducting behavioral testing of machine translation (MT) systems.
The core idea of BTPGBT is to employ a novel bilingual translation pair generation approach.
Experimental results on various MT systems demonstrate that BTPGBT could provide comprehensive and accurate behavioral testing results.
arXiv Detail & Related papers (2023-10-20T09:06:41Z) - The Devil is in the Errors: Leveraging Large Language Models for
Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z) - Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models [57.80514758695275]
Using large language models (LLMs) for assessing the quality of machine translation (MT) achieves state-of-the-art performance at the system level.
We propose a new prompting method called Error Analysis Prompting (EAPrompt).
This technique emulates the commonly accepted human evaluation framework, Multidimensional Quality Metrics (MQM), and produces explainable and reliable MT evaluations at both the system and segment level.
arXiv Detail & Related papers (2023-03-24T05:05:03Z) - HilMeMe: A Human-in-the-Loop Machine Translation Evaluation Metric
Looking into Multi-Word Expressions [6.85316573653194]
We describe the design and implementation of a linguistically motivated, human-in-the-loop evaluation metric focused on idiomatic and terminological Multi-Word Expressions (MWEs).
MWEs can serve as one of the main factors for distinguishing MT systems, by examining their ability to recognise and translate MWEs accurately and in a meaning-equivalent manner.
arXiv Detail & Related papers (2022-11-09T21:15:40Z) - A Probabilistic Framework for Mutation Testing in Deep Neural Networks [12.033944769247958]
We propose a Probabilistic Mutation Testing (PMT) approach that alleviates the inconsistency of mutation testing outcomes for deep neural networks.
PMT enables more consistent and informed decisions on mutations through a probabilistic evaluation.
arXiv Detail & Related papers (2022-08-11T19:45:14Z) - Can NMT Understand Me? Towards Perturbation-based Evaluation of NMT
Models for Code Generation [1.7616042687330642]
A key step to validate the robustness of the NMT models is to evaluate their performance on adversarial inputs.
In this work, we identify a set of perturbations and metrics tailored for the robustness assessment of such models.
We present a preliminary experimental evaluation, showing what type of perturbations affect the model the most.
arXiv Detail & Related papers (2022-03-29T08:01:39Z) - Variance-Aware Machine Translation Test Sets [19.973201669851626]
We release 70 small and discriminative test sets for machine translation (MT) evaluation, called variance-aware test sets (VAT).
VAT is created automatically by a novel variance-aware filtering method that removes indiscriminative test instances from current MT test sets without any human labor.
arXiv Detail & Related papers (2021-11-07T13:18:59Z) - When Does Translation Require Context? A Data-driven, Multilingual
Exploration [71.43817945875433]
Proper handling of discourse significantly contributes to the quality of machine translation (MT).
Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation.
We develop the Multilingual Discourse-Aware benchmark, a series of taggers that identify and evaluate model performance on discourse phenomena.
arXiv Detail & Related papers (2021-09-15T17:29:30Z) - As Easy as 1, 2, 3: Behavioural Testing of NMT Systems for Numerical
Translation [51.20569527047729]
Mistranslated numbers have the potential to cause serious effects, such as financial loss or medical misinformation.
We develop comprehensive assessments of the robustness of neural machine translation systems to numerical text via behavioural testing.
arXiv Detail & Related papers (2021-07-18T04:09:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all listed papers) and is not responsible for any consequences arising from its use.