An Overview on Machine Translation Evaluation
- URL: http://arxiv.org/abs/2202.11027v1
- Date: Tue, 22 Feb 2022 16:58:28 GMT
- Title: An Overview on Machine Translation Evaluation
- Authors: Lifeng Han
- Abstract summary: Machine translation (MT) has become one of the important tasks in AI research and development.
The task of MT evaluation is not only to assess the quality of machine translation output, but also to give timely feedback to machine translation researchers.
This report mainly covers a brief history of machine translation evaluation (MTE), a classification of research methods on MTE, and the cutting-edge progress.
- Score: 6.85316573653194
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Since the 1950s, machine translation (MT) has become one of the
important tasks in AI research and development, and it has passed through
several periods and stages, including rule-based methods, statistical
methods, and the recently proposed neural network-based learning methods.
Accompanying these staged leaps has been research on MT evaluation, which
plays an especially important role in statistical and neural translation
research. The task of MT evaluation is not only to assess the quality of
machine translation output, but also to give machine translation researchers
timely feedback on the problems in the translations themselves and on how to
improve and optimise their systems. In some practical settings, such as when
reference translations are unavailable, quality estimation of machine
translation serves as an indicator of the credibility of the automatically
translated target-language output. This report mainly covers the following:
a brief history of machine translation evaluation (MTE), a classification of
research methods on MTE, and the cutting-edge progress, including human
evaluation, automatic evaluation, and the evaluation of evaluation methods
(meta-evaluation). Both manual and automatic evaluation include
reference-translation-based and reference-translation-independent
approaches; automatic evaluation methods include traditional n-gram string
matching, models applying syntax and semantics, and deep learning models;
the evaluation of evaluation methods includes estimating the credibility of
human evaluations, the reliability of automatic evaluations, the reliability
of the test set, and so on. Advances in cutting-edge evaluation methods
include task-based evaluation, pre-trained language models built on big
data, and lightweight models optimised with distillation techniques.
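To make the "traditional n-gram string matching" family of automatic metrics concrete, the sketch below computes a clipped (modified) n-gram precision, the core quantity behind BLEU-style scores. It is a minimal illustration under simplifying assumptions (single reference, whitespace tokenisation), not the implementation of any particular toolkit; the function names and toy sentences are invented for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_ngram_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in the reference (BLEU-style clipping)."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return matched / total if total else 0.0


# Toy example with invented sentences.
candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
for n in (1, 2):
    print(f"{n}-gram precision:",
          round(modified_ngram_precision(candidate, reference, n), 3))
```

Full BLEU combines such precisions for n = 1..4 with a brevity penalty; reference-translation-independent quality estimation cannot rely on this kind of matching at all, which is why separate QE models are needed.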
Related papers
- BiVert: Bidirectional Vocabulary Evaluation using Relations for Machine Translation [4.651581292181871]
We propose a bidirectional semantic-based evaluation method designed to assess the sense distance of the translation from the source text.
This approach employs the comprehensive multilingual encyclopedic dictionary BabelNet.
Factual analysis shows a strong correlation between the average evaluation scores generated by our method and the human assessments across various machine translation systems for the English-German language pair.
arXiv Detail & Related papers (2024-03-06T08:02:21Z)
- Convergences and Divergences between Automatic Assessment and Human Evaluation: Insights from Comparing ChatGPT-Generated Translation and Neural Machine Translation [1.6982207802596105]
This study investigates the convergences and divergences between automated metrics and human evaluation.
To perform automatic assessment, four automated metrics are employed, while human evaluation incorporates the DQF-MQM error typology and six rubrics.
Results underscore the indispensable role of human judgment in evaluating the performance of advanced translation tools.
arXiv Detail & Related papers (2024-01-10T14:20:33Z)
- The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
- Knowledge-Prompted Estimator: A Novel Approach to Explainable Machine Translation Assessment [20.63045120292095]
Cross-lingual Machine Translation (MT) quality estimation plays a crucial role in evaluating translation performance.
GEMBA, the first MT quality assessment metric based on Large Language Models (LLMs), employs one-step prompting to achieve state-of-the-art (SOTA) in system-level MT quality estimation.
In this paper, we introduce the Knowledge-Prompted Estimator (KPE), a CoT prompting method that combines three one-step prompting techniques, including perplexity, token-level similarity, and sentence-level similarity.
arXiv Detail & Related papers (2023-06-13T01:18:32Z)
- FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation [64.9546787488337]
We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation.
The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese.
arXiv Detail & Related papers (2022-10-01T05:02:04Z)
- Measuring Uncertainty in Translation Quality Evaluation (TQE) [62.997667081978825]
This work investigates how to correctly estimate confidence intervals (Brown et al., 2001) depending on the sample size of the translated text.
The methodology applied in this work draws on Bernoulli Statistical Distribution Modelling (BSDM) and Monte Carlo Sampling Analysis (MCSA).
arXiv Detail & Related papers (2021-11-15T12:09:08Z)
- Translation Quality Assessment: A Brief Survey on Manual and Automatic Methods [9.210509295803243]
We present a high-level and concise survey of translation quality assessment (TQA) methods, including both manual judgement criteria and automated evaluation metrics.
We hope that this work will be an asset for both translation model researchers and quality assessment researchers.
arXiv Detail & Related papers (2021-05-05T18:28:10Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
(A minimal sketch of this metric-against-human correlation step appears after this list.)
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
- Unsupervised Quality Estimation for Neural Machine Translation [63.38918378182266]
Existing approaches require large amounts of expert-annotated data, computation, and time for training.
We devise an unsupervised approach to QE where no training or access to additional resources besides the MT system itself is required.
We achieve very good correlation with human judgments of quality, rivalling state-of-the-art supervised QE models.
arXiv Detail & Related papers (2020-05-21T12:38:06Z)
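Several of the papers above (e.g. Tangled up in BLEU, ENIGMA, and the unsupervised QE work) judge an automatic metric by how strongly its scores correlate with human judgements. The sketch below shows that meta-evaluation step in its simplest form, assuming paired metric and human scores are already available for a handful of systems; the numbers and variable names are invented purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Invented system-level scores: an automatic metric vs. human adequacy ratings
# for five hypothetical MT systems.
metric_scores = [0.31, 0.42, 0.38, 0.55, 0.47]
human_scores = [62.0, 71.5, 68.0, 80.2, 74.9]

print("Pearson r:", round(pearson(metric_scores, human_scores), 3))
```

In practice, segment-level meta-evaluation often uses Kendall's tau rather than Pearson's r, and the result depends on which translations are used for the comparison, which is the sensitivity that Tangled up in BLEU examines.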
This list is automatically generated from the titles and abstracts of the papers on this site.