Knowledge-Prompted Estimator: A Novel Approach to Explainable Machine
Translation Assessment
- URL: http://arxiv.org/abs/2306.07486v1
- Date: Tue, 13 Jun 2023 01:18:32 GMT
- Title: Knowledge-Prompted Estimator: A Novel Approach to Explainable Machine
Translation Assessment
- Authors: Hao Yang, Min Zhang, Shimin Tao, Minghan Wang, Daimeng Wei, Yanfei
Jiang
- Abstract summary: Cross-lingual Machine Translation (MT) quality estimation plays a crucial role in evaluating translation performance.
GEMBA, the first MT quality assessment metric based on Large Language Models (LLMs), employs one-step prompting to achieve state-of-the-art (SOTA) in system-level MT quality estimation.
In this paper, we introduce Knowledge-Prompted Estimator (KPE), a CoT prompting method that combines three one-step prompting techniques, including perplexity, token-level similarity, and sentence-level similarity.
- Score: 20.63045120292095
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cross-lingual Machine Translation (MT) quality estimation plays a crucial
role in evaluating translation performance. GEMBA, the first MT quality
assessment metric based on Large Language Models (LLMs), employs one-step
prompting to achieve state-of-the-art (SOTA) in system-level MT quality
estimation; however, it lacks segment-level analysis. In contrast,
Chain-of-Thought (CoT) prompting outperforms one-step prompting by offering
improved reasoning and explainability. In this paper, we introduce
Knowledge-Prompted Estimator (KPE), a CoT prompting method that combines three
one-step prompting techniques, including perplexity, token-level similarity,
and sentence-level similarity. This method attains enhanced performance for
segment-level estimation compared with previous deep learning models and
one-step prompting approaches. Furthermore, supplementary experiments on
word-level visualized alignment demonstrate that our KPE method significantly
improves token alignment compared with earlier models and provides better
interpretability for MT quality estimation. Code will be released upon
publication.
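Since the code has not yet been released, the following Python sketch is only an illustration of the recipe described above: compute the three knowledge signals and hand them to an LLM inside a chain-of-thought prompt. The model choices (GPT-2 for perplexity, LaBSE for similarity) and the prompt wording are assumptions made for this sketch, not the authors' implementation.

```python
# Illustrative sketch only: three knowledge signals and a CoT prompt that
# exposes them to an LLM. Model names and prompt wording are assumptions.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer, util

lm_tok = AutoTokenizer.from_pretrained("gpt2")                  # assumed LM for perplexity
lm = AutoModelForCausalLM.from_pretrained("gpt2")
encoder = SentenceTransformer("sentence-transformers/LaBSE")    # assumed multilingual encoder

def perplexity(text: str) -> float:
    """Perplexity of the translation under a causal language model."""
    ids = lm_tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss
    return math.exp(loss.item())

def sentence_similarity(src: str, hyp: str) -> float:
    """Cross-lingual cosine similarity between source and translation."""
    embs = encoder.encode([src, hyp], convert_to_tensor=True, normalize_embeddings=True)
    return util.cos_sim(embs[0], embs[1]).item()

def token_similarity(src: str, hyp: str) -> float:
    """Mean best-match cosine similarity between translation and source tokens."""
    src_vecs = encoder.encode(src.split(), convert_to_tensor=True, normalize_embeddings=True)
    hyp_vecs = encoder.encode(hyp.split(), convert_to_tensor=True, normalize_embeddings=True)
    return util.cos_sim(hyp_vecs, src_vecs).max(dim=1).values.mean().item()

def build_kpe_prompt(src: str, hyp: str) -> str:
    """Chain-of-thought prompt that feeds the three knowledge signals to an LLM."""
    return (
        f"Source: {src}\nTranslation: {hyp}\n"
        "Knowledge:\n"
        f"- Perplexity of the translation: {perplexity(hyp):.1f}\n"
        f"- Token-level similarity: {token_similarity(src, hyp):.2f}\n"
        f"- Sentence-level similarity: {sentence_similarity(src, hyp):.2f}\n"
        "Reason step by step about fluency and adequacy using the knowledge above, "
        "then give a quality score from 0 to 100."
    )

print(build_kpe_prompt("Der Hund schläft.", "The dog is sleeping."))
```

The string returned by build_kpe_prompt would then be sent to an LLM, which reasons over the three signals before emitting a segment-level score.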
Related papers
- BEExAI: Benchmark to Evaluate Explainable AI [0.9176056742068812]
We propose BEExAI, a benchmark tool that allows large-scale comparison of different post-hoc XAI methods.
We argue that the need for a reliable way of measuring the quality and correctness of explanations is becoming critical.
arXiv Detail & Related papers (2024-07-29T11:21:17Z)
- BiVert: Bidirectional Vocabulary Evaluation using Relations for Machine Translation [4.651581292181871]
We propose a bidirectional semantic-based evaluation method designed to assess the sense distance of the translation from the source text.
This approach employs the comprehensive multilingual encyclopedic dictionary BabelNet.
Factual analysis shows a strong correlation between the average evaluation scores generated by our method and the human assessments across various machine translation systems for the English-German language pair.
arXiv Detail & Related papers (2024-03-06T08:02:21Z)
- The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
- Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models [57.80514758695275]
Using large language models (LLMs) to assess the quality of machine translation (MT) achieves state-of-the-art performance at the system level.
We propose a new prompting method called Error Analysis Prompting (EAPrompt).
This technique emulates the commonly accepted human evaluation framework, Multidimensional Quality Metrics (MQM), and produces explainable and reliable MT evaluations at both the system and segment level (a rough prompt sketch follows this entry).
arXiv Detail & Related papers (2023-03-24T05:05:03Z)
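EAPrompt's exact template is not reproduced in this listing, so the snippet below is only a rough approximation of an MQM-style error-analysis prompt: the LLM is asked to enumerate and classify errors before any score is derived, and the weighting (-5 per major error, -1 per minor error) follows a common MQM convention rather than the paper's exact scheme.

```python
# Rough approximation of an MQM-style error-analysis prompt; EAPrompt's actual
# template and scoring rules may differ.
def error_analysis_prompt(src: str, hyp: str) -> str:
    return (
        "You are an expert translation evaluator.\n"
        f"Source: {src}\n"
        f"Translation: {hyp}\n"
        "Step 1: List every translation error, one per line, in the form "
        "'<major|minor>: <error span> - <error category>'.\n"
        "Step 2: Count the major and minor errors and finish with a line "
        "'major=<n> minor=<m>'."
    )

def mqm_like_score(num_major: int, num_minor: int) -> int:
    """Turn the error counts parsed from the LLM's answer into a segment score."""
    return -5 * num_major - 1 * num_minor
```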
- Discourse Cohesion Evaluation for Document-Level Neural Machine Translation [36.96887050831173]
It is well known that translations generated by an excellent document-level neural machine translation (NMT) model are consistent and coherent.
Existing sentence-level evaluation metrics like BLEU can hardly reflect the model's performance at the document level.
We propose a new test suite that considers four cohesive manners to measure the cohesiveness of document translations.
arXiv Detail & Related papers (2022-08-19T01:56:00Z)
- An Overview on Machine Translation Evaluation [6.85316573653194]
Machine translation (MT) has become one of the important tasks in AI research and development.
The task of MT evaluation is not only to assess the quality of machine translation, but also to give timely feedback to machine translation researchers.
This report mainly covers a brief history of machine translation evaluation (MTE), a classification of research methods on MTE, and the cutting-edge progress in the field.
arXiv Detail & Related papers (2022-02-22T16:58:28Z)
- QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization [116.56171113972944]
We show that carefully choosing the components of a QA-based metric is critical to performance.
Our solution improves upon the best-performing entailment-based metric and achieves state-of-the-art performance (a simplified sketch of the QA-based idea follows this entry).
arXiv Detail & Related papers (2021-12-16T00:38:35Z)
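As a simplified illustration of the QA-based idea (QAFactEval's full pipeline also covers answer selection, question generation, and answerability filtering, none of which is reproduced here), the sketch below re-answers questions derived from the summary against the source document and measures answer agreement; the QA model name is an assumption.

```python
# Simplified QA-based consistency check: answers stated in the summary should be
# recoverable from the source document.
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")  # assumed QA model

def consistency_score(source: str, qa_pairs: list[tuple[str, str]]) -> float:
    """qa_pairs: (question, answer as stated in the summary)."""
    matches = 0
    for question, summary_answer in qa_pairs:
        source_answer = qa(question=question, context=source)["answer"]
        # Crude string containment; QAFactEval uses a learned answer-overlap metric.
        matches += int(summary_answer.lower() in source_answer.lower()
                       or source_answer.lower() in summary_answer.lower())
    return matches / len(qa_pairs)

score = consistency_score(
    source="The company reported revenue of 3 billion dollars in 2021.",
    qa_pairs=[("How much revenue did the company report?", "3 billion dollars")],
)
```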
- On Learning Text Style Transfer with Direct Rewards [101.97136885111037]
Lack of parallel corpora makes it impossible to directly train supervised models for the text style transfer task.
We leverage semantic similarity metrics originally used for fine-tuning neural machine translation models.
Our model provides significant gains in both automatic and human evaluation over strong baselines.
arXiv Detail & Related papers (2020-10-24T04:30:02Z)
- Unsupervised Quality Estimation for Neural Machine Translation [63.38918378182266]
Existing approaches require large amounts of expert annotated data, computation and time for training.
We devise an unsupervised approach to QE where no training or access to additional resources besides the MT system itself is required.
We achieve very good correlation with human judgments of quality, rivalling state-of-the-art supervised QE models (a glass-box sketch follows this entry).
arXiv Detail & Related papers (2020-05-21T12:38:06Z)
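A minimal glass-box sketch of this idea, assuming a MarianMT model from the Hugging Face hub: the quality signal is simply the average log-probability the MT model assigns to the candidate translation. The paper also uses indicators such as softmax entropy and Monte Carlo dropout, which are omitted here.

```python
# Glass-box unsupervised QE: score a translation by the average token
# log-probability the MT model itself assigns to it. Model name is an assumption.
import torch
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-de-en"        # assumed German-English MT model
tok = MarianTokenizer.from_pretrained(name)
mt = MarianMTModel.from_pretrained(name)

def glass_box_confidence(src: str, hyp: str) -> float:
    """Average log-probability the MT model assigns to the hypothesis tokens."""
    enc = tok(src, return_tensors="pt")
    labels = tok(text_target=hyp, return_tensors="pt").input_ids
    with torch.no_grad():
        out = mt(**enc, labels=labels)     # teacher-forced pass over the hypothesis
    logprobs = torch.log_softmax(out.logits, dim=-1)
    token_logprob = logprobs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return token_logprob.mean().item()     # higher means the model is more confident

print(glass_box_confidence("Der Hund schläft.", "The dog is sleeping."))
```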
- On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity.
In this paper, we concern ourselves with reference-free machine translation (MT) evaluation, where we directly compare source texts to (sometimes low-quality) system translations.
We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER.
We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
arXiv Detail & Related papers (2020-05-03T22:10:23Z)
- Revisiting Round-Trip Translation for Quality Estimation [0.0]
Quality estimation (QE) is the task of automatically evaluating the quality of translations without human-translated references.
In this paper, we apply semantic embeddings to RTT-based QE.
Our method achieves the highest correlations with human judgments, compared with previous WMT 2019 quality estimation metric task submissions (a minimal sketch follows this entry).
arXiv Detail & Related papers (2020-04-29T03:20:22Z)
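A minimal sketch of the round-trip idea, assuming a MarianMT back-translator and a multilingual sentence encoder (both assumptions for illustration): the MT output is translated back into the source language and compared with the original source by embedding cosine similarity rather than surface overlap.

```python
# RTT-based QE with semantic embeddings: back-translate the MT output into the
# source language and measure how close it stays to the original source.
from transformers import MarianMTModel, MarianTokenizer
from sentence_transformers import SentenceTransformer, util

bt_name = "Helsinki-NLP/opus-mt-en-de"   # assumed English-German back-translator
bt_tok = MarianTokenizer.from_pretrained(bt_name)
bt_model = MarianMTModel.from_pretrained(bt_name)
encoder = SentenceTransformer("sentence-transformers/LaBSE")  # assumed encoder

def rtt_qe(source_de: str, hypothesis_en: str) -> float:
    """Similarity between the source and the round-trip translation of the hypothesis."""
    batch = bt_tok(hypothesis_en, return_tensors="pt")
    round_trip = bt_tok.decode(bt_model.generate(**batch)[0], skip_special_tokens=True)
    embs = encoder.encode([source_de, round_trip], convert_to_tensor=True,
                          normalize_embeddings=True)
    return util.cos_sim(embs[0], embs[1]).item()

print(rtt_qe("Der Hund schläft.", "The dog is sleeping."))
```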