Automatic Evaluation Metrics for Document-level Translation: Overview, Challenges and Trends
- URL: http://arxiv.org/abs/2504.14804v1
- Date: Mon, 21 Apr 2025 02:08:42 GMT
- Title: Automatic Evaluation Metrics for Document-level Translation: Overview, Challenges and Trends
- Authors: Jiaxin GUO, Xiaoyu Chen, Zhiqiang Rao, Jinlong Yang, Zongyao Li, Hengchao Shang, Daimeng Wei, Hao Yang
- Abstract summary: The paper first introduces the development status of document-level translation and the importance of evaluation. It then provides a detailed analysis of the current state of automatic evaluation schemes and metrics. The paper explores the challenges faced by current evaluation methods, such as the lack of reference diversity, dependence on sentence-level alignment information, and the bias, inaccuracy, and lack of interpretability of the LLM-as-a-judge method.
- Score: 12.73291001580361
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: With the rapid development of deep learning technologies, the field of machine translation has witnessed significant progress, especially with the advent of large language models (LLMs) that have greatly propelled the advancement of document-level translation. However, accurately evaluating the quality of document-level translation remains an urgent issue. This paper first introduces the development status of document-level translation and the importance of evaluation, highlighting the crucial role of automatic evaluation metrics in reflecting translation quality and guiding the improvement of translation systems. It then provides a detailed analysis of the current state of automatic evaluation schemes and metrics, including evaluation methods with and without reference texts, as well as traditional metrics, model-based metrics, and LLM-based metrics. Subsequently, the paper explores the challenges faced by current evaluation methods, such as the lack of reference diversity, dependence on sentence-level alignment information, and the bias, inaccuracy, and lack of interpretability of the LLM-as-a-judge method. Finally, the paper looks ahead to the future trends in evaluation methods, including the development of more user-friendly document-level evaluation methods and more robust LLM-as-a-judge methods, and proposes possible research directions, such as reducing the dependency on sentence-level information, introducing multi-level and multi-granular evaluation approaches, and training models specifically for machine translation evaluation. This study aims to provide a comprehensive analysis of automatic evaluation for document-level translation and offer insights into future developments.
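To make the reference-based "traditional metric" category concrete, below is a minimal sketch (not taken from the paper) of scoring document-level output with BLEU: each document's sentences are concatenated so that the whole document, rather than isolated sentences, is the unit of evaluation. The function name, example data, and concatenation strategy are illustrative assumptions; only the third-party sacrebleu package and its corpus_bleu API are real.

```python
# Illustrative sketch: a traditional reference-based metric applied at the
# document level, assuming the `sacrebleu` package (pip install sacrebleu).
import sacrebleu


def doc_level_bleu(system_docs, reference_docs):
    """Concatenate each document's sentences and compute corpus BLEU over
    whole documents, so n-gram matches are not cut off at sentence joins."""
    sys_texts = [" ".join(doc) for doc in system_docs]
    ref_texts = [" ".join(doc) for doc in reference_docs]
    # corpus_bleu expects a list of hypotheses and a list of reference streams.
    return sacrebleu.corpus_bleu(sys_texts, [ref_texts]).score


# Hypothetical two-document example (one sentence list per document).
system = [["The committee met on Monday.", "It approved the budget."],
          ["She left early."]]
reference = [["The committee convened on Monday.", "It approved the budget."],
             ["She left early."]]
print(f"Document-level BLEU: {doc_level_bleu(system, reference):.2f}")
```

Model-based metrics (e.g., COMET-style regressors) and LLM-as-a-judge approaches discussed in the paper would replace the scoring function here, but the same design question remains: what counts as the unit of evaluation, and how much cross-sentence context the metric actually sees.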
Related papers
- Do LLMs Understand Your Translations? Evaluating Paragraph-level MT with Question Answering [68.3400058037817]
We introduce TREQA (Translation Evaluation via Question-Answering), a framework that extrinsically evaluates translation quality. We show that TREQA is competitive with and, in some cases, outperforms state-of-the-art neural and LLM-based metrics in ranking alternative paragraph-level translations.
arXiv Detail & Related papers (2025-04-10T09:24:54Z)
- Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy [52.261323452286554]
We introduce a method for contextual metric meta-evaluation by comparing the local metric accuracy of evaluation metrics.
Across translation, speech recognition, and ranking tasks, we demonstrate that the local metric accuracies vary both in absolute value and relative effectiveness as we shift across evaluation contexts.
arXiv Detail & Related papers (2025-03-25T16:42:25Z)
- Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks [3.773596042872403]
As large language models (LLMs) continue to evolve, the need for robust and standardized evaluation benchmarks becomes paramount.
Various frameworks have emerged as noteworthy contributions to the field, offering comprehensive evaluation tests and benchmarks.
This paper provides an exploration and critical analysis of some of these evaluation methodologies, shedding light on their strengths, limitations, and impact on advancing the state-of-the-art in natural language processing.
arXiv Detail & Related papers (2024-07-29T03:37:14Z)
- From Handcrafted Features to LLMs: A Brief Survey for Machine Translation Quality Estimation [20.64204462700532]
Machine Translation Quality Estimation (MTQE) is the task of estimating the quality of machine-translated text in real time without the need for reference translations.
This article provides a comprehensive overview of QE datasets, annotation methods, shared tasks, methodologies, challenges, and future research directions.
arXiv Detail & Related papers (2024-03-21T04:07:40Z)
- Is Context Helpful for Chat Translation Evaluation? [23.440392979857247]
We conduct a meta-evaluation of existing sentence-level automatic metrics to assess the quality of machine-translated chats.
We find that reference-free metrics lag behind reference-based ones, especially when evaluating translation quality in out-of-English settings.
We propose a new evaluation metric, Context-MQM, that utilizes bilingual context with a large language model.
arXiv Detail & Related papers (2024-03-13T07:49:50Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
- Lost in the Source Language: How Large Language Models Evaluate the Quality of Machine Translation [64.5862977630713]
This study investigates how Large Language Models (LLMs) leverage source and reference data in machine translation evaluation task.
We find that reference information significantly enhances the evaluation accuracy, while surprisingly, source information sometimes is counterproductive.
arXiv Detail & Related papers (2024-01-12T13:23:21Z)
- LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores [23.568883428947494]
We investigate whether prominent LM-based evaluation metrics demonstrate a favorable bias toward their respective underlying LMs in the context of summarization tasks.
Our findings unveil a latent bias, particularly pronounced when such evaluation metrics are used in a reference-free manner without leveraging gold summaries.
These results underscore that assessments provided by generative evaluation models can be influenced by factors beyond the inherent text quality.
arXiv Detail & Related papers (2023-11-16T10:43:26Z)
- The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
- An Overview on Machine Translation Evaluation [6.85316573653194]
Machine translation (MT) has become one of the important tasks in AI research and development.
The evaluation task of MT is not only to evaluate the quality of machine translation, but also to give timely feedback to machine translation researchers.
This report mainly includes a brief history of machine translation evaluation (MTE), the classification of research methods on MTE, and the cutting-edge progress.
arXiv Detail & Related papers (2022-02-22T16:58:28Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)