Align-then-Slide: A complete evaluation framework for Ultra-Long Document-Level Machine Translation
- URL: http://arxiv.org/abs/2509.03809v1
- Date: Thu, 04 Sep 2025 01:50:20 GMT
- Title: Align-then-Slide: A complete evaluation framework for Ultra-Long Document-Level Machine Translation
- Authors: Jiaxin Guo, Daimeng Wei, Yuanchang Luo, Xiaoyu Chen, Zhanglin Wu, Huan Yang, Hengchao Shang, Zongyao Li, Zhiqiang Rao, Jinlong Yang, Hao Yang
- Abstract summary: We introduce Align-then-Slide, a complete evaluation framework for ultra-long doc-mt. In the Align stage, we automatically infer sentence-level source-target correspondences and rebuild the target to match the source sentence number. In the n-Chunk Sliding Evaluate stage, we calculate averaged metric scores under 1-, 2-, 3- and 4-chunk for multi-granularity assessment.
- Score: 26.418216341998953
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) have ushered in a new era for document-level machine translation (doc-mt), yet their whole-document outputs challenge existing evaluation methods that assume sentence-by-sentence alignment. We introduce Align-then-Slide, a complete evaluation framework for ultra-long doc-mt. In the Align stage, we automatically infer sentence-level source-target correspondences and rebuild the target to match the source sentence number, resolving omissions and many-to-one/one-to-many mappings. In the n-Chunk Sliding Evaluate stage, we calculate averaged metric scores under 1-, 2-, 3- and 4-chunk for multi-granularity assessment. Experiments on the WMT benchmark show a Pearson correlation of 0.929 between our method and expert MQM rankings. On a newly curated real-world test set, our method again aligns closely with human judgments. Furthermore, preference data produced by Align-then-Slide enables effective CPO training and its direct use as a reward model for GRPO, both yielding translations preferred over a vanilla SFT baseline. The results validate our framework as an accurate, robust, and actionable evaluation tool for doc-mt systems.
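The second stage described in the abstract can be sketched as follows. This is a minimal, illustrative reading of the n-Chunk Sliding Evaluate stage, not the authors' implementation: it assumes the Align stage has already equalized the sentence counts, interprets "n-chunk" as a sliding window of n consecutive sentence pairs, and uses a placeholder `metric` callable where a real segment-level MT metric (e.g. COMET) would go.

```python
def n_chunk_scores(src_sents, tgt_sents, metric, max_n=4):
    """Average a segment-level metric over sliding windows of 1..max_n sentences.

    src_sents / tgt_sents: lists of aligned source and target sentences
    (equal length, as produced by the Align stage).
    metric: callable (src_text, tgt_text) -> float; a stand-in for any
    segment-level MT metric.
    """
    assert len(src_sents) == len(tgt_sents), "Align stage must equalize sentence counts"
    per_granularity = []
    # Cap n so we never form a window longer than the document.
    for n in range(1, min(max_n, len(src_sents)) + 1):
        window_scores = []
        for i in range(len(src_sents) - n + 1):
            # Concatenate n consecutive sentences into one chunk and score it.
            chunk_src = " ".join(src_sents[i:i + n])
            chunk_tgt = " ".join(tgt_sents[i:i + n])
            window_scores.append(metric(chunk_src, chunk_tgt))
        per_granularity.append(sum(window_scores) / len(window_scores))
    # Final score: mean across the 1-, 2-, ..., max_n-chunk granularities.
    return sum(per_granularity) / len(per_granularity)
```

Averaging across several window sizes is what gives the multi-granularity behavior: 1-chunk scoring rewards local sentence-level fidelity, while wider windows reward coherence across adjacent sentences.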
Related papers
- Extending Automatic Machine Translation Evaluation to Book-Length Documents [69.84659107448768]
SEGALE is an evaluation scheme that extends existing automatic metrics to long-document translation. Our approach enables previously unattainable document-level evaluation. Experiments show our scheme significantly outperforms existing long-form document evaluation schemes.
arXiv Detail & Related papers (2025-09-21T21:46:58Z) - HiMATE: A Hierarchical Multi-Agent Framework for Machine Translation Evaluation [38.67031685302134]
HiMATE is a Hierarchical Multi-Agent Framework for Machine Translation Evaluation. We develop a hierarchical multi-agent system grounded in the MQM error typology, enabling granular evaluation of subtype errors. Empirically, HiMATE outperforms competitive baselines across different datasets in conducting human-aligned evaluations.
arXiv Detail & Related papers (2025-05-22T06:24:08Z) - Fine-Grained and Multi-Dimensional Metrics for Document-Level Machine Translation [15.987448306012167]
Large language models (LLMs) have excelled in various NLP tasks, including machine translation (MT). This work investigates the inherent capability of instruction-tuned LLMs for document-level translation (docMT).
arXiv Detail & Related papers (2024-10-28T11:49:58Z) - MT-Ranker: Reference-free machine translation evaluation by inter-system ranking [14.188948302661933]
We show that MT-Ranker, trained without any human annotations, achieves state-of-the-art results on the WMT Shared Metrics Task benchmarks DARR20, MQM20, and MQM21.
MT-Ranker sets a new state of the art against both reference-free and reference-based baselines.
arXiv Detail & Related papers (2024-01-30T15:30:03Z) - Unify word-level and span-level tasks: NJUNLP's Participation for the WMT2023 Quality Estimation Shared Task [59.46906545506715]
We introduce the NJUNLP team to the WMT 2023 Quality Estimation (QE) shared task.
Our team submitted predictions for the English-German language pair on both sub-tasks.
Our models achieved the best results in English-German for both word-level and fine-grained error span detection sub-tasks.
arXiv Detail & Related papers (2023-09-23T01:52:14Z) - Evaluating Factual Consistency of Texts with Semantic Role Labeling [3.1776833268555134]
We introduce SRLScore, a reference-free evaluation metric designed with text summarization in mind.
A final factuality score is computed by an adjustable scoring mechanism.
Correlation with human judgments on English summarization datasets shows that SRLScore is competitive with state-of-the-art methods.
arXiv Detail & Related papers (2023-05-22T17:59:42Z) - Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models [57.80514758695275]
Using large language models (LLMs) for assessing the quality of machine translation (MT) achieves state-of-the-art performance at the system level.
We propose a new prompting method called Error Analysis Prompting (EAPrompt). This technique emulates the commonly accepted human evaluation framework, Multidimensional Quality Metrics (MQM), and produces explainable and reliable MT evaluations at both the system and segment level.
arXiv Detail & Related papers (2023-03-24T05:05:03Z) - Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z) - Alibaba-Translate China's Submission for WMT 2022 Quality Estimation Shared Task [80.22825549235556]
We present UniTE, our submission to the sentence-level MQM benchmark of the Quality Estimation Shared Task. Specifically, our systems employ the UniTE framework, which combines three types of input formats during training with a pre-trained language model.
Results show that our models reach 1st overall ranking in the Multilingual and English-Russian settings, and 2nd overall ranking in English-German and Chinese-English settings.
arXiv Detail & Related papers (2022-10-18T08:55:27Z) - FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation [64.9546787488337]
We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation.
The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese.
arXiv Detail & Related papers (2022-10-01T05:02:04Z) - Few-Shot Document-Level Relation Extraction [0.0]
We present FSDLRE, a few-shot document-level relation extraction benchmark.
We argue that document-level corpora provide more realism, particularly regarding none-of-the-above (NOTA) distributions.
We adapt the state-of-the-art sentence-level method MNAV to the document-level and develop it further for improved domain adaptation.
arXiv Detail & Related papers (2022-05-04T13:16:19Z) - Using Context in Neural Machine Translation Training Objectives [23.176247496139574]
We present Neural Machine Translation (NMT) training using document-level metrics with batch-level documents.
We demonstrate that training is more robust for document-level metrics than with sequence metrics.
arXiv Detail & Related papers (2020-05-04T13:42:30Z) - Towards Making the Most of Context in Neural Machine Translation [112.9845226123306]
We argue that previous research did not make a clear use of the global context.
We propose a new document-level NMT framework that deliberately models the local context of each sentence.
arXiv Detail & Related papers (2020-02-19T03:30:00Z)