Fugu-MT Paper Translation (Abstract): XQ-MEval: A Dataset with Cross-lingual Parallel Quality for Benchmarking Translation Metrics

Paper abstract: XQ-MEval: A Dataset with Cross-lingual Parallel Quality for Benchmarking Translation Metrics

arxiv url: http://arxiv.org/abs/2604.14934v2
Date: Sun, 19 Apr 2026 06:01:25 GMT
Status: Translation complete
Last updated in system: 2026-04-21 13:51:31.187144
Title: XQ-MEval: A Dataset with Cross-lingual Parallel Quality for Benchmarking Translation Metrics
Title (reference translation): XQ-MEval: A Cross-lingual Parallel-Quality Dataset for Benchmarking Translation Metrics
Authors: Jingxuan Liu, Zhi Qu, Jin Tei, Hidetaka Kamigaito, Lemao Liu, Taro Watanabe
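Abstract summary: We propose XQ-MEval, a semi-automatically built dataset covering nine translation directions. We automatically inject MQM-defined errors into gold translations, have native speakers filter them for reliability, and merge errors to generate pseudo translations with controllable quality. Using XQ-MEval, we reveal the inconsistency between averaging and human judgment.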
Reference score (independently computed attention score): 64.77152900881724
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Automatic evaluation metrics are essential for building multilingual translation systems. The common practice of evaluating these systems is averaging metric scores across languages, yet this is suspicious since metrics may suffer from cross-lingual scoring bias, where translations of equal quality receive different scores across languages. This problem has not been systematically studied because no benchmark exists that provides parallel-quality instances across languages, and expert annotation is not realistic. In this work, we propose XQ-MEval, a semi-automatically built dataset covering nine translation directions, to benchmark translation metrics. Specifically, we inject MQM-defined errors into gold translations automatically, filter them by native speakers for reliability, and merge errors to generate pseudo translations with controllable quality. These pseudo translations are then paired with corresponding sources and references to form triplets used in assessing the qualities of translation metrics. Using XQ-MEval, our experiments on nine representative metrics reveal the inconsistency between averaging and human judgment and provide the first empirical evidence of cross-lingual scoring bias. Finally, we propose a normalization strategy derived from XQ-MEval that aligns score distributions across languages, improving the fairness and reliability of multilingual metric evaluation.
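Abstract (reference translation): Automatic evaluation metrics are essential for building multilingual translation systems. The common practice for evaluating these systems is to average metric scores across languages, but this is questionable because metrics may suffer from cross-lingual scoring bias, in which translations of the same quality receive different scores in different languages. This problem has not been studied systematically, because no benchmark provides parallel-quality instances across languages and expert annotation is not realistic. In this work, we propose XQ-MEval, a semi-automatically built dataset covering nine translation directions. Specifically, we automatically inject MQM-defined errors into gold translations, have native speakers filter them for reliability, and merge errors to generate pseudo translations with controllable quality. These pseudo translations are paired with the corresponding sources and references to form triplets used to assess the quality of translation metrics. Using XQ-MEval, we reveal the inconsistency between averaging and human judgment and provide the first empirical evidence of cross-lingual scoring bias. Finally, we propose a normalization strategy based on XQ-MEval.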
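Illustration (not from the paper): the abstract says only that the normalization strategy aligns score distributions across languages and does not give its form. The sketch below substitutes a plain per-language z-score as a stand-in, assuming a calibration set in which every language holds instances of matched quality. The function names, language-pair keys, and score values are all hypothetical.

```python
from statistics import mean, pstdev


def fit_language_stats(calibration_scores):
    """Return {language: (mean, std)} of metric scores on a calibration set.

    Assumption: each language's list scores instances of matched quality,
    so differences between languages reflect metric bias, not quality.
    """
    stats = {}
    for lang, scores in calibration_scores.items():
        mu = mean(scores)
        sigma = pstdev(scores)
        # Guard against a constant score list to avoid division by zero.
        stats[lang] = (mu, sigma if sigma > 0 else 1.0)
    return stats


def normalize(score, lang, stats):
    """Z-score a raw metric score using that language's calibration stats."""
    mu, sigma = stats[lang]
    return (score - mu) / sigma


def multilingual_average(system_scores, stats):
    """Average the per-language normalized scores of one system."""
    return mean([normalize(s, lang, stats) for lang, s in system_scores.items()])


# Hypothetical calibration scores: the metric scores en-ja lower than en-de
# on instances of the same quality.
calibration = {
    "en-de": [0.80, 0.70, 0.60, 0.50],
    "en-ja": [0.60, 0.50, 0.40, 0.30],
}
stats = fit_language_stats(calibration)

# Raw scores differ by 0.20, yet each sits at the same position within
# its own language's calibration distribution.
print(normalize(0.70, "en-de", stats))
print(normalize(0.50, "en-ja", stats))
```

In this toy setup, averaging raw scores would penalize a system for the en-ja direction even though both translations occupy the same relative position in their language's distribution; averaging the normalized scores removes that offset.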
