Fugu-MT paper translation (abstract): Beyond LLM-as-a-Judge: Deterministic Metrics for Multilingual Generative Text Evaluation

Paper overview: Beyond LLM-as-a-Judge: Deterministic Metrics for Multilingual Generative Text Evaluation

arxiv url: http://arxiv.org/abs/2604.05083v1
Date: Mon, 06 Apr 2026 18:36:54 GMT
Status: translation complete
Last updated in system: 2026-04-08 17:42:09.442889
Title: Beyond LLM-as-a-Judge: Deterministic Metrics for Multilingual Generative Text Evaluation
Title (reference translation): Beyond LLM-as-a-Judge: Deterministic Metrics for Multilingual Generative Text Evaluation
Authors: Firoj Alam, Gagan Bhatia, Sahinur Rahman Laskar, Shammur Absar Chowdhury,
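Abstract summary: Large language models (LLMs) are increasingly adopted as automated judges for evaluating generated text. We propose OmniScore, a family of complementary, deterministic learned metrics. We trained the models with large-scale synthetic supervision (~564k instances in 107 languages) and evaluated them using 8,617 manually annotated instances.
Reference score (independently computed attention score): 20.309826321619482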
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: While Large Language Models (LLMs) are increasingly adopted as automated judges for evaluating generated text, their outputs are often costly and highly sensitive to prompt design, language, and aggregation strategies, which severely limits reproducibility. To address these challenges, we propose OmniScore, a family of complementary, deterministic learned metrics developed using small (<1B-parameter) models. OmniScore approximates LLM-judge behavior while preserving the low latency and consistency of traditional model-based scoring. We trained the models with large-scale synthetic supervision (~564k instances in 107 languages) and evaluated them using 8,617 manually annotated instances. The OmniScore family supports reliable, multi-dimensional scores across a variety of settings, including reference-based, source-grounded, and hybrid evaluations. We evaluate these models across question answering (QA), translation, and summarization in 6 languages. Our results demonstrate that lightweight, deterministic learned metrics provide a highly practical and scalable alternative to frontier LLMs. Our models and datasets can be found at https://huggingface.co/collections/QCRI/omniscore
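Abstract (reference translation): Large language models (LLMs) are increasingly adopted as automated judges for evaluating generated text, but their outputs are costly and highly sensitive to prompt design, language, and aggregation strategies, which severely limits reproducibility. To address these challenges, we propose OmniScore, a family of complementary, deterministic learned metrics developed using small (<1B-parameter) models. OmniScore approximates LLM-judge behavior while retaining the low latency and consistency of traditional model-based scoring. We trained the models with large-scale synthetic supervision (~564k instances in 107 languages) and evaluated them using 8,617 manually annotated instances. The OmniScore family supports reliable, multi-dimensional scores in a variety of settings, including reference-based, source-grounded, and hybrid evaluation. We evaluate these models on question answering (QA), translation, and summarization in 6 languages. Our results show that lightweight, deterministic learned metrics offer a highly practical and scalable alternative to frontier LLMs. Our models and datasets are available at https://huggingface.co/collections/QCRI/omniscore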


Related papers