Fugu-MT Paper Translation (Abstract): Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese

Paper summary: Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese

arxiv url: http://arxiv.org/abs/2605.01630v1
Date: Sat, 02 May 2026 22:44:44 GMT
Status: Translation complete
Last updated in system: 2026-05-05 20:33:49.858274
Title: Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese
Title (reference translation): Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese
Authors: Roseval Malaquias Junior, Giovana Kerche Bonás, Thales Sales Almeida, Hugo Abonizio, Thiago Laitz, Ramon Pires, Marcos Piau, Celio Larcher, Rodrigo Nogueira
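Abstract summary: Prosa is the first real-user multi-turn Brazilian Portuguese chat benchmark. Under filtered rubric scoring the three judges agree on all 16 ranks, whereas under holistic scoring they agree on only 7 of 16. The authors release the benchmark and the filtering code so that future models can be evaluated under identical conditions.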
Reference score (independently computed attention score): 8.678622777553267
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Rankings produced by holistic LLM-as-a-judge scoring are sensitive to the bias of the chosen judge model. We show that switching to binary rubric scoring with multi-judge filtering removes this sensitivity: decomposing the judgement matters more than the judge model itself. To support this claim, we introduce Prosa, the first real user multi-turn Brazilian Portuguese chat benchmark: 1,000 WildChat conversations scored by three judges from three model families on 16 models. Under filtered rubric scoring the three judges agree on every one of the 16 ranks, whereas under holistic scoring they agree on only 7 of 16. Additionally, the rubric filtering pipeline increases the average score gap between neighbouring models by 47%, thereby improving Prosa's discriminative power. Evaluating a new model on Prosa costs approximately $2.1 when using Gemini 3 Flash as the judge. We release the benchmark and the filtering code to ensure that future models can be assessed under identical conditions. These artifacts also make our rubric-based scoring method reusable beyond Prosa, supporting other open-ended evaluation settings.
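Abstract (reference translation): Rankings produced by holistic LLM-as-a-judge scoring are sensitive to the bias of the chosen judge model. Switching to binary rubric scoring with multi-judge filtering removes this sensitivity, which shows that decomposing the judgement matters more than the judge model itself. To support this claim, we introduce Prosa, the first real-user multi-turn Brazilian Portuguese chat benchmark: 1,000 WildChat conversations scored by three judges from three model families on 16 models. Under filtered rubric scoring the three judges agree on all 16 ranks, whereas under holistic scoring they agree on only 7 of 16. In addition, the rubric filtering pipeline increases the average score gap between neighbouring models by 47%, improving Prosa's discriminative power. Evaluating a new model on Prosa costs approximately $2.1 when Gemini 3 Flash is used as the judge. We release the benchmark and the filtering code so that future models can be evaluated under identical conditions. These artifacts also make our rubric-based scoring method reusable beyond Prosa, supporting other open-ended evaluation settings.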
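
As a rough illustration of the two quantities the abstract reports (the number of rank positions on which all judges agree, and the average score gap between neighbouring models), the following Python sketch computes both from per-judge model scores. The function names, the data layout, and the toy scores are all assumptions made for illustration; none of them come from the Prosa release, and the paper's exact definitions may differ.

```python
from typing import Dict, List

# Hypothetical layout: model name -> mean score under one judge.
Scores = Dict[str, float]


def ranking(scores: Scores) -> List[str]:
    """Return model names sorted from best to worst (ties broken by name)."""
    return sorted(scores, key=lambda m: (-scores[m], m))


def ranks_in_agreement(per_judge: Dict[str, Scores]) -> int:
    """Count rank positions where every judge places the same model."""
    rankings = [ranking(s) for s in per_judge.values()]
    return sum(1 for position in zip(*rankings) if len(set(position)) == 1)


def mean_neighbour_gap(scores: Scores) -> float:
    """Average score difference between adjacent models in the ranking."""
    ordered = sorted(scores.values(), reverse=True)
    if len(ordered) < 2:
        raise ValueError("need at least two models to compute a gap")
    gaps = [a - b for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


# Toy data (invented for this sketch): three judges scoring three models.
toy = {
    "judge_a": {"m1": 0.80, "m2": 0.70, "m3": 0.50},
    "judge_b": {"m1": 0.75, "m2": 0.72, "m3": 0.40},
    "judge_c": {"m1": 0.60, "m2": 0.65, "m3": 0.30},
}

print(ranks_in_agreement(toy))             # judges agree only on the last rank
print(mean_neighbour_gap(toy["judge_a"]))  # mean of the two adjacent gaps
```

In this toy case the third judge swaps the top two models, so only the bottom rank is shared by all judges. A larger mean neighbour gap means adjacent models are easier to tell apart, which is the sense in which the abstract speaks of discriminative power.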

Related papers