Fugu-MT Paper Translation (Abstract): Grading the Grader: Lessons from Evaluating an Agentic Data Analysis System

Paper overview: Grading the Grader: Lessons from Evaluating an Agentic Data Analysis System

arxiv url: http://arxiv.org/abs/2606.24839v1
Date: Tue, 23 Jun 2026 17:18:28 GMT
Status: translation complete
Last updated in system: 2026-06-24 22:16:49.130481
Title: Grading the Grader: Lessons from Evaluating an Agentic Data Analysis System
Title (reference translation): Grading the Grader: Lessons Learned from Evaluating an Agentic Data Analysis System
Authors: Tian Zheng, Kai-Tai Hsu
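Abstract summary: Agentic data analysis systems produce rich outputs, including code, numerical results, and verbal diagnostics. It is therefore necessary to distinguish genuine disagreement between an agent's output and the ground-truth answer from grading artifacts. We apply LAMBDA, a multi-agent data analysis system, to 153 numerical QRData tasks from DSGym and investigate how reliably automated graders assess such a system and which strategies improve grading quality.
Reference score (independently computed attention score): 1.8463920355489722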
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Agentic data analysis systems produce rich outputs, including code, numerical results, and verbal diagnostics. This makes them more challenging to evaluate than single-turn LLM responses. It is therefore necessary to distinguish genuine disagreement between an agent's output and a ground-truth answer from grading artifacts. We investigate how reliably automated graders assess such a system and what strategies improve grading quality by applying LAMBDA, a multi-agent data-analysis system, on 153 numerical QRData tasks from DSGym. We develop and evaluate a three-layer human-AI grading cascade: strict regex matching, LLM-based lenient grading, and snippet-based human inspection, which combines non-GenAI and GenAI strategies with different failure profiles. Both automated graders achieve 100% observed precision (0/70 false positives). The lenient grader's recall is 97% against human labels. A keyword-anchored extraction pipeline raises the strict grader's recall by 60 percentage points over a last-number heuristic; the lenient grader is architecturally parser-independent. An iterative nudge mechanism raises grading run success from 36% to 97% and lenient-pass rates from 16% to 46%; comparing nudging with and without original-question re-injection shows that re-injection offers no benefit, confirming the nudge as an answer template cue. We further observe in this case study that variable type is the task metadata field most consistently associated with grading pipeline dynamics and observed outcome grades.
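Abstract (reference translation): Agentic data analysis systems produce rich outputs, including code, numerical results, and verbal diagnostics. This makes them harder to evaluate than single-turn LLM responses. It is therefore necessary to distinguish genuine disagreement between an agent's output and the ground-truth answer from grading artifacts. We apply LAMBDA, a multi-agent data analysis system, to 153 numerical QRData tasks from DSGym and investigate how reliably automated graders assess such a system and which strategies improve grading quality. We develop and evaluate a three-layer human-AI grading cascade that combines non-GenAI and GenAI strategies with different failure profiles: strict regex matching, LLM-based lenient grading, and snippet-based human inspection. Both automated graders achieve 100% observed precision (0/70 false positives). The lenient grader's recall is 97% against human labels. A keyword-anchored extraction pipeline raises the strict grader's recall by 60 percentage points over a last-number heuristic. An iterative nudge mechanism raises grading run success from 36% to 97% and lenient-pass rates from 16% to 46%. In this case study, we observe that variable type is the task metadata field most consistently associated with grading pipeline dynamics and observed outcome grades.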
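The abstract contrasts a last-number heuristic with keyword-anchored extraction for the strict grader. The following is a minimal Python sketch of that contrast, not the authors' implementation: the number pattern, the anchor keywords ("final answer", "answer"), the 1% relative tolerance, and the sample agent output are all assumptions made for illustration, since the abstract does not specify them.

```python
import re

# Hypothetical pattern for a signed decimal number with optional exponent.
NUMBER = r"[-+]?(?:\d+\.\d+|\d+|\.\d+)(?:[eE][-+]?\d+)?"


def last_number(text):
    """Last-number heuristic: take the final numeric token in the output."""
    matches = re.findall(NUMBER, text)
    return float(matches[-1]) if matches else None


def keyword_anchored(text, keywords=("final answer", "answer")):
    """Take the first number that directly follows an anchor keyword.

    The keyword list is an assumption; the paper's anchors are not given
    in the abstract.
    """
    for kw in keywords:
        pattern = re.escape(kw) + r"\s*(?:is|=|:)?\s*(" + NUMBER + ")"
        m = re.search(pattern, text, re.IGNORECASE)
        if m:
            return float(m.group(1))
    return None


def strict_match(predicted, truth, rel_tol=1e-2):
    """Strict grade: pass only if a number was extracted and is within tolerance."""
    if predicted is None:
        return False
    return abs(predicted - truth) <= rel_tol * max(abs(truth), 1e-12)


# Made-up agent output: the answer is stated first, other numbers follow.
output = "Final answer: 0.42. The model was fit on 153 rows with seed 7."
truth = 0.42

print(strict_match(last_number(output), truth))       # heuristic picks 7.0
print(strict_match(keyword_anchored(output), truth))  # anchor picks 0.42
```

In this sketch the last-number heuristic fails whenever the agent mentions any number after its answer, while the anchored extractor ignores trailing numbers. That is one plausible mechanism behind the recall difference the abstract reports, though the abstract itself does not break the gain down by cause.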

Related papers