Fugu-MT Paper Translation (Abstract): LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding

Paper abstract: LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding

arxiv url: http://arxiv.org/abs/2604.27727v1
Date: Thu, 30 Apr 2026 11:20:22 GMT
Status: Translation complete
Last updated in system: 2026-05-01 16:31:54.06242
Title: LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding
Title (reference translation): LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding
Authors: Md Faizul Ibne Amin, Yutaka Watanobe, Daniel M. Muepu, Haruto Suzuki, Kenta Nanaumi, Md Mostafizer Rahman
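Abstract summary: A rubric-driven LLM-as-a-Judge framework is presented for contest-style human-AI co-creation. Co-creation success is found to concentrate early, with Success-at-Turn rising to 0.8533 at the first observed turn. On the judging side, the best held-out scores reach 0.5937 for ROC-AUC, 0.6904 for PR-AUC, and 0.5000 for test MCC.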
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LLMs are increasingly employed both as judges for evaluating open-ended outputs and as co-creation partners in AI-assisted programming; yet rigorous evaluation in human-AI co-creation settings remains underdeveloped, as judgments must be reliable, comparable across models, and interpretable over multi-turn interaction. To address this gap, a rubric-driven LLM-as-a-Judge framework is presented for contest-style human-AI co-creation in coding and software engineering (SE). The framework is built around schema-constrained judge outputs, validation and repair mechanisms, data splits grouped by user and problem to prevent trajectory leakage, and participant-level NONBLIND context. Multiple LLM judges are assessed through a multi-metric protocol covering discrimination (ROC-AUC, PR-AUC), thresholded decision quality (MCC), probabilistic reliability (LogLoss, Brier score, ECE), and inter-judge agreement (Cohen's and Fleiss' κ). Human-AI co-creation is further examined through trajectory-level signals, including turn-wise confidence, Success-at-Turn, time-to-success, revision churn, and CodeBLEU. Co-creation success is found to concentrate early, with Success-at-Turn rising to 0.8533 at the first observed turn and stabilizing at 0.8641 by turn 6. Revision behavior, however, remains heterogeneous, suggesting that productive progress can emerge through either incremental refinement or broader restructuring. On the judging side, the best held-out scores reach 0.5937 for ROC-AUC, 0.6904 for PR-AUC, and 0.5000 for test MCC, while inter-judge consistency remains modest overall (mean pairwise Cohen's κ = 0.1592, Fleiss' κ = 0.0696). Taken together, this work offers an auditable and reproducible evaluation methodology that links reliability-aware LLM judging with trajectory-based analysis of human-AI co-creation, providing a practical evaluation template for future AI-assisted coding and SE.
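Abstract (reference translation): LLMs are employed both as judges for evaluating open-ended outputs and as co-creation partners in AI-assisted programming, yet rigorous evaluation in human-AI co-creation settings remains underdeveloped, because judgments must be reliable, comparable across models, and interpretable over multi-turn interaction. To address this gap, a rubric-driven LLM-as-a-Judge framework is presented for contest-style human-AI co-creation in coding and software engineering (SE). The framework is built around schema-constrained judge outputs, validation and repair mechanisms, data splits grouped by user and problem to prevent trajectory leakage, and participant-level NONBLIND context. Multiple LLM judges are assessed through a multi-metric protocol covering discrimination (ROC-AUC, PR-AUC), thresholded decision quality (MCC), probabilistic reliability (LogLoss, Brier score, ECE), and inter-judge agreement (Cohen's and Fleiss' κ). Human-AI co-creation is further examined through trajectory-level signals, including turn-wise confidence, Success-at-Turn, time-to-success, revision churn, and CodeBLEU. Co-creation success concentrates early, with Success-at-Turn rising to 0.8533 at the first observed turn and stabilizing at 0.8641 by turn 6. Revision behavior, however, remains heterogeneous, suggesting that productive progress can emerge through either incremental refinement or broader restructuring. On the judging side, the scores reach 0.5937 for ROC-AUC, 0.6904 for PR-AUC, and 0.5000 for test MCC. This work offers an auditable and reproducible evaluation methodology that links reliability-aware LLM judging with trajectory-based analysis of human-AI co-creation, providing a practical evaluation template for future AI-assisted coding and SE.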
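Illustrative note: the abstract names several metrics without defining them, so the following is only a minimal Python sketch of how four of them might be computed. The definitions are assumptions based on common usage, not the authors' implementation: Success-at-Turn is taken here as the cumulative fraction of trajectories that first succeed at or before a given turn, and ECE is the equal-width-bin variant for binary positive-class probabilities. All function names and the toy data are hypothetical.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def success_at_turn(first_success: Sequence[Optional[int]], turn: int) -> float:
    """Assumed definition: share of trajectories whose first success is at or before `turn`.

    `first_success[i]` is the 1-based turn of the first success, or None if never solved.
    """
    if not first_success:
        raise ValueError("need at least one trajectory")
    hits = sum(1 for t in first_success if t is not None and t <= turn)
    return hits / len(first_success)


def brier_score(y_true: Sequence[int], p_pred: Sequence[float]) -> float:
    """Mean squared error between predicted probabilities and binary labels."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def expected_calibration_error(
    y_true: Sequence[int], p_pred: Sequence[float], n_bins: int = 10
) -> float:
    """ECE with equal-width bins over the predicted positive-class probability."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_pred):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)   # empirical positive rate
        predicted = sum(p for _, p in members) / len(members)  # mean predicted probability
        ece += len(members) / n * abs(observed - predicted)
    return ece


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two judges' labels on the same items."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_expected = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    if p_expected == 1.0:  # both judges always emit the same single label
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the paper.
    first_success_turns = [1, 1, 3, None]
    print(success_at_turn(first_success_turns, 1))
    print(success_at_turn(first_success_turns, 6))
    print(cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0]))
    print(brier_score([1, 0], [0.5, 0.5]))
```

In practice a library such as scikit-learn would likely be used for ROC-AUC, PR-AUC, MCC, and κ; the pure-Python version above is only meant to make the quantities concrete.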


Related papers