Fugu-MT Paper Translation (Abstract): More than a Judge: An Empirical Study of Agent-Human Interaction in Crowdsourced Testing Assessment

Paper overview: More than a Judge: An Empirical Study of Agent-Human Interaction in Crowdsourced Testing Assessment

arxiv url: http://arxiv.org/abs/2606.06301v1
Date: Thu, 04 Jun 2026 15:39:05 GMT
Status: Translation complete
System update date: 2026-06-05 22:39:44.911025
Title: More than a Judge: An Empirical Study of Agent-Human Interaction in Crowdsourced Testing Assessment
Authors: Yue Wang, Yuan Zhao, Shengcheng Yu, Zhenyu Chen, Qing Gu
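Abstract summary: In prior work, the authors developed and validated a multi-agent assessment backbone based on the LLM-as-a-Judge paradigm. Reliable automated judging, however, does not by itself show whether agent outputs improve human work once they are embedded into a workflow. This paper investigates whether assessment-derived, actionable feedback improves how testers revise their reports, perform on later tasks, and transfer reporting practices across applications.
Reference score (attention score computed independently by the site): 16.700895092783266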
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Agentic AI is increasingly being integrated into software engineering workflows. In crowdsourced testing, however, the large volume and uneven quality of submitted reports still create a substantial review burden for developers. In prior work, we developed and validated a multi-agent assessment backbone based on the LLM-as-a-Judge paradigm. That backbone assesses reports along three dimensions--textuality, adequacy, and competitiveness--and was shown to align well with human consensus while substantially reducing assessment effort. Yet reliable automated judging does not by itself show whether agent outputs can improve human work when embedded into workflow. This paper studies that missing question in the context of crowdsourced testing. We investigate whether assessment-derived, actionable feedback can improve how testers revise reports, perform on later tasks, and transfer reporting practices across applications. To do so, we conducted a controlled four-stage human-subject study with 20 testers across three real-world applications. The results show that agent-generated feedback supports immediate improvements in revised reports, better first submissions on a new task after prior feedback exposure, and evidence of partial but meaningful transfer to a later application. A post-task questionnaire completed by 17 participants complements these artifact-based findings by suggesting that the feedback was generally understandable, acted upon in revision, and carried into later tasks, while also revealing remaining friction in specificity and execution. Overall, the study provides empirical evidence that, in the studied crowdsourced testing setting, assessment agents can serve not only as post-hoc judges but also as workflow-integrated feedback providers that support upstream report-quality improvement.
