Fugu-MT paper translation (abstract): Online Agent-as-a-Judge: Situation-Generating Evaluation for Interactive Agents

Paper summary: Online Agent-as-a-Judge: Situation-Generating Evaluation for Interactive Agents

arxiv url: http://arxiv.org/abs/2606.08200v1
Date: Sat, 06 Jun 2026 14:37:33 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:05.966144
Title: Online Agent-as-a-Judge: Situation-Generating Evaluation for Interactive Agents
Title (reference translation): Online Agent-as-a-Judge: Situation-Generating Evaluation of Interactive Agents
Authors: Hyogon Ryu, Jeonghwan Kim, Yewon Lim, Chaeun Lee, Jeongwook Kim, Donghoon Ham,
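Abstract summary: We propose Online Agent-as-a-Judge, a situation-generating evaluation framework for interactive social agents. Online Agent-as-a-Judge deploys an in-world evaluator agent that interacts with the target agent through the environment's native dialogue and action protocol. In a life-simulation environment with 32 designer-authored social criteria, Online Agent-as-a-Judge improves criteria coverage and agreement with human labels.
Reference score (independently computed attention measure): 7.750851374657493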
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluating LLM-powered interactive social agents is challenging because socially relevant behaviors depend not only on isolated outputs, but also on prior interactions, social roles, and downstream actions. Existing methods typically allow a target agent to act freely in an environment and then score the resulting trajectory. However, this passive setup can miss capabilities that only become observable under specific social circumstances; for example, conflict handling may remain untested if no disagreement arises. We propose Online Agent-as-a-Judge, a situation-generating evaluation framework for interactive social agents. Online Agent-as-a-Judge deploys an in-world evaluator agent that interacts with the target agent through the environment's native dialogue and action protocol, actively eliciting situations relevant to the evaluation criteria. The resulting trajectories provide evidence for assessing both immediate responses and subsequent behavior. In a life-simulation environment with $32$ designer-authored social criteria, Online Agent-as-a-Judge improves criteria coverage and agreement with human labels, yielding more reliable evidence-grounded evaluations of behaviors that passive methods can leave unobserved.
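Abstract (reference translation): Evaluating LLM-powered interactive social agents is difficult because socially relevant behavior depends not only on isolated outputs but also on prior interactions, social roles, and downstream actions. Existing methods typically let a target agent act freely in an environment and then score the resulting trajectory. This passive setup, however, can miss capabilities that become observable only under specific social circumstances; for example, conflict handling goes untested if no disagreement arises. We propose Online Agent-as-a-Judge, a situation-generating evaluation framework for interactive social agents. Online Agent-as-a-Judge deploys an in-world evaluator agent that interacts with the target agent through the environment's native dialogue and action protocol and actively elicits situations relevant to the evaluation criteria. The resulting trajectories provide evidence for assessing both immediate responses and subsequent behavior. In a life-simulation environment with 32 designer-authored social criteria, Online Agent-as-a-Judge improves criteria coverage and agreement with human labels, giving more reliable, evidence-grounded evaluations of behaviors that passive methods can leave unobserved.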
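The two quantities the abstract reports, criteria coverage and agreement with human labels, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Judgment` record, the criterion names, the toy labels, and the exact metric definitions (coverage as the share of criteria with observed evidence, agreement as the share of observed criteria whose verdict matches the human label) are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Judgment:
    """Hypothetical record of one criterion's evaluation."""
    criterion: str           # illustrative criterion name
    observed: bool           # did the trajectory contain relevant evidence?
    passed: Optional[bool]   # judge verdict; None when nothing was observed


def coverage(judgments: List[Judgment]) -> float:
    """Share of criteria for which relevant evidence was observed (assumed definition)."""
    return sum(j.observed for j in judgments) / len(judgments)


def agreement(judgments: List[Judgment], human: Dict[str, bool]) -> float:
    """Share of observed criteria whose verdict matches the human label (assumed definition)."""
    scored = [j for j in judgments if j.observed and j.criterion in human]
    if not scored:
        return 0.0
    return sum(j.passed == human[j.criterion] for j in scored) / len(scored)


# Toy human labels, invented for this example.
human_labels = {
    "conflict_handling": True,
    "greeting": True,
    "gift_reciprocity": False,
    "apology": True,
}

# Passive setup: no disagreement or gift exchange happened, so two criteria go untested.
passive = [
    Judgment("conflict_handling", False, None),
    Judgment("greeting", True, True),
    Judgment("gift_reciprocity", False, None),
    Judgment("apology", True, False),
]

# Situation-generating setup: an evaluator agent elicited every situation.
active = [
    Judgment("conflict_handling", True, True),
    Judgment("greeting", True, True),
    Judgment("gift_reciprocity", True, False),
    Judgment("apology", True, False),
]

print(coverage(passive), agreement(passive, human_labels))
print(coverage(active), agreement(active, human_labels))
```

In this toy data the passive run never produces a disagreement, so `conflict_handling` stays unobserved and lowers coverage, which mirrors the failure mode the abstract describes.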

Related papers