Fugu-MT Paper Translation (Abstract): OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

Paper overview: OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

arxiv url: http://arxiv.org/abs/2604.10866v2
Date: Thu, 16 Apr 2026 16:00:33 GMT
Status: Translation complete
System update date: 2026-04-17 16:09:14.150249
Title: OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
Title (reference translation): OccuBench: Evaluating AI Agents on Real-World Professional Tasks Through Language Environment Simulation
Authors: Xiaomeng Hu, Yinger Zhang, Fei Huang, Jianhong Tu, Yang Su, Lianghao Deng, Yuxuan Liu, Yantao Liu, Dayiheng Liu, Tsung-Yi Ho
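Abstract summary: OccuBench is a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains. Its multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity.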
Reference score (independently computed attention metric): 57.505743202759646
License: http://creativecommons.org/licenses/by/4.0/
Abstract: AI agents are expected to perform professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor safety monitoring to customs import processing), yet existing benchmarks can only evaluate agents in the few domains where public environments exist. We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments through LLM-driven tool response generation. Our multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains and environmental robustness under controlled fault injection (explicit errors, implicit data degradation, and mixed faults). We evaluate 15 frontier models across 8 model families and find that: (1) no single model dominates all industries, as each has a distinct occupational capability profile; (2) implicit faults (truncated data, missing fields) are harder than both explicit errors (timeouts, 500s) and mixed faults, because they lack overt error signals and require the agent to independently detect data degradation; (3) larger models, newer generations, and higher reasoning effort consistently improve performance. GPT-5.2 improves by 27.5 points from minimal to maximum reasoning effort; and (4) strong agents are not necessarily strong environment simulators. Simulator quality is critical for LES-based evaluation reliability. OccuBench provides the first systematic cross-industry evaluation of AI agents on professional occupational tasks.
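Abstract (reference translation): AI agents are expected to carry out professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor safety monitoring to customs import processing), but existing benchmarks can evaluate agents only in the few domains where public environments exist. OccuBench is a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains. It is made possible by Language Environment Simulators (LESs), which simulate domain-specific environments through LLM-driven tool response generation. The multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains, and environmental robustness under controlled fault injection (explicit errors, implicit data degradation, and mixed faults). The findings are: (1) no single model dominates all industries, because each model has a distinct occupational capability profile; (2) implicit faults (truncated data, missing fields) are harder than both explicit errors (timeouts, 500s) and mixed faults, because they carry no overt error signal and the agent must detect the data degradation on its own; (3) larger models, newer generations, and higher reasoning effort consistently improve performance, with GPT-5.2 improving by 27.5 points from minimal to maximum reasoning effort; and (4) strong agents are not necessarily strong environment simulators. Simulator quality is critical to the reliability of LES-based evaluation. OccuBench provides the first systematic cross-industry evaluation of AI agents on professional occupational tasks.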
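The three fault-injection modes named in the abstract (explicit errors, implicit data degradation, mixed faults) can be illustrated with a small wrapper around a tool function. This is only a sketch under stated assumptions, not the OccuBench implementation: the names `inject_faults`, `degrade`, and `lookup_patient`, the `rate` and `seed` parameters, and the specific degradation rule (drop one scalar field, halve every list) are all hypothetical and chosen for illustration.

```python
import copy
import random
from typing import Any, Callable, Dict

ToolFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def degrade(response: Dict[str, Any]) -> Dict[str, Any]:
    """Implicit fault (assumed rule): silently drop one field and truncate lists."""
    scalar_keys = sorted(k for k, v in response.items() if not isinstance(v, list))
    dropped = scalar_keys[-1] if scalar_keys else None
    degraded: Dict[str, Any] = {}
    for key, value in response.items():
        if key == dropped:
            continue  # missing field, with no error signal
        if isinstance(value, list):
            value = value[: len(value) // 2]  # truncated data
        degraded[key] = copy.deepcopy(value)
    return degraded


def inject_faults(tool: ToolFn, mode: str, rate: float, seed: int = 0) -> ToolFn:
    """Wrap a tool so that a fraction `rate` of calls return a faulty response.

    mode is one of "explicit", "implicit", "mixed" (hypothetical labels).
    """
    if mode not in {"explicit", "implicit", "mixed"}:
        raise ValueError(f"unknown fault mode: {mode}")
    rng = random.Random(seed)  # seeded so runs are reproducible

    def wrapped(args: Dict[str, Any]) -> Dict[str, Any]:
        response = tool(args)
        if rng.random() >= rate:
            return response  # no fault on this call
        kind = mode if mode != "mixed" else rng.choice(["explicit", "implicit"])
        if kind == "explicit":
            # Overt error signal the agent can see and retry on.
            return {"error": "HTTP 500: internal server error"}
        return degrade(response)

    return wrapped


def lookup_patient(args: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for a simulated triage tool."""
    return {
        "patient_id": args["patient_id"],
        "vitals": [98, 97, 99, 96],
        "triage_level": 2,
    }
```

In this sketch an explicit fault replaces the response with an error object, while an implicit fault returns a plausible-looking response that is missing `triage_level` and half of `vitals`. That difference is the one the abstract points to: the agent has nothing to react to unless it checks the data itself.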


Related papers