Fugu-MT Paper Translation (Abstract): ORAgentBench: Can LLM Agents Solve Challenging Operations Research Tasks End to End?

Paper overview: ORAgentBench: Can LLM Agents Solve Challenging Operations Research Tasks End to End?

arxiv url: http://arxiv.org/abs/2606.19787v1
Date: Thu, 18 Jun 2026 04:43:23 GMT
Status: Translation complete
Last updated in system: 2026-06-19 18:23:39.650686
Title: ORAgentBench: Can LLM Agents Solve Challenging Operations Research Tasks End to End?
Authors: Jiajun Li, Mingshu Cai, Yixuan Li, Yu Ding, Ran Hou, Guanyu Nie, Xiongwei Han, Wanyuan Wang
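Abstract summary: ORAgentBench is an execution-grounded benchmark for evaluating autonomous agents on operations research tasks. It contains 107 human-reviewed tasks across diverse operational scenarios, each packaged with a natural-language brief, multi-file data, configuration artifacts, and a required submission schema. Experiments with fourteen frontier agent-model configurations show that current agents remain far from reliable OR practice.
Reference score (attention measure computed independently by this site): 28.383940617377856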
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models are increasingly deployed as autonomous agents for multi-step tasks in executable environments, yet their ability to perform realistic operations research (OR) work remains unclear. Existing OR evaluations often decouple modeling from solving, rely on pre-formalized or text-only instances, and rarely test the full workflow from operational artifacts to validated decisions. In this work, we introduce ORAgentBench, an execution-grounded benchmark for evaluating autonomous agents on challenging end-to-end operations research tasks. It contains 107 human-reviewed tasks across diverse operational scenarios, each packaged in an isolated environment with a natural-language brief, multi-file data, configuration artifacts, and a required submission schema. Agents must write and run solution code, and their submissions are evaluated by hidden validators for schema validity, hard-constraint feasibility, and normalized objective quality. Experiments with fourteen frontier agent-model configurations show that current agents remain far from reliable OR practice. The best agent passes only 35.51% of all tasks and 20.59% of hard tasks, and many feasible submissions still fall below the required quality threshold. Failure analysis further shows that errors are dominated by strategic weaknesses, including missed operational rules, brittle formulations, weak feasible-solution construction, and insufficient solution improvement. OR-specific procedural skills increase hard-task feasibility, but do not reliably improve solution quality or pass rate. These results suggest that progress in OR agents requires moving beyond plausible optimization code toward dependable, high-quality operational decision-making.
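The three-stage evaluation described in the abstract (schema validity, then hard-constraint feasibility, then normalized objective quality against a threshold) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy knapsack task, the names `ITEMS`, `CAPACITY`, `REFERENCE_VALUE`, `QUALITY_THRESHOLD`, `Verdict`, and `validate`, the submission key `chosen`, and the 0.9 threshold. The benchmark's actual validators are hidden and may normalize quality differently.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    schema_ok: bool   # submission matches the required schema
    feasible: bool    # all hard constraints hold
    quality: float    # normalized objective quality; 0.0 if not feasible
    passed: bool      # feasible and quality meets the threshold


# Hypothetical toy task (not from the paper): pick items to maximize total
# value without exceeding a weight capacity.
ITEMS = {"a": (3, 4), "b": (4, 5), "c": (2, 3)}  # name -> (weight, value)
CAPACITY = 5
REFERENCE_VALUE = 7       # optimum for this toy instance: items a + c
QUALITY_THRESHOLD = 0.9   # assumed threshold, chosen only for illustration


def validate(submission) -> Verdict:
    """Check a submission in three stages, stopping at the first failure."""
    # Stage 1: schema validity. Expect {"chosen": [distinct known item names]}.
    chosen = submission.get("chosen") if isinstance(submission, dict) else None
    schema_ok = (
        isinstance(chosen, list)
        and all(isinstance(x, str) and x in ITEMS for x in chosen)
        and len(set(chosen)) == len(chosen)
    )
    if not schema_ok:
        return Verdict(False, False, 0.0, False)

    # Stage 2: hard-constraint feasibility. Total weight must fit the capacity.
    weight = sum(ITEMS[x][0] for x in chosen)
    if weight > CAPACITY:
        return Verdict(True, False, 0.0, False)

    # Stage 3: normalized objective quality relative to a reference value.
    value = sum(ITEMS[x][1] for x in chosen)
    quality = value / REFERENCE_VALUE
    return Verdict(True, True, quality, quality >= QUALITY_THRESHOLD)


if __name__ == "__main__":
    print(validate({"chosen": ["a", "c"]}))  # feasible and at the reference value
    print(validate({"chosen": ["b"]}))       # feasible but below the threshold
    print(validate({"chosen": ["b", "c"]}))  # violates the capacity constraint
    print(validate({"picked": ["a"]}))       # wrong schema
```

The second call shows the situation the abstract highlights: a submission can be feasible and still fail because its quality falls below the threshold. For scale, the reported 35.51% overall pass rate on 107 tasks corresponds to 38 passed tasks (38/107 ≈ 0.3551).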

Related papers