Fugu-MT paper translation (abstract): SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering

Paper abstract: SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering

arxiv url: http://arxiv.org/abs/2605.17526v1
Date: Sun, 17 May 2026 16:15:56 GMT
Status: translation complete
System update date: 2026-05-19 17:57:48.127696
Title: SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering
Title (reference translation): SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering
Authors: Qingnan Ren, Shun Zou, Shiting Huang, Ziao Zhang, Kou Shi, Zhen Fang, Yiming Zhao, Yu Zeng, Qisheng Su, Lin Chen, Yong Wang, Zehui Chen, Xiangxiang Chu, Feng Zhao
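Abstract summary: We introduce SaaSBench, the first benchmark designed to explore the boundaries of AI agents in enterprise SaaS engineering. It incorporates 8 programming languages, 6 databases, and 13 frameworks to closely mirror real-world software. The results show that the primary bottleneck for state-of-the-art agents is not generating isolated code logic but successfully configuring and integrating a multi-component system.
Reference score (independently computed attention score): 42.16295498118832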
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As autonomous coding agents become capable of handling increasingly long-horizon tasks, they have gradually demonstrated the potential to complete end-to-end software development. Although existing benchmarks have recently evolved from localized code editing to from-scratch project generation, they remain confined to structurally simplified, single-stack applications. Consequently, they fail to capture the heterogeneous environments, full-stack orchestration, and system-level complexity of real enterprise Software as a Service (SaaS) systems, leaving a critical gap in assessing agents under realistic engineering constraints. To fill this gap, we introduce SaaSBench, the first benchmark designed to explore the boundaries of AI agents in enterprise SaaS engineering. Spanning 30 complex tasks across 6 SaaS domains with 5,370 validation nodes, it incorporates 8 programming languages, 6 databases, and 13 frameworks to meticulously mirror real-world software heterogeneity. Furthermore, we design a dependency-aware hybrid evaluation paradigm tailored for complex systems with long horizons and multi-component coupling, enabling fine-grained, reproducible assessment. Crucially, our extensive experiments reveal a striking insight: the primary bottleneck for state-of-the-art agents is not generating isolated code logic, but successfully configuring and integrating a multi-component system. Over 95% of task failures occur before agents even reach deep business logic, with models often falling victim to overconfidence and prematurely halting during foundational system setup, or getting trapped in ineffective debugging loops. We hope SaaSBench serves as a practical and challenging testbed to drive the evolution of reliable, system-level coding agents. The code is available at https://github.com/ShadeCloak/SaaSbench.
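Abstract (reference translation): As autonomous coding agents become able to handle increasingly long-horizon tasks, they have gradually demonstrated the potential to complete end-to-end software development. Existing benchmarks have evolved from localized code editing to from-scratch project generation, but they remain limited to structurally simplified, single-stack applications. As a result, they cannot capture the heterogeneous environments, full-stack orchestration, and system-level complexity of real enterprise Software as a Service (SaaS) systems, leaving a critical gap in evaluating agents under realistic engineering constraints. To fill this gap, we introduce SaaSBench, the first benchmark for exploring the boundaries of AI agents in enterprise SaaS engineering. It spans 30 complex tasks across 6 SaaS domains with 5,370 validation nodes, and incorporates 8 programming languages, 6 databases, and 13 frameworks to closely mirror real-world software heterogeneity. In addition, we design a dependency-aware hybrid evaluation paradigm suited to complex systems with long horizons and multi-component coupling, enabling fine-grained, reproducible evaluation. The primary bottleneck for state-of-the-art agents is not generating isolated code logic but successfully configuring and integrating a multi-component system. Over 95% of task failures occur before agents reach deep business logic. We hope SaaSBench serves as a practical and challenging testbed to drive the evolution of reliable, system-level coding agents. The code is available at https://github.com/ShadeCloak/SaaSbench.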
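
Illustrative sketch (not from the paper): the abstract mentions a dependency-aware evaluation over validation nodes but does not describe its mechanics on this page. The Python below shows one plausible reading, assuming validation nodes form a dependency graph and a node earns credit only when its own check passes and every prerequisite has been credited. The function name `dependency_aware_score`, the node names, and the scoring rule are all hypothetical and are not SaaSBench's actual implementation.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def dependency_aware_score(deps, passed):
    """Hypothetical scoring rule, assumed for illustration only.

    deps:   maps each validation node to the set of nodes it depends on.
    passed: maps each node to the raw result of its own check.
    A node is credited only if its own check passed AND all of its
    prerequisites were credited, so an early setup failure propagates
    to everything built on top of it.
    """
    credited = {}
    # static_order() yields prerequisites before the nodes that need them.
    for node in TopologicalSorter(deps).static_order():
        prereqs_ok = all(credited[p] for p in deps.get(node, ()))
        credited[node] = passed.get(node, False) and prereqs_ok
    score = sum(credited.values()) / len(credited)
    return score, credited


# Made-up example: the API fails to boot, so downstream checks earn no
# credit even though their isolated logic would have passed.
deps = {
    "db_up": set(),
    "api_boot": {"db_up"},
    "login": {"api_boot"},
    "billing": {"api_boot"},
}
passed = {"db_up": True, "api_boot": False, "login": True, "billing": True}
score, credited = dependency_aware_score(deps, passed)
print(score, credited)
```

Under this assumed rule, a single failure in foundational setup zeroes out every dependent node, which is consistent with the abstract's observation that most failures occur before agents reach deep business logic.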
