Fugu-MT Paper Translation (Abstract): SWE-Bench Mobile: Can Large Language Model Agents Develop Industry-Level Mobile Applications?

Paper overview: SWE-Bench Mobile: Can Large Language Model Agents Develop Industry-Level Mobile Applications?

arxiv url: http://arxiv.org/abs/2602.09540v1
Date: Tue, 10 Feb 2026 08:51:11 GMT
Status: Translation complete
Last updated in system: 2026-02-11 20:17:43.459805
Title: SWE-Bench Mobile: Can Large Language Model Agents Develop Industry-Level Mobile Applications?
Title (reference translation): SWE-Bench Mobile: Can Large Language Model Agents Develop Industry-Level Mobile Apps?
Authors: Muxin Tian, Zhe Wang, Blair Yang, Zhenwei Tang, Kunlun Zhu, Honghua Dong, Hanchen Li, Xinni Xie, Guangjing Wang, Jiaxuan You
Abstract summary: SWE-Bench Mobile is a benchmark for evaluating coding agents on realistic software engineering tasks derived from a production iOS codebase. Unlike existing benchmarks that focus on isolated problems or bug fixes, SWE-Bench Mobile captures the full complexity of industrial development.
Reference score (independently calculated attention score): 21.241252187534055
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Can large language model agents develop industry-level mobile applications? We introduce SWE-Bench Mobile, a benchmark for evaluating coding agents on realistic software engineering tasks derived from a production iOS codebase. Unlike existing benchmarks that focus on isolated problems or bug fixes, SWE-Bench Mobile captures the full complexity of industrial development: multi-modal inputs (PRDs and Figma designs), a large-scale mixed Swift/Objective-C codebase, and comprehensive test suites. We evaluate 22 agent-model configurations across four coding agents, three commercial (Cursor, Codex, Claude Code) and one open-source (OpenCode), and find that even the best configurations achieve only a 12% task success rate. Our analysis reveals that (1) agent design matters as much as model capability, with the same model showing up to a 6× performance gap across agents, (2) commercial agents consistently outperform open-source alternatives, and (3) simple "Defensive Programming" prompts outperform complex ones by 7.4%. These findings highlight a significant gap between current agent capabilities and industrial requirements, while providing actionable insights for practitioners and researchers. We release SWE-Bench Mobile as a hosted benchmark challenge to prevent data contamination and ensure fair evaluation. The public leaderboard and development toolkit are available at https://swebenchmobile.com.
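Abstract (reference translation): Can large language model agents develop industry-level mobile applications? SWE-Bench Mobile is a benchmark for evaluating coding agents on realistic software engineering tasks derived from a production iOS codebase. Unlike existing benchmarks that focus on isolated problems or bug fixes, SWE-Bench Mobile captures the full complexity of industrial development, including multi-modal inputs (PRDs and Figma designs), a large-scale mixed Swift/Objective-C codebase, and comprehensive test suites. We evaluate 22 agent-model configurations across four coding agents, three commercial (Cursor, Codex, Claude Code) and one open-source (OpenCode), and find that even the best configurations achieve only a 12% task success rate. These findings highlight a significant gap between current agent capabilities and industrial requirements, while providing actionable insights for practitioners and researchers. We release SWE-Bench Mobile as a hosted benchmark challenge to prevent data contamination and ensure fair evaluation. The public leaderboard and development toolkit are available at https://swebenchmobile.com.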

