Fugu-MT Paper Translation (Summary): ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development

Paper summary: ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development

arxiv url: http://arxiv.org/abs/2601.11077v1
Date: Fri, 16 Jan 2026 08:23:52 GMT
Status: Translation complete
System update date: 2026-01-19 20:21:50.406158
Title: ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development
Authors: Jie Yang, Honglin Guo, Li Ji, Jiazheng Zhou, Rui Zheng, Zhikai Lei, Shuo Zhang, Zhiheng Xi, Shichun Liu, Yuxin Wang, Bo Wang, Yining Zheng, Tao Gui, Xipeng Qiu
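Abstract summary: This paper introduces ABC-Bench, a benchmark that evaluates agentic backend coding within a realistic, executable workflow. The authors curated 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories. Their evaluation shows that even state-of-the-art models struggle to deliver reliable performance on these holistic tasks.
Reference score (attention metric computed independently by this site): 72.4729759618632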
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The evolution of Large Language Models (LLMs) into autonomous agents has expanded the scope of AI coding from localized code generation to complex, repository-level, and execution-driven problem solving. However, current benchmarks predominantly evaluate code logic in static contexts, neglecting the dynamic, full-process requirements of real-world engineering, particularly in backend development, which demands rigorous environment configuration and service deployment. To address this gap, we introduce ABC-Bench, a benchmark explicitly designed to evaluate agentic backend coding within a realistic, executable workflow. Using a scalable automated pipeline, we curated 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories. Distinct from previous evaluations, ABC-Bench requires agents to manage the entire development lifecycle, from repository exploration to instantiating containerized services, and to pass external end-to-end API tests. Our extensive evaluation reveals that even state-of-the-art models struggle to deliver reliable performance on these holistic tasks, highlighting a substantial disparity between current model capabilities and the demands of practical backend engineering. Our code is available at https://github.com/OpenMOSS/ABC-Bench.
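To make the idea of an external end-to-end API test concrete, the following is a minimal Python sketch, not the ABC-Bench harness itself. It assumes a service is already listening on a local port and checks its HTTP responses from the outside. The stand-in server, the `/health` and `/missing` endpoints, and the helper names `start_stand_in` and `run_api_checks` are all hypothetical and chosen only for illustration; in the benchmark's setting the service would be the containerized backend that the agent deployed.

```python
import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class StandInService(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a backend service deployed by an agent."""

    def do_GET(self):
        if self.path == "/health":
            self._send_json(200, {"status": "ok"})
        else:
            self._send_json(404, {"error": "not found"})

    def _send_json(self, status, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def start_stand_in():
    """Start the stand-in service on a free local port (assumption: no container)."""
    server = HTTPServer(("127.0.0.1", 0), StandInService)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, f"http://127.0.0.1:{server.server_port}"


def run_api_checks(base_url, checks):
    """Run (path, expected_status, expected_json) checks against a live service.

    Returns one boolean per check. An unreachable service or a non-JSON
    body counts as a failed check rather than raising.
    """
    # Bypass any proxy configured in the environment, since the target is local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    results = []
    for path, expected_status, expected_json in checks:
        try:
            try:
                with opener.open(base_url + path, timeout=5) as resp:
                    status, raw = resp.status, resp.read()
            except urllib.error.HTTPError as err:
                # 4xx/5xx responses still carry a status code and a body.
                status, raw = err.code, err.read()
            body = json.loads(raw)
        except (OSError, ValueError):
            results.append(False)
            continue
        results.append(status == expected_status and body == expected_json)
    return results


if __name__ == "__main__":
    server, base_url = start_stand_in()
    checks = [
        ("/health", 200, {"status": "ok"}),
        ("/missing", 404, {"error": "not found"}),
    ]
    results = run_api_checks(base_url, checks)
    print(f"passed {sum(results)}/{len(results)} checks")
    server.shutdown()
    server.server_close()
```

The point of the sketch is that the checks only see the service through its HTTP interface, so a task passes only if the code, the environment configuration, and the deployment all work together.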

Related papers