DevOps-Gym: Benchmarking AI Agents in Software DevOps Cycle
- URL: http://arxiv.org/abs/2601.20882v1
- Date: Tue, 27 Jan 2026 18:43:46 GMT
- Title: DevOps-Gym: Benchmarking AI Agents in Software DevOps Cycle
- Authors: Yuheng Tang, Kaijie Zhu, Bonan Ruan, Chuqi Zhang, Michael Yang, Hongwei Li, Suyue Guo, Tianneng Shi, Zekun Li, Christopher Kruegel, Giovanni Vigna, Dawn Song, William Yang Wang, Lun Wang, Yangruibo Ding, Zhenkai Liang, Wenbo Guo
- Abstract summary: Handling the DevOps cycle in real-world software requires analyzing large-scale projects, understanding dynamic program behaviors, leveraging domain-specific tools, and making sequential decisions. We introduce DevOps-Gym, the first end-to-end benchmark for evaluating AI agents across core DevOps tasks.
- Score: 84.01703913780946
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although AI agents have demonstrated extraordinary capabilities in code generation and software issue resolution, their capabilities across the full software DevOps cycle remain unknown. Unlike pure code generation, handling the DevOps cycle of real-world software, including developing, deploying, and managing it, requires analyzing large-scale projects, understanding dynamic program behaviors, leveraging domain-specific tools, and making sequential decisions. Existing benchmarks, however, focus on isolated problems and lack environments and tool interfaces for DevOps. We introduce DevOps-Gym, the first end-to-end benchmark for evaluating AI agents across core DevOps workflows: build and configuration, monitoring, issue resolving, and test generation. DevOps-Gym includes 700+ real-world tasks collected from 30+ projects in Java and Go. We develop a semi-automated data collection mechanism with rigorous, non-trivial expert effort to ensure task coverage and quality. Our evaluation of state-of-the-art models and agents reveals fundamental limitations: they struggle with issue resolving and test generation in Java and Go, and cannot yet handle newer task types such as monitoring and build and configuration. These results highlight the need for further research into automating the full DevOps cycle with AI agents.
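To make the benchmark setup concrete, below is a minimal sketch of how a DevOps-Gym-style task could be represented and scored inside a container. The task schema, field names, image, and commands are illustrative assumptions, not the paper's actual format.

```python
# Hypothetical sketch of scoring a DevOps-Gym-style task. The schema,
# image name, and commands are illustrative assumptions, not the
# benchmark's real format.
import subprocess
from dataclasses import dataclass

@dataclass
class DevOpsTask:
    task_id: str
    category: str        # e.g. "build", "monitoring", "issue", "test-gen"
    image: str           # container image with the project checked out
    validate_cmd: str    # command whose exit code decides pass/fail

def evaluate(task: DevOpsTask, agent_patch: str) -> bool:
    """Apply the agent's proposed patch in a fresh container and run validation."""
    with open("/tmp/agent.patch", "w") as f:
        f.write(agent_patch)
    shell_cmd = f"git apply /tmp/agent.patch && {task.validate_cmd}"
    result = subprocess.run(
        ["docker", "run", "--rm",
         "-v", "/tmp/agent.patch:/tmp/agent.patch:ro",  # expose patch read-only
         task.image, "bash", "-lc", shell_cmd],
        capture_output=True, text=True,
    )
    return result.returncode == 0
```

A harness of this shape makes pass/fail depend only on a validation command's exit code, which keeps scoring uniform across build, monitoring, issue-resolving, and test-generation tasks.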
Related papers
- EmboCoach-Bench: Benchmarking AI Agents on Developing Embodied Robots [68.29056647487519]
Embodied AI is fueled by high-fidelity simulation and large-scale data collection. However, this scaling capability remains bottlenecked by a reliance on labor-intensive manual oversight. We introduce EmboCoach-Bench, a benchmark evaluating the capacity of LLM agents to autonomously engineer embodied policies.
arXiv Detail & Related papers (2026-01-29T11:33:49Z)
- IDE-Bench: Evaluating Large Language Models as IDE Agents on Real-World Software Engineering Tasks [0.37823923040445995]
We present a Dockerized test harness that goes beyond raw terminal execution. We provide high-level abstractions for search, structured file editing, and tools for testing full-stack applications. For evaluation, we created 80 tasks across eight never-published C/C++, Java, and MERN-stack codebases.
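As an illustration of what a "structured file editing" abstraction in such a harness could look like, here is a hedged sketch; the function name and anchor-based semantics are assumptions, not IDE-Bench's real tool API.

```python
# Illustrative sketch of a structured file-edit tool an IDE-agent harness
# might expose: it replaces an exact anchor snippet instead of line numbers.
# Name and semantics are assumptions, not IDE-Bench's actual interface.
from pathlib import Path

def edit_file(path: str, anchor: str, replacement: str) -> None:
    """Replace the first exact occurrence of `anchor` in the file at `path`.

    Failing loudly on a missing or ambiguous anchor keeps edits auditable,
    which is easier for an agent to recover from than a silent no-op.
    """
    text = Path(path).read_text()
    count = text.count(anchor)
    if count != 1:
        raise ValueError(f"anchor found {count} times; expected exactly 1")
    Path(path).write_text(text.replace(anchor, replacement, 1))
```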
arXiv Detail & Related papers (2026-01-28T02:06:37Z)
- ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development [72.4729759618632]
We introduce ABC-Bench, a benchmark to evaluate agentic backend coding within a realistic, executable workflow. We curated 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories. Our evaluation reveals that even state-of-the-art models struggle to deliver reliable performance on these holistic tasks.
arXiv Detail & Related papers (2026-01-16T08:23:52Z)
- Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents [71.85020581835042]
Large language model-based agents show promise for software engineering, but environment configuration remains a bottleneck. Existing benchmarks assess only end-to-end build/test success, obscuring where and why agents succeed or fail. We introduce Enconda-bench, which provides process-level trajectory assessment of fine-grained agent capabilities during environment setup planning.
arXiv Detail & Related papers (2025-10-29T16:59:07Z)
- Vibe Coding vs. Agentic Coding: Fundamentals and Practical Implications of Agentic AI [0.36868085124383626]
This review presents a comprehensive analysis of two emerging paradigms in AI-assisted software development: vibe coding and agentic coding. Vibe coding emphasizes intuitive, human-in-the-loop development through prompt-based, conversational interaction. Agentic coding enables autonomous software development through goal-driven agents capable of planning, executing, testing, and iterating on tasks with minimal human intervention.
arXiv Detail & Related papers (2025-05-26T03:00:21Z)
- TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks [55.03911355902567]
We introduce TheAgentCompany, a benchmark for evaluating AI agents that interact with the world in ways similar to those of a digital worker. We find that the most competitive agent can complete 30% of tasks autonomously. This paints a nuanced picture of task automation with LM agents in a setting simulating a real workplace.
arXiv Detail & Related papers (2024-12-18T18:55:40Z)
- Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework.
Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
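The dynamic-calling-chain idea can be illustrated with Python's built-in tracing hooks; the sketch below is a rough stand-in and does not reproduce Codev-Agent's actual pipeline.

```python
# Rough sketch of extracting a dynamic calling chain by running a unit test
# under a tracer. This only illustrates the general idea; Codev-Agent's real
# extraction pipeline is not reproduced here.
import sys

def trace_call_chain(test_fn):
    """Run `test_fn` under a tracer and return the functions it entered, in order."""
    chain = []

    def tracer(frame, event, arg):
        if event == "call":
            code = frame.f_code
            chain.append(f"{code.co_filename}:{code.co_name}")
        return tracer

    sys.settrace(tracer)
    try:
        test_fn()
    finally:
        sys.settrace(None)  # always detach the tracer
    return chain
```

Recording which functions a test actually enters is one way to ground completion tasks in real execution behavior rather than static heuristics.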
arXiv Detail & Related papers (2024-10-02T09:11:10Z)
- A Mixed Method Study of DevOps Challenges [2.2957483176038584]
We conduct an empirical study by applying topic modeling to 174K Stack Overflow (SO) posts that contain DevOps discussions.
We then validate and extend the empirical study findings with a survey of 21 professional DevOps practitioners.
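For readers unfamiliar with the technique, a minimal topic-modeling pass over post texts might look like the scikit-learn sketch below; the study's exact preprocessing, topic count, and hyperparameters are not reproduced here.

```python
# Minimal sketch of topic modeling over Stack Overflow-style post texts with
# LDA. The toy corpus, topic count, and settings are illustrative only.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

posts = [
    "docker compose fails to start the database service",
    "kubernetes pod stuck in pending state after deploy",
]  # toy stand-ins for the 174K-post corpus

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(posts)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(doc_term)

# Print the top words for each discovered topic.
terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [terms[j] for j in weights.argsort()[-3:][::-1]]
    print(f"topic {i}: {top}")
```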
arXiv Detail & Related papers (2024-03-25T05:35:40Z)
- AutoDev: Automated AI-Driven Development [9.586330606828643]
AutoDev is a fully automated AI-driven software development framework.
It enables users to define complex software engineering objectives, which are assigned to AutoDev's autonomous AI Agents.
AutoDev establishes a secure development environment by confining all operations within Docker containers.
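A hedged sketch of that confinement idea, using standard Docker isolation flags (the image and flag choices here are illustrative, not AutoDev's actual configuration):

```python
# Sketch of confining an agent's shell commands to a throwaway container,
# in the spirit of AutoDev's sandboxing. Image and flags are assumptions.
import subprocess

def run_sandboxed(cmd: str, image: str = "python:3.12-slim") -> subprocess.CompletedProcess:
    """Execute `cmd` in a disposable container with no network and a read-only rootfs."""
    return subprocess.run(
        ["docker", "run", "--rm",
         "--network", "none",   # no outbound access
         "--read-only",         # immutable root filesystem
         "--tmpfs", "/tmp",     # scratch space only
         image, "bash", "-lc", cmd],
        capture_output=True, text=True,
    )
```

Dropping the network and mounting the root filesystem read-only bounds the blast radius of any command an autonomous agent decides to run.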
arXiv Detail & Related papers (2024-03-13T07:12:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.