IDE-Bench: Evaluating Large Language Models as IDE Agents on Real-World Software Engineering Tasks
- URL: http://arxiv.org/abs/2601.20886v1
- Date: Wed, 28 Jan 2026 02:06:37 GMT
- Title: IDE-Bench: Evaluating Large Language Models as IDE Agents on Real-World Software Engineering Tasks
- Authors: Spencer Mateega, Jeff Yang, Tiana Costello, Shaurya Jadhav, Nicole Tian, Agustin Garcinuño,
- Abstract summary: We present a Dockerized test harness that goes beyond raw terminal execution. We provide high-level abstractions for codebase search, structured file editing, and tools for testing full-stack applications. For evaluation, we created 80 tasks across eight never-published repositories spanning C/C++, Java, and MERN stacks.
- Score: 0.37823923040445995
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: IDE-Bench is a comprehensive framework for evaluating AI IDE agents on real-world software engineering tasks through an IDE-native tool interface. We present a Dockerized test harness that goes beyond raw terminal execution, granting models a structured tool ecosystem representative of AI-native IDEs like Cursor and Windsurf. By providing high-level abstractions for codebase search, structured file editing, and tools for testing full-stack applications, IDE-Bench evaluates an agent's ability to act as a true engineering collaborator. For evaluation and to prevent training data contamination, we created 80 tasks across eight never-published repositories spanning C/C++, Java, and MERN stacks, representing production scenarios on modern tech stacks, including feature implementation, bug fixing, refactoring, and performance optimization tasks that mirror daily developer workflows in private codebases. Our benchmark is the first to systematically correlate agent-reported intent with successful project-level modifications in a multi-language, full-stack environment on completely uncontaminated code.
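The abstract describes an IDE-native tool layer (codebase search, structured file editing, and test execution inside a Dockerized harness) rather than a bare shell. The listing does not include the actual tool schema, so the snippet below is only a minimal sketch of what such a layer could look like; every function name, signature, and return shape here is an assumption, not the IDE-Bench API.

```python
# Minimal sketch of an IDE-native tool layer of the kind IDE-Bench describes:
# structured search, structured edits, and test execution exposed to an agent
# instead of a raw terminal. All names and shapes are hypothetical.
import re
import subprocess
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SearchHit:
    path: str
    line: int
    text: str


def codebase_search(root: str, pattern: str, glob: str = "*.*") -> list[SearchHit]:
    """Regex search over a repository, returning structured hits."""
    hits: list[SearchHit] = []
    for f in Path(root).rglob(glob):
        if not f.is_file():
            continue
        try:
            lines = f.read_text(errors="ignore").splitlines()
        except OSError:
            continue
        for i, line in enumerate(lines, start=1):
            if re.search(pattern, line):
                hits.append(SearchHit(str(f), i, line.strip()))
    return hits


def edit_file(path: str, old: str, new: str) -> bool:
    """Structured edit: replace an exact snippet instead of rewriting the file."""
    p = Path(path)
    text = p.read_text()
    if old not in text:
        return False  # signal the agent to re-read the file and retry
    p.write_text(text.replace(old, new, 1))
    return True


def run_tests(root: str, command: list[str]) -> dict:
    """Run the project's test suite (e.g. inside the Docker container) and
    return structured output the agent can reason over."""
    proc = subprocess.run(command, cwd=root, capture_output=True, text=True)
    return {
        "exit_code": proc.returncode,
        "stdout": proc.stdout[-4000:],
        "stderr": proc.stderr[-4000:],
    }
```

Under this reading, an agent alternates search, edit, and test calls, and the harness grades whether the final project state satisfies the task's checks, which is how the abstract's "project-level modifications" would be verified.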
Related papers
- DevOps-Gym: Benchmarking AI Agents in Software DevOps Cycle [84.01703913780946]
Handling the DevOps cycle in real-world software requires analyzing large-scale projects, understanding dynamic program behaviors, leveraging domain-specific tools, and making sequential decisions. We introduce DevOps-Gym, the first end-to-end benchmark for evaluating AI agents across core DevOps tasks.
arXiv Detail & Related papers (2026-01-27T18:43:46Z)
- ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development [72.4729759618632]
We introduce ABC-Bench, a benchmark to evaluate agentic backend coding within a realistic, executable workflow. We curated 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories. Our evaluation reveals that even state-of-the-art models struggle to deliver reliable performance on these holistic tasks.
arXiv Detail & Related papers (2026-01-16T08:23:52Z)
- SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents [49.73885480071402]
We introduce SWE-PolyBench, a new benchmark for repository-level, execution-based evaluation of coding agents. SWE-PolyBench contains 2110 instances from 21 repositories and includes tasks in Java (165), JavaScript (1017), TypeScript (729), and Python (199), covering bug fixes, feature additions, and code refactoring. Our experiments show that current agents exhibit uneven performance across languages and struggle with complex problems while showing higher performance on simpler tasks.
arXiv Detail & Related papers (2025-04-11T17:08:02Z)
- Programming with Pixels: Can Computer-Use Agents do Software Engineering? [24.011063667060792]
Programming with Pixels (PwP) is the first comprehensive computer-use environment for software engineering. PwP establishes software engineering as a natural domain for benchmarking whether generalist computer-use agents can reach specialist-level performance.
arXiv Detail & Related papers (2025-02-24T18:41:33Z)
- In-IDE Programming Courses: Learning Software Development in a Real-World Setting [5.330251011543498]
JetBrains recently released the JetBrains Academy plugin, which customizes the IDE for learners. We carried out eight one-hour interviews with students and developers who completed at least one course using the plugin.
arXiv Detail & Related papers (2025-01-29T16:34:22Z)
- Commit0: Library Generation from Scratch [77.38414688148006]
Commit0 is a benchmark that challenges AI agents to write libraries from scratch. Agents are provided with a specification document outlining the library's API as well as a suite of interactive unit tests. Commit0 also offers an interactive environment where models receive static analysis and execution feedback on the code they generate.
arXiv Detail & Related papers (2024-12-02T18:11:30Z)
- Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework.
Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
arXiv Detail & Related papers (2024-10-02T09:11:10Z)
- OpenHands: An Open Platform for AI Software Developers as Generalist Agents [109.8507367518992]
We introduce OpenHands, a platform for the development of AI agents that interact with the world in similar ways to a human developer. We describe how the platform allows for the implementation of new agents, safe interaction with sandboxed environments for code execution, and incorporation of evaluation benchmarks.
arXiv Detail & Related papers (2024-07-23T17:50:43Z)
- A New Generation of Intelligent Development Environments [0.0]
The practice of programming is undergoing a revolution with the introduction of AI-assisted development (copilots) and the creation of new programming languages.
This paper presents a vision for transforming the IDE from an Integrated Development Environment into an Intelligent Development Environment.
arXiv Detail & Related papers (2024-06-13T20:33:25Z)
- Prompting Large Language Models to Tackle the Full Software Development Lifecycle: A Case Study [72.24266814625685]
We explore the performance of large language models (LLMs) across the entire software development lifecycle with DevEval. DevEval features four programming languages, multiple domains, high-quality data collection, and carefully designed and verified metrics for each task. Empirical studies show that current LLMs, including GPT-4, fail to solve the challenges presented within DevEval.
arXiv Detail & Related papers (2024-03-13T15:13:44Z)
- Tool-Augmented LLMs as a Universal Interface for IDEs [0.768721532845575]
Large Language Models (LLMs) capable of both natural language dialogue and code generation have led to a discourse on the obsolescence of the concept of Integrated Development Environments (IDEs).
We envision a model that is able to perform complex actions involving multiple IDE features upon user command, stripping the user experience of the tedious work involved in searching through options and actions.
arXiv Detail & Related papers (2024-02-18T16:32:28Z)
- All You Need Is Logs: Improving Code Completion by Learning from Anonymous IDE Usage Logs [55.606644084003094]
We propose an approach for collecting completion usage logs from the users in an IDE.
We use them to train a machine-learning-based model for ranking completion candidates.
Our evaluation shows that using a simple ranking model trained on past user behavior logs significantly improved the code completion experience (a rough sketch of this ranking idea appears after this entry).
arXiv Detail & Related papers (2022-05-21T23:21:26Z)
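The last entry above trains a ranking model on anonymized completion-usage logs. As a rough, self-contained illustration of that idea (not the authors' actual features, log format, or model), a small classifier over toy candidate features can order completions by predicted acceptance:

```python
# Toy illustration of ranking completion candidates with a model trained on
# logged accept/reject events. The features and log format are invented for
# this example and are not taken from the paper.
from sklearn.linear_model import LogisticRegression


def features(candidate: str, prefix: str) -> list[float]:
    """Toy features: candidate length and whether it extends the typed prefix.
    A real system would use much richer signals from the usage logs."""
    return [float(len(candidate)), 1.0 if candidate.startswith(prefix) else 0.0]


# Hypothetical usage log: (typed_prefix, offered_candidate, was_accepted)
log = [
    ("pri", "print", 1),
    ("pri", "private", 0),
    ("ran", "range", 1),
    ("ran", "randint", 0),
]

X = [features(cand, prefix) for prefix, cand, _ in log]
y = [accepted for _, _, accepted in log]
model = LogisticRegression().fit(X, y)


def rank(prefix: str, candidates: list[str]) -> list[str]:
    """Order candidates by the model's predicted acceptance probability."""
    scored = [(model.predict_proba([features(c, prefix)])[0][1], c) for c in candidates]
    return [c for _, c in sorted(scored, reverse=True)]


print(rank("pri", ["private", "print", "printf"]))
```

The point of the sketch is only the shape of the pipeline the paper implies: featurize candidates, fit on accept/reject labels from the logs, and sort candidates by the predicted score.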