Related papers: FeatBench: Evaluating Coding Agents on Feature Implementation for Vibe Coding

FeatBench: Evaluating Coding Agents on Feature Implementation for Vibe Coding

URL: http://arxiv.org/abs/2509.22237v1
Date: Fri, 26 Sep 2025 11:47:50 GMT
Title: FeatBench: Evaluating Coding Agents on Feature Implementation for Vibe Coding
Authors: Haorui Chen, Chengze Li, Jia Li,
Abstract summary: FeatBench is a novel benchmark for vibe coding that focuses on feature implementation.<n> FeatBench is built on a multi-level filtering pipeline to ensure quality and a fully automated pipeline to evolve the benchmark.<n>Our evaluation reveals that feature implementation within the vibe coding paradigm is a significant challenge, with the highest success rate of only 29.94%.
Score: 11.846768103642583
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The rapid advancement of Large Language Models (LLMs) has given rise to a novel software development paradigm known as "vibe coding," where users interact with coding agents through high-level natural language. However, existing evaluation benchmarks for code generation inadequately assess an agent's vibe coding capabilities. Existing benchmarks are misaligned, as they either require code-level specifications or focus narrowly on issue-solving, neglecting the critical scenario of feature implementation within the vibe coding paradiam. To address this gap, we propose FeatBench, a novel benchmark for vibe coding that focuses on feature implementation. Our benchmark is distinguished by several key features: 1. Pure Natural Language Prompts. Task inputs consist solely of abstract natural language descriptions, devoid of any code or structural hints. 2. A Rigorous & Evolving Data Collection Process. FeatBench is built on a multi-level filtering pipeline to ensure quality and a fully automated pipeline to evolve the benchmark, mitigating data contamination. 3. Comprehensive Test Cases. Each task includes Fail-to-Pass (F2P) and Pass-to-Pass (P2P) tests to verify correctness and prevent regressions. 4. Diverse Application Domains. The benchmark includes repositories from diverse domains to ensure it reflects real-world scenarios. We evaluate two state-of-the-art agent frameworks with four leading LLMs on FeatBench. Our evaluation reveals that feature implementation within the vibe coding paradigm is a significant challenge, with the highest success rate of only 29.94%. Our analysis also reveals a tendency for "aggressive implementation," a strategy that paradoxically leads to both critical failures and superior software design. We release FeatBench, our automated collection pipeline, and all experimental results to facilitate further community research.

Related papers

BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing? [61.247730037229815]
We introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes - resolution scope and knowledge scope.<n>To investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities.<n>This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.
arXiv Detail & Related papers (2026-03-03T17:52:01Z)
ProjDevBench: Benchmarking AI Coding Agents on End-to-End Project Development [49.63491095660809]
ProjDevBench is an end-to-end benchmark that provides project requirements to coding agents and evaluates the resulting repositories.<n>We curate 20 programming problems across 8 categories, covering both concept-oriented tasks and real-world application scenarios.<n>Our evaluation reports an overall acceptance rate of 27.38%: agents handle basic functionality but struggle with complex system design, time optimization, and resource management.
arXiv Detail & Related papers (2026-02-02T05:17:23Z)
RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories [58.32028251925354]
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area.<n>We introduce RealSec-bench, a new benchmark for secure code generation meticulously constructed from real-world, high-risk Java repositories.
arXiv Detail & Related papers (2026-01-30T08:29:01Z)
ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development [72.4729759618632]
We introduce ABC-Bench, a benchmark to evaluate agentic backend coding within a realistic, executable workflow.<n>We curated 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories.<n>Our evaluation reveals that even state-of-the-art models struggle to deliver reliable performance on these holistic tasks.
arXiv Detail & Related papers (2026-01-16T08:23:52Z)
SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models [59.90381306452982]
evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developer.<n>We introduce SWE-1, a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured and production-aligned framework.<n>SWE- spans 8 task types, 8 programming scenarios, and 10 programming languages, with 2000 high-quality instances curated from authentic GitHub pull requests.
arXiv Detail & Related papers (2025-11-07T18:01:32Z)
CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments [1.3999481573773072]
We introduce a multi-language benchmark that evaluates instruction-following capabilities and is to operate on any set of standalone coding problems.<n>Our benchmark evaluates instruction following in two key settings: adherence to pre-defined constraints specified with the initial problem, and the ability to perform refinements based on follow-up instructions.
arXiv Detail & Related papers (2025-10-31T15:47:07Z)
Benchmarking Correctness and Security in Multi-Turn Code Generation [41.75392001830794]
We introduce MTSec, the first benchmark to evaluate correctness and security in multi-turn coding scenarios.<n>We evaluate 32 open- and closed-source models, and three agent-scaffolding on MT-Sec.<n>We find that while agent-generated scaffoldings boost single-turn code generation performance, they are not quite as effective in multiturn evaluations.
arXiv Detail & Related papers (2025-10-13T01:20:46Z)
You Don't Know Until You Click:Automated GUI Testing for Production-Ready Software Evaluation [24.956175875766952]
RealDevWorld is an evaluation framework for large language models (LLMs) and code agents in software development.<n>It features two key components: RealDevBench, a collection of 194 open-ended software engineering tasks, and AppEvalPilot, a new agent-as-a-judge evaluation system.<n> Empirical results show that RealDevWorld delivers effective, automatic, and human-aligned evaluations.
arXiv Detail & Related papers (2025-08-17T07:31:11Z)
Dynamic Benchmark Construction for Evaluating Large Language Models on Real-World Codes [33.80591142965565]
We present CODE2BENCH, a pipeline for dynamically constructing robust and contamination-resistant benchmarks from real-world GitHub repositories.<n>Specifically, CODE2BENCH introduces three key innovations: (1) Automated Dynamism, achieved through periodic ingestion of recent code to minimize training data contamination; (2) Scope Graph-based dependency analysis, which enables structured classification of functions into benchmark instances with controlled dependency levels; and (3) Property-Based Testing (PBT) for the automated synthesis of rigorous test suites.
arXiv Detail & Related papers (2025-08-10T05:06:36Z)
Evaluating Large Language Models on Non-Code Software Engineering Tasks [4.381476817430934]
Large Language Models (LLMs) have demonstrated remarkable capabilities in code understanding and generation.<n>We present the first comprehensive benchmark, which we name Software Engineering Language Understanding' (SELU)<n>SELU covers classification, regression, Named Entity Recognition (NER) and Masked Language Modeling (MLM) targets, with data drawn from diverse sources.
arXiv Detail & Related papers (2025-06-12T15:52:32Z)
Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality.<n>We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z)
Towards Practical Defect-Focused Automated Code Review [8.370750734081088]
We explore the full automation pipeline within the online recommendation service, analyzing industry-grade C++s.<n>We identify four key challenges: 1) capturing relevant context, 2) improving key inclusion, 3) reducing false alarm rates (FAR), and 4) integrating human bug slicing.<n>Our approach, validated on real-world merge requests from historical fault reports, achieves a 2x improvement over standard LLMs and a 10x gain over previous baselines.
arXiv Detail & Related papers (2025-05-23T14:06:26Z)
SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents [49.73885480071402]
We introduce SWE-PolyBench, a new benchmark for repository-level, execution-based evaluation of coding agents.<n>SWE-PolyBench contains 2110 instances from 21 repositories and includes tasks in Java (165), JavaScript (1017), TypeScript (729) and Python (199), covering bug fixes, feature additions, and code.<n>Our experiments show that current agents exhibit uneven performances across languages and struggle with complex problems while showing higher performance on simpler tasks.
arXiv Detail & Related papers (2025-04-11T17:08:02Z)
Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework. Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
arXiv Detail & Related papers (2024-10-02T09:11:10Z)
MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration [98.18244218156492]
Large Language Models (LLMs) have significantly advanced natural language processing.<n>As their applications expand into multi-agent environments, there arises a need for a comprehensive evaluation framework.<n>This work introduces a novel competition-based benchmark framework to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.