Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development
- URL: http://arxiv.org/abs/2603.04601v1
- Date: Wed, 04 Mar 2026 21:00:33 GMT
- Title: Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development
- Authors: Hung Tran, Langston Nashold, Rayan Krishnan, Antoine Bigeard, Alex Gu,
- Abstract summary: Existing benchmarks measure isolated tasks rather than the complete "zero-to-one" process of building a working application from scratch. We introduce Vibe Code Bench, a benchmark of 100 web application specifications with 964 browser-based workflows comprising 10,131 substeps. Our contributions include (1) a novel benchmark dataset and browser-based evaluation pipeline for end-to-end web application development, (2) a comprehensive evaluation of 16 frontier models with cost, latency, and error analysis, and (3) an evaluator alignment protocol with both cross-model and human annotation results.
- Score: 6.072381417546439
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code generation has emerged as one of AI's highest-impact use cases, yet existing benchmarks measure isolated tasks rather than the complete "zero-to-one" process of building a working application from scratch. We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 public validation, 50 held-out test) with 964 browser-based workflows comprising 10,131 substeps, evaluated against deployed applications by an autonomous browser agent. Across 16 frontier models, the best achieves only 58.0% accuracy on the test split, revealing that reliable end-to-end application development remains a frontier challenge. We identify self-testing during generation as a strong performance predictor (Pearson r=0.72), and show through a completed human alignment study that evaluator selection materially affects outcomes (31.8-93.6% pairwise step-level agreement). Our contributions include (1) a novel benchmark dataset and browser-based evaluation pipeline for end-to-end web application development, (2) a comprehensive evaluation of 16 frontier models with cost, latency, and error analysis, and (3) an evaluator alignment protocol with both cross-model and human annotation results.
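The abstract's two headline analysis numbers, the Pearson r=0.72 correlation between self-testing and accuracy and the 31.8-93.6% pairwise step-level agreement between evaluators, are both standard statistics. Below is a minimal sketch of how such figures could be computed; the toy data, variable names, and the simple percent-match definition of agreement are illustrative assumptions, not values or code from the paper.
```python
# Illustrative sketch only: toy numbers, not data from Vibe Code Bench.
from itertools import combinations
from scipy.stats import pearsonr

# Hypothetical per-model statistics: fraction of runs in which the model
# tested its own app during generation, and test-split accuracy.
self_test_rate = [0.05, 0.20, 0.35, 0.50, 0.65, 0.80]
accuracy = [0.21, 0.30, 0.38, 0.44, 0.52, 0.58]
r, p = pearsonr(self_test_rate, accuracy)
print(f"self-testing vs. accuracy: Pearson r = {r:.2f} (p = {p:.3f})")

# Hypothetical pass/fail verdicts from three candidate evaluator models on the
# same eight workflow substeps; pairwise step-level agreement is the fraction
# of substeps on which two evaluators give the same verdict.
verdicts = {
    "evaluator_a": [1, 1, 0, 1, 0, 1, 1, 0],
    "evaluator_b": [1, 0, 0, 1, 0, 1, 1, 1],
    "evaluator_c": [1, 1, 1, 1, 0, 0, 1, 0],
}
for (name_x, x), (name_y, y) in combinations(verdicts.items(), 2):
    agreement = sum(a == b for a, b in zip(x, y)) / len(x)
    print(f"{name_x} vs {name_y}: {agreement:.1%} step-level agreement")
```
Whether the benchmark defines agreement exactly this way (raw percent match rather than a chance-corrected statistic such as Cohen's Kappa) is not stated in the abstract, so the percent-match definition above is an assumption.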
Related papers
- ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices [17.39388308538324]
This paper introduces ProactiveMobile, a benchmark for proactive mobile agent development. It formalizes the proactive task as inferring latent user intent across four dimensions of on-device contextual signals. In experiments, the proposed approach achieves a success rate of 19.15%, outperforming o1 (15.71%) and GPT-5 (7.39%).
arXiv Detail & Related papers (2026-02-25T12:32:37Z)
- Multi-Agent LLM Committees for Autonomous Software Beta Testing [0.0]
The framework combines model diversity, persona-driven behavioral variation, and visual user interface understanding. Vision-enabled agents successfully identify user interface elements, with navigation and reporting achieving 100 percent success. The framework enables reproducible research and practical deployment of LLM-based software testing in CI/CD pipelines.
arXiv Detail & Related papers (2025-12-21T02:06:53Z)
- Catching UX Flaws in Code: Leveraging LLMs to Identify Usability Flaws at the Development Stage [0.0]
This paper investigates whether large language models (LLMs) can provide reliable and consistent usability assessments at the development stage. We generated over 850 evaluations, with three independent evaluations per site, using a pipeline built on OpenAI's GPT-4o. For issue detection, the model demonstrated moderate consistency, with an average pairwise Cohen's Kappa of 0.50 and an exact agreement of 84%.
arXiv Detail & Related papers (2025-12-03T21:02:54Z)
- Testing the Testers: Human-Driven Quality Assessment of Voice AI Testing Platforms [0.0]
We present the first systematic framework for evaluating voice AI testing quality through human-centered benchmarking. Our methodology addresses the fundamental dual challenge of testing platforms: generating realistic test conversations (simulation quality) and accurately evaluating agent responses (evaluation quality).
arXiv Detail & Related papers (2025-11-06T07:22:58Z)
- Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains [97.5573252172065]
We train a family of Automatic Reasoning Evaluators (FARE) with a simple iterative rejection-sampling supervised finetuning approach. FARE-8B challenges larger specialized RL-trained evaluators, and FARE-20B sets the new standard for open-source evaluators. As inference-time rerankers, FARE-20B achieves near-oracle performance on MATH.
arXiv Detail & Related papers (2025-10-20T17:52:06Z)
- How can we assess human-agent interactions? Case studies in software agent design [52.953425368394306]
We make two major steps towards the rigorous assessment of human-agent interactions. We propose PULSE, a framework for more efficient human-centric evaluation of agent designs. We deploy the framework on a large-scale web platform built around the open-source software agent OpenHands.
arXiv Detail & Related papers (2025-10-10T19:04:28Z)
- LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads? [5.835205320809048]
LiveOIBench is a benchmark featuring 403 Olympiad-level competitive programming problems with an average of 60 expert-designed test cases. The problems are sourced directly from 72 official Informatics Olympiads in different regions conducted between 2023 and 2025. LiveOIBench distinguishes itself through four key features, including meticulously curated high-quality tasks with detailed subtasks and extensive private test cases.
arXiv Detail & Related papers (2025-10-10T17:54:24Z)
- OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks [52.87238755666243]
We present OmniEAR, a framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. We model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints.
arXiv Detail & Related papers (2025-08-07T17:54:15Z)
- The BrowserGym Ecosystem for Web Agent Research [151.90034093362343]
The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents. We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature. We conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks.
arXiv Detail & Related papers (2024-12-06T23:43:59Z)
- Preference Optimization for Reasoning with Pseudo Feedback [100.62603571434167]
We introduce a novel approach to generate pseudo feedback for reasoning tasks by framing the labeling of solutions as an evaluation against associated test cases. We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks.
arXiv Detail & Related papers (2024-11-25T12:44:02Z)
- Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z)