Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development
- URL: http://arxiv.org/abs/2603.04601v1
- Date: Wed, 04 Mar 2026 21:00:33 GMT
- Title: Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development
- Authors: Hung Tran, Langston Nashold, Rayan Krishnan, Antoine Bigeard, Alex Gu,
- Abstract summary: Existing benchmarks measure isolated tasks rather than the complete "zero-to-one" process of building a working application from scratch. We introduce Vibe Code Bench, a benchmark of 100 web application specifications with 964 browser-based workflows comprising 10,131 substeps. Our contributions include (1) a novel benchmark dataset and browser-based evaluation pipeline for end-to-end web application development, (2) a comprehensive evaluation of 16 frontier models with cost, latency, and error analysis, and (3) an evaluator alignment protocol with both cross-model and human annotation results.
- Score: 6.072381417546439
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code generation has emerged as one of AI's highest-impact use cases, yet existing benchmarks measure isolated tasks rather than the complete "zero-to-one" process of building a working application from scratch. We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 public validation, 50 held-out test) with 964 browser-based workflows comprising 10,131 substeps, evaluated against deployed applications by an autonomous browser agent. Across 16 frontier models, the best achieves only 58.0% accuracy on the test split, revealing that reliable end-to-end application development remains a frontier challenge. We identify self-testing during generation as a strong performance predictor (Pearson r=0.72), and show through a completed human alignment study that evaluator selection materially affects outcomes (31.8-93.6% pairwise step-level agreement). Our contributions include (1) a novel benchmark dataset and browser-based evaluation pipeline for end-to-end web application development, (2) a comprehensive evaluation of 16 frontier models with cost, latency, and error analysis, and (3) an evaluator alignment protocol with both cross-model and human annotation results.
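The abstract's two headline analysis numbers, the Pearson r=0.72 correlation between self-testing and accuracy and the 31.8-93.6% pairwise step-level agreement between evaluators, are both standard statistics. Below is a minimal sketch of how such figures could be computed; the toy data, variable names, and the simple percent-match definition of agreement are illustrative assumptions, not values or code from the paper.
```python
# Illustrative sketch only: toy numbers, not data from Vibe Code Bench.
from itertools import combinations
from scipy.stats import pearsonr

# Hypothetical per-model statistics: fraction of runs in which the model
# tested its own app during generation, and test-split accuracy.
self_test_rate = [0.05, 0.20, 0.35, 0.50, 0.65, 0.80]
accuracy = [0.21, 0.30, 0.38, 0.44, 0.52, 0.58]
r, p = pearsonr(self_test_rate, accuracy)
print(f"self-testing vs. accuracy: Pearson r = {r:.2f} (p = {p:.3f})")

# Hypothetical pass/fail verdicts from three candidate evaluator models on the
# same eight workflow substeps; pairwise step-level agreement is the fraction
# of substeps on which two evaluators give the same verdict.
verdicts = {
    "evaluator_a": [1, 1, 0, 1, 0, 1, 1, 0],
    "evaluator_b": [1, 0, 0, 1, 0, 1, 1, 1],
    "evaluator_c": [1, 1, 1, 1, 0, 0, 1, 0],
}
for (name_x, x), (name_y, y) in combinations(verdicts.items(), 2):
    agreement = sum(a == b for a, b in zip(x, y)) / len(x)
    print(f"{name_x} vs {name_y}: {agreement:.1%} step-level agreement")
```
Whether the benchmark defines agreement exactly this way (raw percent match rather than a chance-corrected statistic such as Cohen's Kappa) is not stated in the abstract, so the percent-match definition above is an assumption.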
Related papers
- ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices [17.39388308538324]
This paper introduces ProactiveMobile, a benchmark for proactive mobile agent development. It formalizes the proactive task as inferring latent user intent across four dimensions of on-device contextual signals. In experiments, the proposed approach achieves a success rate of 19.15%, outperforming o1 (15.71%) and GPT-5 (7.39%).
arXiv Detail & Related papers (2026-02-25T12:32:37Z)
- Multi-Agent LLM Committees for Autonomous Software Beta Testing [0.0]
The framework combines model diversity, persona-driven behavioral variation, and visual user interface understanding. Vision-enabled agents successfully identify user interface elements, with navigation and reporting achieving 100 percent success. The framework enables reproducible research and practical deployment of LLM-based software testing in CI/CD pipelines.
arXiv Detail & Related papers (2025-12-21T02:06:53Z)
- Catching UX Flaws in Code: Leveraging LLMs to Identify Usability Flaws at the Development Stage [0.0]
This paper investigates whether large language models (LLMs) can provide reliable and consistent usability assessments at the development stage. We generated over 850 evaluations, with three independent evaluations per site, using a pipeline built on OpenAI's GPT-4o. For issue detection, the model demonstrated moderate consistency, with an average pairwise Cohen's Kappa of 0.50 and an exact agreement of 84%.
arXiv Detail & Related papers (2025-12-03T21:02:54Z)
- Testing the Testers: Human-Driven Quality Assessment of Voice AI Testing Platforms [0.0]
We present the first systematic framework for evaluating voice AI testing quality through human-centered benchmarking. Our methodology addresses the fundamental dual challenge of testing platforms: generating realistic test conversations (simulation quality) and accurately evaluating agent responses (evaluation quality).
arXiv Detail & Related papers (2025-11-06T07:22:58Z)
- Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains [97.5573252172065]
We train a family of Automatic Reasoning Evaluators (FARE) with a simple iterative rejection-sampling supervised finetuning approach. FARE-8B challenges larger specialized RL-trained evaluators, and FARE-20B sets the new standard for open-source evaluators. As inference-time rerankers, FARE-20B achieves near-oracle performance on MATH.
arXiv Detail & Related papers (2025-10-20T17:52:06Z)
- How can we assess human-agent interactions? Case studies in software agent design [52.953425368394306]
We make two major steps towards the rigorous assessment of human-agent interactions. We propose PULSE, a framework for more efficient human-centric evaluation of agent designs. We deploy the framework on a large-scale web platform built around the open-source software agent OpenHands.
arXiv Detail & Related papers (2025-10-10T19:04:28Z)
- LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads? [5.835205320809048]
LiveOIBench is a benchmark featuring 403 Olympiad-level competitive programming problems with an average of 60 expert-designed test cases. The problems are sourced directly from 72 official Informatics Olympiads in different regions conducted between 2023 and 2025. LiveOIBench distinguishes itself through four key features, including meticulously curated high-quality tasks with detailed subtasks and extensive private test cases.
arXiv Detail & Related papers (2025-10-10T17:54:24Z)
- OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks [52.87238755666243]
We present OmniEAR, a framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. We model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints.
arXiv Detail & Related papers (2025-08-07T17:54:15Z)
- The BrowserGym Ecosystem for Web Agent Research [151.90034093362343]
The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents. We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature. We conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks.
arXiv Detail & Related papers (2024-12-06T23:43:59Z)
- Preference Optimization for Reasoning with Pseudo Feedback [100.62603571434167]
We introduce a novel approach to generate pseudo feedback for reasoning tasks by framing the labeling of solutions as an evaluation against associated test cases. We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks.
arXiv Detail & Related papers (2024-11-25T12:44:02Z)
- Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z)