From Prompt to Product: A Human-Centered Benchmark of Agentic App Generation Systems
- URL: http://arxiv.org/abs/2512.18080v1
- Date: Fri, 19 Dec 2025 21:37:15 GMT
- Title: From Prompt to Product: A Human-Centered Benchmark of Agentic App Generation Systems
- Authors: Marcos Ortiz, Justin Hill, Collin Overbay, Ingrida Semenec, Frederic Sauve-Hoover, Jim Schwoebel, Joel Shor,
- Abstract summary: Agentic AI systems capable of generating full-stack web applications from natural language prompts represent a significant shift in software development.<n>It is unclear how existing prompt-to-app tools compare under realistic, human-centered evaluation criteria.<n>We introduce a human-centered benchmark for evaluating prompt-to-app systems and conduct a large-scale comparative study of three widely used platforms.
- Score: 1.2273967746497585
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Agentic AI systems capable of generating full-stack web applications from natural language prompts ("prompt- to-app") represent a significant shift in software development. However, evaluating these systems remains challenging, as visual polish, functional correctness, and user trust are often misaligned. As a result, it is unclear how existing prompt-to-app tools compare under realistic, human-centered evaluation criteria. In this paper, we introduce a human-centered benchmark for evaluating prompt-to-app systems and conduct a large-scale comparative study of three widely used platforms: Replit, Bolt, and Firebase Studio. Using a diverse set of 96 prompts spanning common web application tasks, we generate 288 unique application artifacts. We evaluate these systems through a large-scale human-rater study involving 205 participants and 1,071 quality-filtered pairwise comparisons, assessing task-based ease of use, visual appeal, perceived completeness, and user trust. Our results show that these systems are not interchangeable: Firebase Studio consistently outperforms competing platforms across all human-evaluated dimensions, achieving the highest win rates for ease of use, trust, visual appeal, and visual appropriateness. Bolt performs competitively on visual appeal but trails Firebase on usability and trust, while Replit underperforms relative to both across most metrics. These findings highlight a persistent gap between visual polish and functional reliability in prompt-to-app systems and demonstrate the necessity of interactive, task-based evaluation. We release our benchmark framework, prompt set, and generated artifacts to support reproducible evaluation and future research in agentic application generation.
Related papers
- DEEP: Docker-based Execution and Evaluation Platform [0.8666275811953877]
Our proposed software (DEEP) automates the execution and scoring of machine translation and optical character recognition models.<n>DEEP is prepared to receive dockerized systems, run them (extracting information at that same time), and assess hypothesis against some references.<n>It uses a clustering algorithm based on a statistical analysis of the significance of the results yielded by each model.
arXiv Detail & Related papers (2026-02-23T08:08:57Z) - FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback [92.67587639164908]
We present FronTalk, a benchmark for front-end code generation with multi-modal feedback.<n>We focus on the front-end development task and curate FronTalk, a collection of 100 multi-turn dialogues.<n> Evaluation of 20 models reveals two key challenges that are under-explored systematically in the literature.
arXiv Detail & Related papers (2025-12-05T23:28:09Z) - Saving SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation [5.332969177132911]
Current benchmarks for evaluating software engineering agents, such as SWE-Bench Verified, are predominantly derived from GitHub issues.<n>We introduce a novel benchmarking framework that transforms existing formal benchmarks into realistic user queries.
arXiv Detail & Related papers (2025-10-10T04:42:02Z) - You Don't Know Until You Click:Automated GUI Testing for Production-Ready Software Evaluation [24.956175875766952]
RealDevWorld is an evaluation framework for large language models (LLMs) and code agents in software development.<n>It features two key components: RealDevBench, a collection of 194 open-ended software engineering tasks, and AppEvalPilot, a new agent-as-a-judge evaluation system.<n> Empirical results show that RealDevWorld delivers effective, automatic, and human-aligned evaluations.
arXiv Detail & Related papers (2025-08-17T07:31:11Z) - Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling [83.78874399606379]
We propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling.<n>It comprises four distinct small-scale agents, with clearly defined roles and effective collaboration.<n>It shows superior performance with a smaller parameter scale without sacrificing the ability of general and mathematical tasks.
arXiv Detail & Related papers (2025-08-05T12:52:09Z) - ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation [51.297873393639456]
ArtifactsBench is a framework for automated visual code generation evaluation.<n>Our framework renders each generated artifact and captures its dynamic behavior through temporal screenshots.<n>We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading Large Language Models.
arXiv Detail & Related papers (2025-07-07T12:53:00Z) - RefHCM: A Unified Model for Referring Perceptions in Human-Centric Scenarios [60.772871735598706]
RefHCM (Referring Human-Centric Model) is a framework to integrate a wide range of human-centric referring tasks.<n> RefHCM employs sequence mergers to convert raw multimodal data -- including images, text, coordinates, and parsing maps -- into semantic tokens.<n>This work represents the first attempt to address referring human perceptions with a general-purpose framework.
arXiv Detail & Related papers (2024-12-19T08:51:57Z) - The BrowserGym Ecosystem for Web Agent Research [151.90034093362343]
BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents.<n>We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature.<n>We conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks.
arXiv Detail & Related papers (2024-12-06T23:43:59Z) - Benchopt: Reproducible, efficient and collaborative optimization
benchmarks [67.29240500171532]
Benchopt is a framework to automate, reproduce and publish optimization benchmarks in machine learning.
Benchopt simplifies benchmarking for the community by providing an off-the-shelf tool for running, sharing and extending experiments.
arXiv Detail & Related papers (2022-06-27T16:19:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.