You Don't Know Until You Click:Automated GUI Testing for Production-Ready Software Evaluation
- URL: http://arxiv.org/abs/2508.14104v1
- Date: Sun, 17 Aug 2025 07:31:11 GMT
- Title: You Don't Know Until You Click:Automated GUI Testing for Production-Ready Software Evaluation
- Authors: Yutong Bian, Xianhao Lin, Yupeng Xie, Tianyang Liu, Mingchen Zhuge, Siyuan Lu, Haoming Tang, Jinlin Wang, Jiayi Zhang, Jiaqi Chen, Xiangru Tang, Yongxin Ni, Sirui Hong, Chenglin Wu,
- Abstract summary: RealDevWorld is an evaluation framework for large language models (LLMs) and code agents in software development.<n>It features two key components: RealDevBench, a collection of 194 open-ended software engineering tasks, and AppEvalPilot, a new agent-as-a-judge evaluation system.<n> Empirical results show that RealDevWorld delivers effective, automatic, and human-aligned evaluations.
- Score: 24.956175875766952
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) and code agents in software development are rapidly evolving from generating isolated code snippets to producing full-fledged software applications with graphical interfaces, interactive logic, and dynamic behaviors. However, current benchmarks fall short in evaluating such production-ready software, as they often rely on static checks or binary pass/fail scripts, failing to capture the interactive behaviors and runtime dynamics that define real-world usability - qualities that only emerge when an application is actively used. This is the blind spot of current evaluation: you don't know if an app works until you click through it, interact with it, and observe how it responds. To bridge this gap, we introduce RealDevWorld, a novel evaluation framework for automated end-to-end assessment of LLMs' ability to generate production-ready repositories from scratch. It features two key components: (1) RealDevBench, a diverse collection of 194 open-ended software engineering tasks across multiple domains, incorporating multimodal elements to reflect real-world complexity; and (2) AppEvalPilot, a new agent-as-a-judge evaluation system that simulates realistic, GUI-based user interactions to automatically and holistically assess software functional correctness, visual fidelity, and runtime behavior. The framework delivers fine-grained, task-specific diagnostic feedback, supporting nuanced evaluation beyond simple success/failure judgments. Empirical results show that RealDevWorld delivers effective, automatic, and human-aligned evaluations, achieving an accuracy of 0.92 and a correlation of 0.85 with expert human assessments, while significantly reducing the reliance on manual review. This enables scalable, human-aligned assessment of production-level software generated by LLMs. Our code is available on GitHub.
Related papers
- LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering [90.84806758077536]
We introduce textbfLoCoBench-Agent, a comprehensive evaluation framework specifically designed to assess large language models (LLMs) agents in realistic, long-context software engineering.<n>Our framework extends LoCoBench's 8,000 scenarios into interactive agent environments, enabling systematic evaluation of multi-turn conversations.<n>Our framework provides agents with 8 specialized tools (file operations, search, code analysis) and evaluates them across context lengths ranging from 10K to 1M tokens.
arXiv Detail & Related papers (2025-11-17T23:57:24Z) - InspectCoder: Dynamic Analysis-Enabled Self Repair through interactive LLM-Debugger Collaboration [71.18377595277018]
Large Language Models (LLMs) frequently generate buggy code with complex logic errors that are challenging to diagnose.<n>We present InspectCoder, the first agentic program repair system that empowers LLMs to actively conduct dynamic analysis via interactive debugger control.
arXiv Detail & Related papers (2025-10-21T06:26:29Z) - AppForge: From Assistant to Independent Developer - Are GPTs Ready for Software Development? [28.63033734662797]
APPFORGE is a benchmark consisting of 101 software development problems drawn from real-world Android apps.<n>We design a multi-agent system to automatically summarize the main functionalities from app documents and navigate the app to synthesize test cases.<n>Following rigorous manual verification by Android development experts, APPFORGE incorporates the test cases within an automated evaluation framework.
arXiv Detail & Related papers (2025-10-09T03:26:05Z) - FeatBench: Evaluating Coding Agents on Feature Implementation for Vibe Coding [11.846768103642583]
FeatBench is a novel benchmark for vibe coding that focuses on feature implementation.<n> FeatBench is built on a multi-level filtering pipeline to ensure quality and a fully automated pipeline to evolve the benchmark.<n>Our evaluation reveals that feature implementation within the vibe coding paradigm is a significant challenge, with the highest success rate of only 29.94%.
arXiv Detail & Related papers (2025-09-26T11:47:50Z) - Dynamic Benchmark Construction for Evaluating Large Language Models on Real-World Codes [33.80591142965565]
We present CODE2BENCH, a pipeline for dynamically constructing robust and contamination-resistant benchmarks from real-world GitHub repositories.<n>Specifically, CODE2BENCH introduces three key innovations: (1) Automated Dynamism, achieved through periodic ingestion of recent code to minimize training data contamination; (2) Scope Graph-based dependency analysis, which enables structured classification of functions into benchmark instances with controlled dependency levels; and (3) Property-Based Testing (PBT) for the automated synthesis of rigorous test suites.
arXiv Detail & Related papers (2025-08-10T05:06:36Z) - Automated Validation of LLM-based Evaluators for Software Engineering Artifacts [0.7548538278943616]
REFINE (Ranking Evaluators for FIne grained Nuanced Evaluation) is an automated framework for benchmarking large language models (LLMs)<n> REFINE applies novel generation techniques to automatically synthesize artifacts with progressively reduced quality.<n>It quantifies each candidate evaluator configuration by measuring how closely its rankings align with expected ordering.
arXiv Detail & Related papers (2025-08-04T18:52:01Z) - ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation [48.24550684610705]
ArtifactsBench is a framework for automated visual code generation evaluation.<n>Our framework renders each generated artifact and captures its dynamic behavior through temporal screenshots.<n>We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading Large Language Models.
arXiv Detail & Related papers (2025-07-07T12:53:00Z) - Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation [83.92224427735859]
We introduce a pre-operative critic mechanism that provides effective feedback prior to the actual execution.<n>We develop a reasoning-bootstrapping based data collection pipeline to create a GUI-Critic-Train and a GUI-Critic-Test.<n>Our model offers significant advantages in critic accuracy compared to current MLLMs.
arXiv Detail & Related papers (2025-06-05T04:12:36Z) - Skill Discovery for Software Scripting Automation via Offline Simulations with LLMs [63.10710876536337]
We propose an offline simulation framework to curate a software-specific skillset, a collection of verified scripts.<n>Our framework comprises two components: (1) task creation, using top-down functionality and bottom-up API synergy exploration to generate helpful tasks.<n> Experiments with Adobe Illustrator demonstrate that our framework significantly improves automation success rates, reduces response time, and saves runtime token costs.
arXiv Detail & Related papers (2025-04-29T04:03:37Z) - TestAgent: A Framework for Domain-Adaptive Evaluation of LLMs via Dynamic Benchmark Construction and Exploratory Interaction [29.72874725703848]
Large language models (LLMs) are increasingly deployed to various vertical domains.<n>Current evaluation methods rely on static and resource-intensive datasets that are not aligned with real-world requirements.<n>We introduce two key concepts: textbfBenchmark+, which extends the traditional question-answer benchmark into a more flexible strategy-criterion'' format.<n>We propose textbftextscTestAgent, an agent-based evaluation framework that implements these concepts using retrieval-augmented generation and reinforcement learning.
arXiv Detail & Related papers (2024-10-15T11:20:42Z) - Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework.
Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
arXiv Detail & Related papers (2024-10-02T09:11:10Z) - Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction [28.53259866617677]
We introduce Mobile-Env, a comprehensive toolkit tailored for creating GUI benchmarks in the Android mobile environment.
We collect an open-world task set across various real-world apps and a fixed world set, WikiHow, which captures a significant amount of dynamic online contents.
Our findings reveal that even advanced models struggle with tasks that are relatively simple for humans.
arXiv Detail & Related papers (2023-05-14T12:31:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.