WebGen-Agent: Enhancing Interactive Website Generation with Multi-Level Feedback and Step-Level Reinforcement Learning
- URL: http://arxiv.org/abs/2509.22644v1
- Date: Fri, 26 Sep 2025 17:59:51 GMT
- Authors: Zimu Lu, Houxing Ren, Yunqiao Yang, Ke Wang, Zhuofan Zong, Junting Pan, Mingjie Zhan, Hongsheng Li
- Abstract summary: WebGen-Agent is a novel website-generation agent that leverages comprehensive and multi-level visual feedback. We introduce Step-GRPO with Screenshot and GUI-agent Feedback to improve the ability of LLMs to act as the reasoning engine of WebGen-Agent. WebGen-Agent increases the accuracy of Claude-3.5-Sonnet from 26.4% to 51.9% and its appearance score from 3.0 to 3.9, outperforming the previous state-of-the-art agent system.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Agent systems powered by large language models (LLMs) have demonstrated impressive performance on repository-level code-generation tasks. However, for tasks such as website codebase generation, which depend heavily on visual effects and user-interaction feedback, current code agents rely only on simple code execution for feedback and verification. This approach fails to capture the actual quality of the generated code. In this paper, we propose WebGen-Agent, a novel website-generation agent that leverages comprehensive and multi-level visual feedback to iteratively generate and refine the website codebase. Detailed and expressive text descriptions and suggestions regarding the screenshots and GUI-agent testing of the websites are generated by a visual language model (VLM), together with scores that quantify their quality. The screenshot and GUI-agent scores are further integrated with a backtracking and select-best mechanism, enhancing the performance of the agent. Utilizing the accurate visual scores inherent in the WebGen-Agent workflow, we further introduce Step-GRPO with Screenshot and GUI-agent Feedback to improve the ability of LLMs to act as the reasoning engine of WebGen-Agent. By using the screenshot and GUI-agent scores at each step as the reward in Step-GRPO, we provide a dense and reliable process supervision signal, which effectively improves the model's website-generation ability. On the WebGen-Bench dataset, WebGen-Agent increases the accuracy of Claude-3.5-Sonnet from 26.4% to 51.9% and its appearance score from 3.0 to 3.9, outperforming the previous state-of-the-art agent system. Additionally, our Step-GRPO training approach increases the accuracy of Qwen2.5-Coder-7B-Instruct from 38.9% to 45.4% and raises the appearance score from 3.4 to 3.7.
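The step-level reward idea described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the score blend, the weights, and all function names are hypothetical, and GRPO-style training additionally involves policy-ratio and KL-regularization terms that are omitted here. The sketch only shows how per-step screenshot and GUI-agent scores could be combined into a scalar reward and normalized within a group of sampled rollouts to form advantages.

```python
# Hypothetical sketch: combine visual-feedback scores into a step reward
# and compute GRPO-style group-relative advantages. Weights and names are
# assumptions for illustration, not taken from the paper.
from statistics import mean, pstdev

def step_reward(screenshot_score: float, gui_score: float,
                w_screenshot: float = 0.5, w_gui: float = 0.5) -> float:
    """Blend the VLM screenshot score and GUI-agent score into one reward."""
    return w_screenshot * screenshot_score + w_gui * gui_score

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each rollout's reward against the group (GRPO-style)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:  # all rollouts scored identically -> no learning signal
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# One group of sampled rollouts for the same website-generation step:
rewards = [step_reward(3.0, 0.4), step_reward(3.9, 0.8), step_reward(3.4, 0.5)]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group mean, rollouts whose screenshots and GUI tests score above average receive positive advantage and are reinforced, giving the dense per-step supervision signal the abstract refers to.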
Related papers
- GenAgent: Scaling Text-to-Image Generation via Agentic Multimodal Reasoning
We introduce GenAgent, unifying visual understanding and generation through an agentic multimodal model. GenAgent significantly boosts base generator (FLUX.1-dev) performance on GenEval++ and WISE. Our framework demonstrates three key properties: 1) cross-tool generalization to generators with varying capabilities, 2) test-time scaling with consistent improvements across interaction rounds, and 3) task-adaptive reasoning that automatically adjusts to different tasks.
arXiv Detail & Related papers (2026-01-26T14:49:04Z) - It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents
Web-based agents powered by large language models are increasingly used for tasks such as email management or professional networking. Their reliance on dynamic web content makes them vulnerable to prompt injection attacks: adversarial instructions hidden in interface elements that persuade the agent to divert from its original task. We introduce the Task-Redirecting Agent Persuasion Benchmark (TRAP), an evaluation for studying how persuasion techniques misguide autonomous web agents on realistic tasks.
arXiv Detail & Related papers (2025-12-29T01:09:10Z) - WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch
We introduce WebGen-Bench, a novel benchmark designed to measure an LLM-based agent's ability to create multi-file websites from scratch. It contains diverse instructions for website generation, created through the combined efforts of human annotators and GPT-4o. We use GPT-4o to generate test cases targeting each functionality described in the instructions, and then manually filter, adjust, and organize them to ensure accuracy, resulting in 647 test cases.
arXiv Detail & Related papers (2025-05-06T17:59:15Z) - A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models
This paper introduces a novel multi-agent framework that automates the end-to-end production of Qinqiang opera by integrating Large Language Models, visual generation, and Text-to-Speech synthesis. In a case study on Dou E Yuan, the system achieved expert ratings of 3.8 for script fidelity, 3.5 for visual coherence, and 3.8 for speech accuracy, culminating in an overall score of 3.6, a 0.3-point improvement over a single-agent baseline.
arXiv Detail & Related papers (2025-04-22T03:14:29Z) - Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents
Computer use agents automate digital tasks by directly interacting with graphical user interfaces (GUIs) on computers and mobile devices. We introduce Agent S2, a novel compositional framework that delegates cognitive responsibilities across various generalist and specialist models. Agent S2 establishes new state-of-the-art (SOTA) performance on three prominent computer use benchmarks.
arXiv Detail & Related papers (2025-04-01T15:40:27Z) - PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC
In this paper, we propose a hierarchical agent framework named PC-Agent. From the perception perspective, we devise an Active Perception Module (APM) to overcome the inadequate abilities of current MLLMs in perceiving screenshot content. From the decision-making perspective, to handle complex user instructions and interdependent subtasks more effectively, we propose a hierarchical multi-agent collaboration architecture.
arXiv Detail & Related papers (2025-02-20T05:41:55Z) - nvAgent: Automated Data Visualization from Natural Language via Collaborative Agent Workflow
Natural Language to Visualization (NL2Vis) seeks to convert natural-language descriptions into visual representations of given tables. We propose a collaborative agent workflow, termed nvAgent, for NL2Vis. Comprehensive evaluations on the new VisEval benchmark demonstrate that nvAgent consistently surpasses state-of-the-art baselines.
arXiv Detail & Related papers (2025-02-07T16:03:08Z) - PlotGen: Multi-Agent LLM-based Scientific Data Visualization via Multimodal Feedback
We propose PlotGen, a novel multi-agent framework aimed at the creation of precise scientific visualizations. PlotGen orchestrates multiple retrieval agents, including a Query Planning Agent that breaks down complex user requests into executable code, and three retrieval feedback agents. Experiments show that PlotGen outperforms strong baselines, achieving a 4-6 percent improvement on the MatBench dataset.
arXiv Detail & Related papers (2025-02-03T02:00:29Z) - AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials
Existing approaches rely on expensive human annotation, making them unsustainable at scale. We propose AgentTrek, a scalable data synthesis pipeline that generates web agent trajectories by leveraging publicly available tutorials. Our fully automated approach significantly reduces data collection costs, achieving a cost of just $0.55 per high-quality trajectory without human annotators.
arXiv Detail & Related papers (2024-12-12T18:59:27Z) - Dissecting Adversarial Robustness of Multimodal LM Agents
We manually create 200 targeted adversarial tasks and evaluation scripts in a realistic threat model on top of VisualWebArena. We find that we can successfully break the latest agents that use black-box frontier LMs, including those that perform reflection and tree search. We also use ARE to rigorously evaluate how the robustness changes as new components are added.
arXiv Detail & Related papers (2024-06-18T17:32:48Z) - CogAgent: A Visual Language Model for GUI Agents
We introduce CogAgent, a visual language model (VLM) specializing in GUI understanding and navigation. By utilizing both low-resolution and high-resolution image encoders, CogAgent supports input at a resolution of 1120×1120. CogAgent achieves state-of-the-art results on five text-rich and four general VQA benchmarks, including VQAv2, OK-VQA, Text-VQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE.
arXiv Detail & Related papers (2023-12-14T13:20:57Z)