FullStack-Agent: Enhancing Agentic Full-Stack Web Coding via Development-Oriented Testing and Repository Back-Translation
- URL: http://arxiv.org/abs/2602.03798v1
- Date: Tue, 03 Feb 2026 18:01:34 GMT
- Title: FullStack-Agent: Enhancing Agentic Full-Stack Web Coding via Development-Oriented Testing and Repository Back-Translation
- Authors: Zimu Lu, Houxing Ren, Yunqiao Yang, Ke Wang, Zhuofan Zong, Mingjie Zhan, Hongsheng Li
- Abstract summary: FullStack-Agent is a unified agent system for full-stack agentic coding. It consists of three parts, including FullStack-Dev, a multi-agent framework with strong planning, code editing, navigation, and bug localization abilities. Our FullStack-Dev outperforms the previous state-of-the-art method by 8.7%, 38.2%, and 15.9% on the frontend, backend, and database test cases respectively.
- Score: 48.18394873529704
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Assisting non-expert users to develop complex interactive websites has become a popular task for LLM-powered code agents. However, existing code agents tend to only generate frontend web pages, masking the lack of real full-stack data processing and storage with fancy visual effects. Notably, constructing production-level full-stack web applications is far more challenging than only generating frontend web pages, demanding careful control of data flow, comprehensive understanding of constantly updating packages and dependencies, and accurate localization of obscure bugs in the codebase. To address these difficulties, we introduce FullStack-Agent, a unified agent system for full-stack agentic coding that consists of three parts: (1) FullStack-Dev, a multi-agent framework with strong planning, code editing, codebase navigation, and bug localization abilities. (2) FullStack-Learn, an innovative data-scaling and self-improving method that back-translates crawled and synthesized website repositories to improve the backbone LLM of FullStack-Dev. (3) FullStack-Bench, a comprehensive benchmark that systematically tests the frontend, backend and database functionalities of the generated website. Our FullStack-Dev outperforms the previous state-of-the-art method by 8.7%, 38.2%, and 15.9% on the frontend, backend, and database test cases respectively. Additionally, FullStack-Learn raises the performance of a 30B model by 9.7%, 9.5%, and 2.8% on the three sets of test cases through self-improvement, demonstrating the effectiveness of our approach. The code is released at https://github.com/mnluzimu/FullStack-Agent.
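The repository back-translation idea in FullStack-Learn can be illustrated with a minimal sketch (not the authors' code): given an existing website repository, an LLM is prompted to reconstruct the natural-language instruction that could have produced it, yielding (instruction, repository) pairs for fine-tuning the backbone model. The `summarize_repo` helper and the prompt template below are hypothetical assumptions, not taken from the paper.

```python
from pathlib import Path


def summarize_repo(repo_root: str, max_files: int = 20) -> str:
    """Collect a compact textual summary of a repository's file tree."""
    paths = sorted(Path(repo_root).rglob("*"))
    files = [p for p in paths if p.is_file()][:max_files]
    return "\n".join(str(p.relative_to(repo_root)) for p in files)


def build_back_translation_prompt(repo_summary: str) -> str:
    """Assemble a prompt asking an LLM to infer the original task
    that could have produced the given repository."""
    return (
        "You are given the file tree of a full-stack website repository.\n"
        "Write the user instruction that a developer could have been\n"
        "given to produce this project, covering frontend, backend, and\n"
        "database requirements.\n\n"
        f"File tree:\n{repo_summary}\n"
    )
```

The resulting prompt would be sent to an LLM; its answer, paired with the repository, forms one synthetic training example for self-improvement.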
Related papers
- ProjDevBench: Benchmarking AI Coding Agents on End-to-End Project Development [49.63491095660809]
ProjDevBench is an end-to-end benchmark that provides project requirements to coding agents and evaluates the resulting repositories. We curate 20 programming problems across 8 categories, covering both concept-oriented tasks and real-world application scenarios. Our evaluation reports an overall acceptance rate of 27.38%: agents handle basic functionality but struggle with complex system design, time optimization, and resource management.
arXiv Detail & Related papers (2026-02-02T05:17:23Z)
- Explore to Evolve: Scaling Evolved Aggregation Logic via Proactive Online Exploration for Deep Research Agents [70.77400371166922]
Deep research web agents need to rigorously analyze and aggregate knowledge for insightful research. We propose an Explore to Evolve paradigm to scalably construct verifiable training data for web agents. Based on an open-source agent framework, SmolAgents, we collect supervised fine-tuning trajectories to develop a series of foundation models.
arXiv Detail & Related papers (2025-10-16T08:37:42Z)
- Where LLM Agents Fail and How They can Learn From Failures [62.196870049524364]
Large Language Model (LLM) agents have shown promise in solving complex, multi-step tasks. They amplify vulnerability to cascading failures, where a single root-cause error propagates through subsequent decisions. Current systems lack a framework that can comprehensively understand agent errors in a modular and systemic way. We introduce the AgentErrorTaxonomy, a modular classification of failure modes spanning memory, reflection, planning, action, and system-level operations.
arXiv Detail & Related papers (2025-09-29T18:20:27Z)
- WebGen-Agent: Enhancing Interactive Website Generation with Multi-Level Feedback and Step-Level Reinforcement Learning [51.14454312533818]
WebGen-Agent is a novel website-generation agent that leverages comprehensive and multi-level visual feedback. We introduce Step-GRPO with Screenshot and GUI-agent Feedback to improve the ability of LLMs to act as the reasoning engine of WebGen-Agent. WebGen-Agent increases the accuracy of Claude-3.5-Sonnet from 26.4% to 51.9% and its appearance score from 3.0 to 3.9, outperforming the previous state-of-the-art agent system.
arXiv Detail & Related papers (2025-09-26T17:59:51Z)
- WebLists: Extracting Structured Information From Complex Interactive Websites Using Executable LLM Agents [1.6673034682613495]
We present WebLists, a benchmark of 200 data-extraction tasks across four common business and enterprise use-cases. We show that both LLMs with search capabilities and SOTA web agents struggle with these tasks, with a recall of 3% and 31%, respectively. We propose BardeenAgent, a novel framework that enables web agents to convert their execution into repeatable programs, and replay them at scale across pages with similar structure. On the WebLists benchmark, BardeenAgent achieves 66% recall overall, more than doubling the performance of SOTA web agents, and reducing cost per output row by 3x.
arXiv Detail & Related papers (2025-04-17T06:16:40Z)
- FullStack Bench: Evaluating LLMs as Full Stack Coders [111.95274348807051]
FullStack Bench focuses on full-stack programming, which encompasses a wide range of application domains. To assess multilingual programming capabilities, in FullStack Bench, we design real-world instructions and corresponding unit test cases from 16 widely-used programming languages.
arXiv Detail & Related papers (2024-11-30T16:58:42Z)
- Mind2Web: Towards a Generalist Agent for the Web [25.363429937913065]
Mind2Web is the first dataset for developing and evaluating generalist agents for the web.
With over 2,000 open-ended tasks collected from 137 websites spanning 31 domains, Mind2Web provides three necessary ingredients for building generalist web agents.
Based on Mind2Web, we conduct an initial exploration of using large language models (LLMs) for building generalist web agents.
arXiv Detail & Related papers (2023-06-09T17:44:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.