WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation
- URL: http://arxiv.org/abs/2511.06251v1
- Date: Sun, 09 Nov 2025 06:58:52 GMT
- Title: WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation
- Authors: Mingde Xu, Zhen Yang, Wenyi Hong, Lihang Pan, Xinyue Fan, Yan Wang, Xiaotao Gu, Bin Xu, Jie Tang
- Abstract summary: We propose WebVIA, the first agentic framework for interactive UI-to-Code generation and validation. The framework comprises three components: 1) an exploration agent to capture multi-state UI screenshots; 2) a UI2Code model that generates executable interactive code; 3) a validation module that verifies the interactivity.
- Score: 30.193562985137813
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: User interface (UI) development requires translating design mockups into functional code, a process that remains repetitive and labor-intensive. While recent Vision-Language Models (VLMs) automate UI-to-Code generation, they generate only static HTML/CSS/JavaScript layouts lacking interactivity. To address this, we propose WebVIA, the first agentic framework for interactive UI-to-Code generation and validation. The framework comprises three components: 1) an exploration agent to capture multi-state UI screenshots; 2) a UI2Code model that generates executable interactive code; 3) a validation module that verifies the interactivity. Experiments demonstrate that WebVIA-Agent achieves more stable and accurate UI exploration than general-purpose agents (e.g., Gemini-2.5-Pro). In addition, our fine-tuned WebVIA-UI2Code models exhibit substantial improvements in generating executable and interactive HTML/CSS/JavaScript code, outperforming their base counterparts across both interactive and static UI2Code benchmarks. Our code and models are available at https://zheny2751-dotcom.github.io/webvia.github.io/.
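The abstract describes a three-stage pipeline (explore, generate, validate). As a rough illustration of how such a loop could be orchestrated, the Python sketch below wires the three stages together. It is not the authors' released code: the function names (explore_ui_states, generate_interactive_code, validate_interactivity), their signatures, and the regenerate-on-failure loop are hypothetical stand-ins for the WebVIA-Agent, the WebVIA-UI2Code model, and the validation module.

```python
# Minimal sketch of the explore -> generate -> validate loop described in the
# abstract. All three stage functions are hypothetical stand-ins, not the
# released WebVIA code; the retry-on-failure loop is likewise an assumption.
from dataclasses import dataclass


@dataclass
class UIState:
    """One captured UI state: a screenshot plus the action that reached it."""
    screenshot_png: bytes
    triggering_action: str  # e.g. "click #submit-button", or "initial"


def explore_ui_states(url: str, max_states: int = 8) -> list[UIState]:
    """Stand-in for the exploration agent: drive the live page (e.g. with a
    browser-automation tool) and capture screenshots of distinct states."""
    return [UIState(screenshot_png=b"", triggering_action="initial")]


def generate_interactive_code(states: list[UIState]) -> str:
    """Stand-in for the UI2Code model: turn the multi-state screenshots into
    one HTML/CSS/JavaScript document with working event handlers."""
    return "<!doctype html><html><body></body></html>"


def validate_interactivity(html: str, states: list[UIState]) -> bool:
    """Stand-in for the validation module: replay each triggering action on
    the generated page and check the result matches the captured state."""
    return True


def ui_to_code(url: str, max_rounds: int = 3) -> str:
    """Run the full pipeline, regenerating until validation passes."""
    states = explore_ui_states(url)
    for _ in range(max_rounds):
        html = generate_interactive_code(states)
        if validate_interactivity(html, states):
            return html
    raise RuntimeError(f"code failed validation after {max_rounds} rounds")
```

In the actual framework the exploration agent would drive a live browser and the generator would be a fine-tuned VLM; the stubs above only fix the interfaces between the three stages.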
Related papers
- EmbeWebAgent: Embedding Web Agents into Any Customized UI [3.034887612600091]
We present EmbeWebAgent, a framework for embedding agents directly into existing UIs. It supports mixed-granularity actions ranging from primitives to higher-level composites. Our demo shows minimal retrofitting effort and robust multi-step behaviors grounded in a live UI setting.
arXiv Detail & Related papers (2026-02-16T15:59:56Z) - FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback [92.67587639164908]
We present FronTalk, a benchmark for front-end code generation with multi-modal feedback. We focus on the front-end development task and curate FronTalk, a collection of 100 multi-turn dialogues. Evaluation of 20 models reveals two key challenges that remain systematically under-explored in the literature.
arXiv Detail & Related papers (2025-12-05T23:28:09Z) - UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation [29.248471527003915]
We present UI2Code^N, a visual language model trained through staged pretraining, fine-tuning, and reinforcement learning. The model unifies three key capabilities: UI-to-code generation, UI editing, and UI polishing. Experiments on UI-to-code and UI polishing benchmarks show that UI2Code^N establishes a new state of the art among open-source models.
arXiv Detail & Related papers (2025-11-11T13:00:09Z) - ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents [40.697759330690815]
ScreenCoder is a modular multi-agent framework that decomposes the task into three interpretable stages: grounding, planning, and generation. By assigning these distinct responsibilities to specialized agents, our framework achieves significantly higher robustness and fidelity than end-to-end approaches. Our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness.
arXiv Detail & Related papers (2025-07-30T16:41:21Z) - UICopilot: Automating UI Synthesis via Hierarchical Code Generation from Webpage Designs [43.006316221657904]
This paper proposes a novel approach to automating the synthesis of User Interfaces (UIs) via hierarchical code generation from webpage designs. The core idea of UICopilot is to decompose the generation process into two stages: first generating the coarse-grained HTML structure, then generating the fine-grained code. Experimental results demonstrate that UICopilot significantly outperforms existing baselines in both automatic and human evaluations.
arXiv Detail & Related papers (2025-05-15T02:09:54Z) - InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection [38.833925781308665]
We introduce InfiGUIAgent, an MLLM-based GUI agent trained with a two-stage supervised fine-tuning pipeline. Stage 1 enhances fundamental skills such as GUI understanding and grounding, while Stage 2 integrates hierarchical reasoning and expectation-reflection reasoning skills. InfiGUIAgent achieves competitive performance on several GUI benchmarks.
arXiv Detail & Related papers (2025-01-08T15:45:21Z) - Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
Aguvis is a vision-based framework for autonomous GUI agents. It standardizes cross-platform interactions and incorporates structured reasoning via inner monologue. It achieves state-of-the-art performance across offline and real-world online benchmarks.
arXiv Detail & Related papers (2024-12-05T18:58:26Z) - ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity.
We develop ShowUI, a vision-language-action model for the digital world, which features the following innovations.
ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z) - Interaction2Code: Benchmarking MLLM-based Interactive Webpage Code Generation from Interactive Prototyping [57.024913536420264]
Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance on the design-to-code task. We present the first systematic investigation of MLLMs in generating interactive webpages.
arXiv Detail & Related papers (2024-11-05T17:40:03Z) - Beyond Browsing: API-Based Web Agents [58.39129004543844]
API-Based Agents outperform web Browsing Agents in experiments on WebArena. Hybrid Agents outperform both nearly uniformly across tasks. Results strongly suggest that when APIs are available, they present an attractive alternative to relying on web browsing alone.
arXiv Detail & Related papers (2024-10-21T19:46:06Z) - Automatically Generating UI Code from Screenshot: A Divide-and-Conquer-Based Approach [51.522121376987634]
We propose DCGen, a divide-and-conquer-based approach to automate the translation of webpage designs to UI code. We show that DCGen achieves up to a 15% improvement in visual similarity and 8% in code similarity for large input images. Human evaluations show that DCGen helps developers implement webpages significantly faster and with greater similarity to the UI designs.
arXiv Detail & Related papers (2024-06-24T07:58:36Z)