WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation
- URL: http://arxiv.org/abs/2511.06251v1
- Date: Sun, 09 Nov 2025 06:58:52 GMT
- Title: WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation
- Authors: Mingde Xu, Zhen Yang, Wenyi Hong, Lihang Pan, Xinyue Fan, Yan Wang, Xiaotao Gu, Bin Xu, Jie Tang
- Abstract summary: We propose WebVIA, the first agentic framework for interactive UI-to-Code generation and validation. The framework comprises three components: 1) an exploration agent to capture multi-state UI screenshots; 2) a UI2Code model that generates executable interactive code; 3) a validation module that verifies the interactivity.
- Score: 30.193562985137813
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: User interface (UI) development requires translating design mockups into functional code, a process that remains repetitive and labor-intensive. While recent Vision-Language Models (VLMs) automate UI-to-Code generation, they generate only static HTML/CSS/JavaScript layouts lacking interactivity. To address this, we propose WebVIA, the first agentic framework for interactive UI-to-Code generation and validation. The framework comprises three components: 1) an exploration agent to capture multi-state UI screenshots; 2) a UI2Code model that generates executable interactive code; 3) a validation module that verifies the interactivity. Experiments demonstrate that WebVIA-Agent achieves more stable and accurate UI exploration than general-purpose agents (e.g., Gemini-2.5-Pro). In addition, our fine-tuned WebVIA-UI2Code models exhibit substantial improvements in generating executable and interactive HTML/CSS/JavaScript code, outperforming their base counterparts across both interactive and static UI2Code benchmarks. Our code and models are available at https://zheny2751-dotcom.github.io/webvia.github.io/.
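The abstract describes a three-stage pipeline (explore, generate, validate). As a rough illustration of how such a loop could be orchestrated, the Python sketch below wires the three stages together. It is not the authors' released code: the function names (explore_ui_states, generate_interactive_code, validate_interactivity), their signatures, and the regenerate-on-failure loop are hypothetical stand-ins for the WebVIA-Agent, the WebVIA-UI2Code model, and the validation module.

```python
# Minimal sketch of the explore -> generate -> validate loop described in the
# abstract. All three stage functions are hypothetical stand-ins, not the
# released WebVIA code; the retry-on-failure loop is likewise an assumption.
from dataclasses import dataclass


@dataclass
class UIState:
    """One captured UI state: a screenshot plus the action that reached it."""
    screenshot_png: bytes
    triggering_action: str  # e.g. "click #submit-button", or "initial"


def explore_ui_states(url: str, max_states: int = 8) -> list[UIState]:
    """Stand-in for the exploration agent: drive the live page (e.g. with a
    browser-automation tool) and capture screenshots of distinct states."""
    return [UIState(screenshot_png=b"", triggering_action="initial")]


def generate_interactive_code(states: list[UIState]) -> str:
    """Stand-in for the UI2Code model: turn the multi-state screenshots into
    one HTML/CSS/JavaScript document with working event handlers."""
    return "<!doctype html><html><body></body></html>"


def validate_interactivity(html: str, states: list[UIState]) -> bool:
    """Stand-in for the validation module: replay each triggering action on
    the generated page and check the result matches the captured state."""
    return True


def ui_to_code(url: str, max_rounds: int = 3) -> str:
    """Run the full pipeline, regenerating until validation passes."""
    states = explore_ui_states(url)
    for _ in range(max_rounds):
        html = generate_interactive_code(states)
        if validate_interactivity(html, states):
            return html
    raise RuntimeError(f"code failed validation after {max_rounds} rounds")
```

In the actual framework the exploration agent would drive a live browser and the generator would be a fine-tuned VLM; the stubs above only fix the interfaces between the three stages.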
Related papers
- EmbeWebAgent: Embedding Web Agents into Any Customized UI [3.034887612600091]
We present EmbeWebAgent, a framework for embedding agents directly into existing UIs. It supports mixed-granularity actions ranging from primitives to higher-level composites. Our demo shows minimal retrofitting effort and robust multi-step behaviors grounded in a live UI setting.
arXiv Detail & Related papers (2026-02-16T15:59:56Z) - FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback [92.67587639164908]
We present FronTalk, a benchmark for front-end code generation with multi-modal feedback. We focus on the front-end development task and curate FronTalk, a collection of 100 multi-turn dialogues. Evaluation of 20 models reveals two key challenges that remain systematically under-explored in the literature.
arXiv Detail & Related papers (2025-12-05T23:28:09Z) - UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation [29.248471527003915]
We present UI2Code^N, a visual language model trained through staged pretraining, fine-tuning, and reinforcement learning. The model unifies three key capabilities: UI-to-code generation, UI editing, and UI polishing. Experiments on UI-to-code and UI polishing benchmarks show that UI2Code^N establishes a new state of the art among open-source models.
arXiv Detail & Related papers (2025-11-11T13:00:09Z) - ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents [40.697759330690815]
ScreenCoder is a modular multi-agent framework that decomposes the task into three interpretable stages: grounding, planning, and generation. By assigning these distinct responsibilities to specialized agents, our framework achieves significantly higher robustness and fidelity than end-to-end approaches. Our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness.
arXiv Detail & Related papers (2025-07-30T16:41:21Z) - UICopilot: Automating UI Synthesis via Hierarchical Code Generation from Webpage Designs [43.006316221657904]
This paper proposes a novel approach to automating the synthesis of User Interfaces (UIs) via hierarchical code generation from webpage designs. The core idea of UICopilot is to decompose the generation process into two stages: first generating the coarse-grained HTML structure, then generating the fine-grained code. Experimental results demonstrate that UICopilot significantly outperforms existing baselines in both automatic and human evaluations.
arXiv Detail & Related papers (2025-05-15T02:09:54Z) - InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection [38.833925781308665]
We introduce InfiGUIAgent, an MLLM-based GUI agent trained with a two-stage supervised fine-tuning pipeline. Stage 1 enhances fundamental skills such as GUI understanding and grounding, while Stage 2 integrates hierarchical reasoning and expectation-reflection reasoning skills. InfiGUIAgent achieves competitive performance on several GUI benchmarks.
arXiv Detail & Related papers (2025-01-08T15:45:21Z) - Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
Aguvis is a vision-based framework for autonomous GUI agents. It standardizes cross-platform interactions and incorporates structured reasoning via inner monologue. It achieves state-of-the-art performance across offline and real-world online benchmarks.
arXiv Detail & Related papers (2024-12-05T18:58:26Z) - ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity.
We develop ShowUI, a vision-language-action model for the digital world, which features the following innovations.
ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z) - Interaction2Code: Benchmarking MLLM-based Interactive Webpage Code Generation from Interactive Prototyping [57.024913536420264]
Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance on the design-to-code task. We present the first systematic investigation of MLLMs in generating interactive webpages.
arXiv Detail & Related papers (2024-11-05T17:40:03Z) - Beyond Browsing: API-Based Web Agents [58.39129004543844]
API-Based Agents outperform web Browsing Agents in experiments on WebArena. Hybrid Agents outperform both nearly uniformly across tasks. Results strongly suggest that when APIs are available, they present an attractive alternative to relying on web browsing alone.
arXiv Detail & Related papers (2024-10-21T19:46:06Z) - Automatically Generating UI Code from Screenshot: A Divide-and-Conquer-Based Approach [51.522121376987634]
We propose DCGen, a divide-and-conquer-based approach to automate the translation of webpage designs to UI code. We show that DCGen achieves up to a 15% improvement in visual similarity and 8% in code similarity for large input images. Human evaluations show that DCGen helps developers implement webpages significantly faster and with greater similarity to the UI designs.
arXiv Detail & Related papers (2024-06-24T07:58:36Z)