UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation
- URL: http://arxiv.org/abs/2511.08195v2
- Date: Fri, 14 Nov 2025 10:44:26 GMT
- Title: UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation
- Authors: Zhen Yang, Wenyi Hong, Mingde Xu, Xinyue Fan, Weihan Wang, Jiele Cheng, Xiaotao Gu, Jie Tang,
- Abstract summary: We present UI2Code$^\text{N}$, a visual language model trained through staged pretraining, fine-tuning, and reinforcement learning. The model unifies three key capabilities: UI-to-code generation, UI editing, and UI polishing. Experiments on UI-to-code and UI polishing benchmarks show that UI2Code$^\text{N}$ establishes a new state of the art among open-source models.
- Score: 29.248471527003915
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: User interface (UI) programming is a core yet highly complex part of modern software development. Recent advances in visual language models (VLMs) highlight the potential of automatic UI coding, but current approaches face two key limitations: multimodal coding capabilities remain underdeveloped, and single-turn paradigms make little use of iterative visual feedback. We address these challenges with an interactive UI-to-code paradigm that better reflects real-world workflows and raises the upper bound of achievable performance. Under this paradigm, we present UI2Code$^\text{N}$, a visual language model trained through staged pretraining, fine-tuning, and reinforcement learning to achieve foundational improvements in multimodal coding. The model unifies three key capabilities: UI-to-code generation, UI editing, and UI polishing. We further explore test-time scaling for interactive generation, enabling systematic use of multi-turn feedback. Experiments on UI-to-code and UI polishing benchmarks show that UI2Code$^\text{N}$ establishes a new state of the art among open-source models and achieves performance comparable to leading closed-source models such as Claude-4-Sonnet and GPT-5. Our code and models are available at https://github.com/zai-org/UI2Code_N.
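The interactive paradigm described in the abstract can be pictured as a render-and-polish loop whose number of feedback rounds acts as the test-time scaling knob. Below is a minimal sketch of such a loop; the model calls (generate_code, polish_code), the renderer, and the similarity scorer are hypothetical placeholders, not the released model's API.

```python
# Sketch of an interactive UI-to-code loop with multi-turn visual feedback.
# All callables are assumed stand-ins; only the loop structure is illustrated.
from typing import Callable

def interactive_ui_to_code(
    target_screenshot: bytes,
    generate_code: Callable[[bytes], str],            # screenshot -> initial code
    polish_code: Callable[[str, bytes, bytes], str],  # (code, target, render) -> revised code
    render: Callable[[str], bytes],                   # code -> rendered screenshot
    score: Callable[[bytes, bytes], float],           # visual similarity in [0, 1]
    max_rounds: int = 4,                              # test-time scaling knob
) -> str:
    """Generate code once, then polish it over several rounds of visual feedback."""
    best_code = generate_code(target_screenshot)
    best_score = score(target_screenshot, render(best_code))
    for _ in range(max_rounds):
        rendered = render(best_code)
        candidate = polish_code(best_code, target_screenshot, rendered)
        candidate_score = score(target_screenshot, render(candidate))
        if candidate_score > best_score:              # keep the best candidate seen so far
            best_code, best_score = candidate, candidate_score
    return best_code
```

Raising `max_rounds` trades extra inference-time compute for more feedback iterations, which is the test-time scaling behavior the abstract refers to.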
Related papers
- FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback [92.67587639164908]
We present FronTalk, a benchmark for front-end code generation with multi-modal feedback. We focus on the front-end development task and curate FronTalk, a collection of 100 multi-turn dialogues. Evaluation of 20 models reveals two key challenges that remain systematically under-explored in the literature.
arXiv Detail & Related papers (2025-12-05T23:28:09Z) - WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation [30.193562985137813]
We propose WebVIA, the first agentic framework for interactive UI-to-Code generation and validation. The framework comprises three components: 1) an exploration agent that captures multi-state UI screenshots; 2) a UI2Code model that generates executable interactive code; 3) a validation module that verifies the interactivity.
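As a rough sketch of how those three components could fit together, the following pipeline stub wires an exploration agent, a UI2Code model, and a validation module; the callables and the UIState record are illustrative assumptions, not WebVIA's actual interfaces.

```python
# Sketch of a WebVIA-style explore -> generate -> validate pipeline.
from dataclasses import dataclass
from typing import Callable

@dataclass
class UIState:
    name: str          # e.g. "default", "menu_open"
    screenshot: bytes  # captured by the exploration agent

def webvia_style_pipeline(
    explore_states: Callable[[str], list[UIState]],     # URL -> multi-state screenshots
    ui_to_code: Callable[[list[UIState]], str],         # screenshots -> interactive code
    check_interaction: Callable[[str, UIState], bool],  # does the code reproduce this state?
    url: str,
) -> tuple[str, dict[str, bool]]:
    states = explore_states(url)                                    # 1) exploration agent
    code = ui_to_code(states)                                       # 2) UI2Code model
    report = {s.name: check_interaction(code, s) for s in states}   # 3) validation module
    return code, report
```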
arXiv Detail & Related papers (2025-11-09T06:58:52Z) - JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence [48.39202336809688]
We introduce a complete synthesis toolkit to efficiently produce a large-scale, high-quality corpus spanning from standard charts to complex interactive web UIs and code-driven animations. This powers the training of our models, JanusCoder and JanusCoderV, which establish a visual-programmatic interface for generating code from textual instructions, visual inputs, or a combination of both. Our 7B to 14B scale models approach or even exceed the performance of commercial models.
arXiv Detail & Related papers (2025-10-27T17:13:49Z) - ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents [40.697759330690815]
ScreenCoder is a modular multi-agent framework that decomposes the task into three interpretable stages: grounding, planning, and generation. By assigning these distinct responsibilities to specialized agents, our framework achieves significantly higher robustness and fidelity than end-to-end approaches. Our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness.
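A toy version of the grounding, planning, and generation split might pass an explicit component layout between stages, as sketched below; the data structures and stage functions are assumptions for illustration, not ScreenCoder's real interfaces.

```python
# Sketch of a grounding -> planning -> generation decomposition with an
# explicit intermediate layout tree. Field names and defaults are illustrative.
from dataclasses import dataclass, field

@dataclass
class Component:
    kind: str                        # e.g. "navbar", "button", "image"
    bbox: tuple[int, int, int, int]  # pixel box produced by the grounding stage
    children: list["Component"] = field(default_factory=list)

def plan_layout(components: list[Component]) -> Component:
    """Planning stage (toy): order detected components top-to-bottom under a page root."""
    root = Component(kind="page", bbox=(0, 0, 1440, 1024))
    root.children = sorted(components, key=lambda c: (c.bbox[1], c.bbox[0]))
    return root

def generate_html(node: Component, indent: int = 0) -> str:
    """Generation stage (toy): emit a div tree that mirrors the planned hierarchy."""
    pad = "  " * indent
    if not node.children:
        return f'{pad}<div class="{node.kind}"></div>'
    inner = "\n".join(generate_html(c, indent + 1) for c in node.children)
    return f'{pad}<div class="{node.kind}">\n{inner}\n{pad}</div>'
```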
arXiv Detail & Related papers (2025-07-30T16:41:21Z) - DesignCoder: Hierarchy-Aware and Self-Correcting UI Code Generation with Large Language Models [17.348284143568282]
DesignCoder is a novel hierarchy-aware and self-correcting automated code generation framework. We introduce UI Grouping Chains, which enhance MLLMs' capability to understand and predict complex nested UI hierarchies. We also incorporate a self-correction mechanism to improve the model's ability to identify and rectify errors in the generated code.
arXiv Detail & Related papers (2025-06-16T16:20:43Z) - UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that perceives only screenshots as input and performs human-like interactions. On the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively).
arXiv Detail & Related papers (2025-01-21T17:48:10Z) - Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
Aguvis is a vision-based framework for autonomous GUI agents. It standardizes cross-platform interactions and incorporates structured reasoning via inner monologue. It achieves state-of-the-art performance across offline and real-world online benchmarks.
arXiv Detail & Related papers (2024-12-05T18:58:26Z) - ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity.
We develop a vision-language-action model for the digital world, namely ShowUI, which features the following innovations.
ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z) - MIO: A Foundation Model on Multimodal Tokens [75.45159462076937]
We introduce MIO, a novel foundation model built on multimodal tokens. MIO is capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner.
arXiv Detail & Related papers (2024-09-26T09:57:16Z) - Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API [17.991044940694778]
We build a multimodal model to ground natural language instructions in given UI screenshots as a generic UI task automation executor.
To facilitate the exploitation of image-to-text pretrained knowledge, we follow the pixel-to-sequence paradigm.
Our proposed reinforced UI instruction grounding model outperforms the state-of-the-art methods by a clear margin.
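The pixel-to-sequence paradigm mentioned above treats screen coordinates as ordinary text tokens that the model emits and a parser decodes. The toy decoder below assumes a "<click x=... y=...>" tag format and a normalized 0-999 coordinate range; both conventions are assumptions for the sketch, not taken from the paper.

```python
# Toy illustration of pixel-to-sequence grounding: decode a predicted click
# from a text sequence back into pixel coordinates.
import re

def parse_click(prediction: str, width: int, height: int) -> tuple[int, int] | None:
    """Map a "<click x=... y=...>" tag with 0-999 normalized coordinates to pixels."""
    match = re.search(r"<click x=(\d+) y=(\d+)>", prediction)
    if match is None:
        return None
    x, y = int(match.group(1)), int(match.group(2))
    return round(x / 999 * (width - 1)), round(y / 999 * (height - 1))

# Example: a model output grounding an instruction on a 1920x1080 screenshot.
print(parse_click("<click x=950 y=37>", 1920, 1080))  # -> (1825, 40)
```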
arXiv Detail & Related papers (2023-10-07T07:22:41Z) - TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved multimodal instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z) - Language Models are General-Purpose Interfaces [109.45478241369655]
We propose to use language models as a general-purpose interface to various foundation models.
A collection of pretrained encoders perceives diverse modalities (such as vision and language).
We propose a semi-causal language modeling objective to jointly pretrain the interface and the modular encoders.
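One way to picture the semi-causal objective is as an attention mask that is bidirectional inside the spans handled by the modular encoders and causal everywhere else. The sketch below builds such a mask; it is an illustrative reading of the idea, not the paper's implementation.

```python
# Sketch of a semi-causal attention mask: causal overall, bidirectional
# within designated encoder-handled spans. Illustrative only.
import numpy as np

def semi_causal_mask(seq_len: int, spans: list[tuple[int, int]]) -> np.ndarray:
    """Return a boolean (query, key) mask; True means attention is allowed."""
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # start fully causal
    for start, end in spans:                                  # end is exclusive
        mask[start:end, start:end] = True                     # bidirectional inside the span
    return mask

# Example: a 6-token sequence whose first 3 tokens form a bidirectional span.
print(semi_causal_mask(6, [(0, 3)]).astype(int))
```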
arXiv Detail & Related papers (2022-06-13T17:34:22Z) - Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition [80.446770909975]
Linguistic knowledge is of great benefit to scene text recognition.
How to effectively model linguistic rules in end-to-end deep networks remains a research challenge.
We propose an autonomous, bidirectional and iterative ABINet for scene text recognition.
arXiv Detail & Related papers (2021-03-11T06:47:45Z)