ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents
- URL: http://arxiv.org/abs/2507.22827v2
- Date: Mon, 20 Oct 2025 14:13:43 GMT
- Title: ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents
- Authors: Yilei Jiang, Yaozhi Zheng, Yuxuan Wan, Jiaming Han, Qunzhong Wang, Michael R. Lyu, Xiangyu Yue
- Abstract summary: ScreenCoder is a modular multi-agent framework that decomposes the task into three interpretable stages: grounding, planning, and generation. By assigning these distinct responsibilities to specialized agents, our framework achieves significantly higher robustness and fidelity than end-to-end approaches. Our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While multimodal large language models (MLLMs) can translate images to code, they often fail on complex UIs, struggling to unify visual perception, layout planning, and code synthesis within a single monolithic model, which leads to frequent perception and planning errors. To address this, we propose ScreenCoder, a modular multi-agent framework that decomposes the task into three interpretable stages: grounding, planning, and generation. By assigning these distinct responsibilities to specialized agents, our framework achieves significantly higher robustness and fidelity than end-to-end approaches. Furthermore, ScreenCoder serves as a scalable data engine, enabling us to generate high-quality image-code pairs. We use this data to fine-tune an open-source MLLM via a dual-stage pipeline of supervised fine-tuning and reinforcement learning, demonstrating substantial gains in its UI generation capabilities. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness. Our code is made publicly available at https://github.com/leigest519/ScreenCoder.
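The grounding-planning-generation decomposition described in the abstract can be sketched as a simple three-stage pipeline. This is a minimal illustration, not the paper's implementation: the region names, bounding boxes, and function names below are assumptions, and in the real system each stage would be backed by a specialized multimodal agent rather than a stub.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A UI region detected in the input design."""
    name: str            # semantic role, e.g. "header" or "sidebar"
    bbox: tuple          # (x, y, width, height) in screenshot pixels


def grounding_agent(screenshot_path: str) -> list[Region]:
    # Stage 1 (stub): perceive the screenshot and localize major UI regions.
    return [
        Region("header", (0, 0, 1280, 80)),
        Region("sidebar", (0, 80, 240, 640)),
        Region("content", (240, 80, 1040, 640)),
    ]


def planning_agent(regions: list[Region]) -> dict:
    # Stage 2 (stub): arrange the grounded regions into a layout plan.
    return {"tag": "body", "children": [r.name for r in regions]}


def generation_agent(plan: dict) -> str:
    # Stage 3 (stub): synthesize front-end code from the layout plan.
    divs = "\n".join(f'  <div class="{c}"></div>' for c in plan["children"])
    return f"<{plan['tag']}>\n{divs}\n</{plan['tag']}>"


def screencoder_pipeline(screenshot_path: str) -> str:
    """Chain the three specialized agents into one visual-to-code pass."""
    return generation_agent(planning_agent(grounding_agent(screenshot_path)))


html = screencoder_pipeline("design.png")
print(html)
```

The point of the decomposition is that each stage can fail, be evaluated, and be improved independently, instead of one monolithic model absorbing perception, planning, and synthesis errors at once.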
Related papers
- ThinkGen: Generalized Thinking for Visual Generation [97.19923474851987]
ThinkGen is a think-driven visual generation framework that explicitly leverages Chain-of-Thought (CoT) reasoning in various generation scenarios. We propose a separable GRPO-based training paradigm, alternating reinforcement learning between the MLLM and DiT modules. Experiments demonstrate that ThinkGen achieves robust, state-of-the-art performance across multiple generation benchmarks.
arXiv Detail & Related papers (2025-12-29T16:08:50Z) - VSA: Visual-Structural Alignment for UI-to-Code [29.15071743239679]
We propose VSA (Visual-Structural Alignment), a multi-stage paradigm designed to synthesize organized assets through visual-text alignment. Our framework yields a substantial improvement in code modularity and architectural consistency over state-of-the-art benchmarks.
arXiv Detail & Related papers (2025-12-23T03:55:45Z) - VinciCoder: Unifying Multimodal Code Generation via Coarse-to-fine Visual Reinforcement Learning [13.193184888476404]
We introduce VinciCoder, a unified multimodal code generation model. We begin by constructing a large-scale Supervised Finetuning (SFT) corpus comprising 1.6M image-code pairs. We then introduce a Visual Reinforcement Learning (ViRL) strategy, which employs a coarse-to-fine reward mechanism to improve visual fidelity.
arXiv Detail & Related papers (2025-11-01T04:05:26Z) - JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence [48.39202336809688]
We introduce a complete synthesis toolkit to efficiently produce a large-scale, high-quality corpus spanning from standard charts to complex interactive web UIs and code-driven animations. This powers the training of our models, JanusCoder and JanusCoderV, which establish a visual-programmatic interface for generating code from textual instructions, visual inputs, or a combination of both. Our 7B to 14B scale models approach or even exceed the performance of commercial models.
arXiv Detail & Related papers (2025-10-27T17:13:49Z) - VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models [82.05514464090172]
Multimodal large language models (MLLMs) have significantly advanced the integration of visual and textual understanding. However, their ability to generate code from multimodal inputs remains limited. We introduce VisCodex, a unified framework that seamlessly merges vision and coding language models.
arXiv Detail & Related papers (2025-08-13T17:00:44Z) - DesignCoder: Hierarchy-Aware and Self-Correcting UI Code Generation with Large Language Models [17.348284143568282]
DesignCoder is a novel hierarchy-aware and self-correcting automated code generation framework. We introduce UI Grouping Chains, which enhance MLLMs' capability to understand and predict complex nested UI hierarchies. We also incorporate a self-correction mechanism to improve the model's ability to identify and rectify errors in the generated code.
arXiv Detail & Related papers (2025-06-16T16:20:43Z) - PreGenie: An Agentic Framework for High-quality Visual Presentation Generation [25.673526096069548]
PreGenie is an agentic and modular framework powered by multimodal large language models (MLLMs) for generating high-quality visual presentations. It operates in two stages: (1) Analysis and Initial Generation, which summarizes multimodal input and generates initial code, and (2) Review and Re-generation, which iteratively reviews intermediate code and rendered slides to produce final, high-quality presentations.
arXiv Detail & Related papers (2025-05-27T18:36:19Z) - UICopilot: Automating UI Synthesis via Hierarchical Code Generation from Webpage Designs [43.006316221657904]
This paper proposes a novel approach to automating the synthesis of User Interfaces (UIs) via hierarchical code generation from webpage designs. The core idea of UICopilot is to decompose the generation process into two stages: first, generating the coarse-grained HTML structure, followed by the generation of fine-grained code. Experimental results demonstrate that UICopilot significantly outperforms existing baselines in both automatic evaluation and human evaluations.
arXiv Detail & Related papers (2025-05-15T02:09:54Z) - UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding [84.87802580670579]
We introduce UniToken, an auto-regressive generation model that encodes visual inputs through a combination of discrete and continuous representations. Our unified visual encoding framework captures both high-level semantics and low-level details, delivering multidimensional information.
arXiv Detail & Related papers (2025-04-06T09:20:49Z) - Boosting Chart-to-Code Generation in MLLM via Dual Preference-Guided Refinement [16.22363384653305]
Chart-to-code generation requires Multimodal Large Language Models (MLLMs) to perform fine-grained visual parsing, precise code synthesis, and robust cross-modal reasoning. We propose a dual preference-guided refinement framework that combines a feedback-driven, dual-modality reward mechanism with iterative preference learning. Our framework significantly enhances the performance of general-purpose open-source MLLMs, enabling them to generate high-quality plotting code.
arXiv Detail & Related papers (2025-04-03T07:51:20Z) - From PowerPoint UI Sketches to Web-Based Applications: Pattern-Driven Code Generation for GIS Dashboard Development Using Knowledge-Augmented LLMs, Context-Aware Visual Prompting, and the React Framework [1.4367082420201918]
This paper introduces a knowledge-augmented code generation framework for complex GIS applications. The framework retrieves software engineering best practices, domain knowledge, and advanced technology stacks from a specialized knowledge base.
arXiv Detail & Related papers (2025-02-12T19:59:57Z) - SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding [66.74446220401296]
We propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation. We introduce the token folding mechanism and the vision-expert-based progressive alignment pretraining strategy, which effectively support high-resolution image understanding. Our code and models shall be released.
arXiv Detail & Related papers (2024-12-12T18:59:26Z) - MageBench: Bridging Large Multimodal Models to Agents [90.59091431806793]
LMMs have shown impressive visual understanding capabilities, with the potential to be applied in agents. Existing benchmarks mostly assess their reasoning abilities on the language side. MageBench is a reasoning-capability-oriented multimodal agent benchmark.
arXiv Detail & Related papers (2024-12-05T17:08:19Z) - Automatically Generating UI Code from Screenshot: A Divide-and-Conquer-Based Approach [51.522121376987634]
We propose DCGen, a divide-and-conquer-based approach to automate the translation of webpage design to UI code. We show that DCGen achieves up to a 15% improvement in visual similarity and 8% in code similarity for large input images. Human evaluations show that DCGen can help developers implement webpages significantly faster and more similar to the UI designs.
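As a rough illustration of the divide-and-conquer idea behind approaches like DCGen (this is not DCGen's actual segmentation algorithm; the even-split heuristic and helper names are assumptions), one can split a tall design into segments, generate code per segment, and then assemble the parts:

```python
def split_into_segments(height: int, n: int) -> list[tuple[int, int]]:
    # Divide: cut the screenshot into n equal horizontal bands
    # (a real system would segment along detected visual boundaries).
    step = height // n
    return [(i * step, (i + 1) * step) for i in range(n)]


def generate_segment_code(segment: tuple[int, int]) -> str:
    # Conquer (stub): generate markup for one band; an MLLM would
    # produce this from the cropped sub-image in practice.
    top, bottom = segment
    return f'<section style="height:{bottom - top}px"></section>'


def assemble(parts: list[str]) -> str:
    # Combine: stitch per-segment code back into a full page.
    return "<body>\n" + "\n".join(parts) + "\n</body>"


segments = split_into_segments(720, 3)
page = assemble([generate_segment_code(s) for s in segments])
print(page)
```

Working on smaller crops keeps each generation call within the model's reliable perception range, which is why such approaches report larger gains on large input images.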
arXiv Detail & Related papers (2024-06-24T07:58:36Z) - PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM [58.67882997399021]
Our research introduces a unified framework for automated graphic layout generation. Our data-driven method employs structured text (JSON format) and visual instruction tuning to generate layouts. We develop an automated text-to-poster system that generates editable posters based on users' design intentions.
arXiv Detail & Related papers (2024-06-05T03:05:52Z) - Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering [74.99736967448423]
We construct Design2Code - the first real-world benchmark for this task. We manually curate 484 diverse real-world webpages as test cases and develop a set of automatic evaluation metrics. Our fine-grained break-down metrics indicate that models mostly lag in recalling visual elements from the input webpages and generating correct layout designs.
arXiv Detail & Related papers (2024-03-05T17:56:27Z) - Reinforced UI Instruction Grounding: Towards a Generic UI Task
Automation API [17.991044940694778]
We build a multimodal model to ground natural language instructions in given UI screenshots as a generic UI task automation executor.
To facilitate the exploitation of image-to-text pretrained knowledge, we follow the pixel-to-sequence paradigm.
Our proposed reinforced UI instruction grounding model outperforms the state-of-the-art methods by a clear margin.
arXiv Detail & Related papers (2023-10-07T07:22:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.