Automatically Generating UI Code from Screenshot: A Divide-and-Conquer-Based Approach
- URL: http://arxiv.org/abs/2406.16386v3
- Date: Fri, 25 Apr 2025 15:12:34 GMT
- Title: Automatically Generating UI Code from Screenshot: A Divide-and-Conquer-Based Approach
- Authors: Yuxuan Wan, Chaozheng Wang, Yi Dong, Wenxuan Wang, Shuqing Li, Yintong Huo, Michael R. Lyu
- Abstract summary: We propose DCGen, a divide-and-conquer-based approach to automate the translation of webpage design to UI code. We show that DCGen achieves up to a 15% improvement in visual similarity and 8% in code similarity for large input images. Human evaluations show that DCGen helps developers implement webpages significantly faster and with greater similarity to the UI designs.
- Score: 51.522121376987634
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Websites are critical in today's digital world, with over 1.11 billion currently active and approximately 252,000 new sites launched daily. Converting website layout design into functional UI code is a time-consuming yet indispensable step of website development. Manual methods of converting visual designs into functional code present significant challenges, especially for non-experts. To explore automatic design-to-code solutions, we first conduct a motivating study on GPT-4o and identify three types of issues in generating UI code: element omission, element distortion, and element misarrangement. We further reveal that a focus on smaller visual segments can help multimodal large language models (MLLMs) mitigate these failures in the generation process. In this paper, we propose DCGen, a divide-and-conquer-based approach to automate the translation of webpage design to UI code. DCGen starts by dividing screenshots into manageable segments, generating code for each segment, and then reassembling them into complete UI code for the entire screenshot. We conduct extensive testing with a dataset comprising real-world websites and various MLLMs and demonstrate that DCGen achieves up to a 15% improvement in visual similarity and 8% in code similarity for large input images. Human evaluations show that DCGen helps developers implement webpages significantly faster and with greater similarity to the UI designs. To the best of our knowledge, DCGen is the first segment-aware MLLM-based approach for generating UI code directly from screenshots.
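Read as an algorithm, the abstract describes a three-step pipeline: divide the screenshot into segments, generate code for each segment with an MLLM, then reassemble the pieces into a full page. The Python sketch below illustrates only that control flow; the fixed-strip segmentation, the prompts, and the `query_mllm` wrapper are illustrative assumptions, not the paper's implementation (DCGen divides along detected visual separators).

```python
# Minimal sketch of a divide / generate / assemble pipeline in the spirit of
# DCGen. The segmentation heuristic, prompts, and `query_mllm` helper are
# placeholders for illustration only.
from typing import List
from PIL import Image  # pip install pillow

def divide(screenshot_path: str, num_strips: int = 4) -> List[Image.Image]:
    """Naively split the screenshot into horizontal strips.

    DCGen detects visual separators to choose segment boundaries; fixed
    strips stand in for that step here.
    """
    img = Image.open(screenshot_path)
    strip_height = img.height // num_strips
    strips = []
    for i in range(num_strips):
        top = i * strip_height
        bottom = img.height if i == num_strips - 1 else (i + 1) * strip_height
        strips.append(img.crop((0, top, img.width, bottom)))
    return strips

def query_mllm(prompt: str, image: Image.Image) -> str:
    """Hypothetical wrapper around a multimodal LLM call (e.g., GPT-4o)."""
    raise NotImplementedError("plug in your MLLM client here")

def generate_segment_code(segment: Image.Image) -> str:
    # Ask the model for self-contained code reproducing one visual segment.
    prompt = ("Write self-contained HTML/CSS that reproduces this UI region. "
              "Return only the code.")
    return query_mllm(prompt, segment)

def assemble(screenshot_path: str, parts: List[str]) -> str:
    # Merge per-segment code into one page, conditioned on the full
    # screenshot so the overall layout can be restored.
    prompt = ("Combine these HTML/CSS fragments into a single webpage that "
              "matches the attached full screenshot:\n\n" + "\n\n".join(parts))
    return query_mllm(prompt, Image.open(screenshot_path))

def screenshot_to_code(screenshot_path: str) -> str:
    parts = [generate_segment_code(seg) for seg in divide(screenshot_path)]
    return assemble(screenshot_path, parts)
```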
Related papers
- ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents [35.10813247827737]
We introduce a modular multi-agent framework that performs user interface-to-code generation in three interpretable stages. The framework improves robustness, interpretability, and fidelity over end-to-end black-box methods. Our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness.
arXiv Detail & Related papers (2025-07-30T16:41:21Z) - MLLM-Based UI2Code Automation Guided by UI Layout Information [17.177322441575196]
We propose a novel MLLM-based framework that generates UI code from real-world webpage images and comprises three key modules. For evaluation, we build Snap2Code, a new benchmark dataset covering 350 real-world websites.
arXiv Detail & Related papers (2025-06-12T06:04:16Z) - UICopilot: Automating UI Synthesis via Hierarchical Code Generation from Webpage Designs [43.006316221657904]
This paper proposes a novel approach to automating the synthesis of User Interfaces (UIs) via hierarchical code generation from webpage designs. The core idea of UICopilot is to decompose the generation process into two stages: first generating the coarse-grained HTML structure, followed by the generation of fine-grained code. Experimental results demonstrate that UICopilot significantly outperforms existing baselines in both automatic and human evaluations.
arXiv Detail & Related papers (2025-05-15T02:09:54Z) - Harnessing Webpage UIs for Text-Rich Visual Understanding [112.01029887404296]
We propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs).
These instructions are then paired with UI screenshots to train multimodal models.
We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts.
arXiv Detail & Related papers (2024-10-17T17:48:54Z) - Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs [112.89665642941814]
Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio.
Current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code.
We propose a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning.
arXiv Detail & Related papers (2024-06-28T17:59:46Z) - VISION2UI: A Real-World Dataset with Layout for Code Generation from UI Designs [29.80918775422563]
We present a novel dataset, termed VISION2UI, extracted from real-world scenarios, augmented with comprehensive layout information.
This dataset is derived through a series of operations, encompassing collecting, cleaning, and filtering of the open-source Common Crawl dataset.
Ultimately, this process yields a dataset comprising 2,000 parallel samples encompassing design visions and UI code.
arXiv Detail & Related papers (2024-04-09T15:05:48Z) - Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset [8.581656334758547]
We introduce WebSight, a dataset consisting of 2 million pairs of HTML codes and their corresponding screenshots.
To accelerate the research in this area, we open-source WebSight.
arXiv Detail & Related papers (2024-03-14T01:40:40Z) - Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering [74.99736967448423]
We construct Design2Code - the first real-world benchmark for this task.
We manually curate 484 diverse real-world webpages as test cases and develop a set of automatic evaluation metrics.
Our fine-grained break-down metrics indicate that models mostly lag in recalling visual elements from the input webpages and generating correct layout designs.
arXiv Detail & Related papers (2024-03-05T17:56:27Z) - Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API [17.991044940694778]
We build a multimodal model to ground natural language instructions in given UI screenshots as a generic UI task automation executor.
To facilitate the exploitation of image-to-text pretrained knowledge, we follow the pixel-to-sequence paradigm.
Our proposed reinforced UI instruction grounding model outperforms the state-of-the-art methods by a clear margin.
arXiv Detail & Related papers (2023-10-07T07:22:41Z) - Learning UI-to-Code Reverse Generator Using Visual Critic Without Rendering [18.74127660489501]
We propose a novel vision-code transformer (ViCT) composed of a vision encoder processing the screenshots and a language decoder to generate the code.
Both components are initialized from pre-trained models, but aligning the two modalities requires end-to-end fine-tuning.
ViCT can achieve performance comparable to using a larger decoder such as LLaMA.
arXiv Detail & Related papers (2023-05-24T02:17:32Z) - Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization [73.52943587514386]
Existing vector quantization (VQ) based autoregressive models follow a two-stage generation paradigm.
We propose a novel two-stage framework: (1) Dynamic-Quantization VAE (DQ-VAE), which encodes image regions into variable-length codes based on their information densities for accurate representation.
arXiv Detail & Related papers (2023-05-19T14:56:05Z) - Transformer-Based Visual Segmentation: A Survey [118.01564082499948]
Visual segmentation seeks to partition images, video frames, or point clouds into multiple segments or groups.
Transformers are a type of neural network based on self-attention originally designed for natural language processing.
Transformers offer robust, unified, and even simpler solutions for various segmentation tasks.
arXiv Detail & Related papers (2023-04-19T17:59:02Z) - Sketch2FullStack: Generating Skeleton Code of Full Stack Website and Application from Sketch using Deep Learning and Computer Vision [2.422788410602121]
Designing a large website and converting it to code typically requires a team of experienced developers.
Automating this step would save valuable resources and speed up the overall development process.
arXiv Detail & Related papers (2022-11-26T16:32:13Z) - HAT: Hierarchical Aggregation Transformers for Person Re-identification [87.02828084991062]
We take advantage of both CNNs and Transformers for high-performance image-based person Re-ID.
This work is the first to combine the strengths of CNNs and Transformers for image-based person Re-ID.
arXiv Detail & Related papers (2021-07-13T09:34:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.