Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs
- URL: http://arxiv.org/abs/2512.19918v1
- Date: Mon, 22 Dec 2025 22:45:39 GMT
- Title: Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs
- Authors: Houston H. Zhang, Tao Zhang, Baoze Lin, Yuanqi Xue, Yincheng Zhu, Huan Liu, Li Gu, Linfeng Ye, Ziqiang Wang, Xinxin Zuo, Yang Wang, Yuanhao Yu, Zhixiang Chi
- Abstract summary: We formalize the Widget-to-Code (Widget2Code) setting and introduce an image-only widget benchmark with fine-grained, multi-dimensional evaluation metrics. Benchmarking shows that although generalized multimodal large language models (MLLMs) outperform specialized UI2Code methods, they still produce unreliable and visually inconsistent code. At the perceptual level, we follow widget design principles to assemble atomic components into complete layouts, equipped with icon retrieval and reusable visualization modules.
- Score: 28.028216548288725
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: User interface to code (UI2Code) aims to generate executable code that can faithfully reconstruct a given input UI. Prior work focuses largely on web pages and mobile screens, leaving app widgets underexplored. Unlike web or mobile UIs with rich hierarchical context, widgets are compact, context-free micro-interfaces that summarize key information through dense layouts and iconography under strict spatial constraints. Moreover, while (image, code) pairs are widely available for web or mobile UIs, widget designs are proprietary and lack accessible markup. We formalize this setting as the Widget-to-Code (Widget2Code) task and introduce an image-only widget benchmark with fine-grained, multi-dimensional evaluation metrics. Benchmarking shows that although generalized multimodal large language models (MLLMs) outperform specialized UI2Code methods, they still produce unreliable and visually inconsistent code. To address these limitations, we develop a baseline that jointly advances perceptual understanding and structured code generation. At the perceptual level, we follow widget design principles to assemble atomic components into complete layouts, equipped with icon retrieval and reusable visualization modules. At the system level, we design an end-to-end infrastructure, WidgetFactory, which includes a framework-agnostic widget-tailored domain-specific language (WidgetDSL) and a compiler that translates it into multiple front-end implementations (e.g., React, HTML/CSS). An adaptive rendering module further refines spatial dimensions to satisfy compactness constraints. Together, these contributions substantially enhance visual fidelity, establishing a strong baseline and unified infrastructure for future Widget2Code research.
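The abstract does not specify WidgetDSL or the WidgetFactory compiler in detail, so the sketch below is only a minimal illustration of the general idea it describes: a framework-agnostic component tree compiled into one concrete front-end target (plain HTML/CSS here). Every node kind, field name, and the `render` function are hypothetical assumptions, not the paper's actual DSL.

```python
# Minimal sketch of a widget DSL compiled to HTML/CSS, in the spirit of
# the WidgetFactory pipeline described in the abstract. All names and
# node kinds below ("stack", "text", "icon") are hypothetical.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One hypothetical DSL node: a tagged box with properties and children."""
    kind: str                                   # e.g. "stack", "text", "icon"
    props: dict = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


def render(node: Node) -> str:
    """Compile a DSL node into HTML/CSS, one of several possible targets."""
    if node.kind == "text":
        return f'<span>{node.props.get("value", "")}</span>'
    if node.kind == "icon":
        # The paper mentions an icon-retrieval module; here we only emit a
        # placeholder class name instead of retrieving a real asset.
        return f'<i class="icon-{node.props.get("name", "unknown")}"></i>'
    if node.kind == "stack":
        direction = node.props.get("direction", "column")
        inner = "".join(render(c) for c in node.children)
        return f'<div style="display:flex;flex-direction:{direction}">{inner}</div>'
    raise ValueError(f"unknown node kind: {node.kind}")


if __name__ == "__main__":
    # A compact weather-style widget: an icon next to two lines of text.
    widget = Node("stack", {"direction": "row"}, [
        Node("icon", {"name": "sun"}),
        Node("stack", {"direction": "column"}, [
            Node("text", {"value": "23°C"}),
            Node("text", {"value": "Sunny"}),
        ]),
    ])
    print(render(widget))
```

A compiler of this shape is framework-agnostic in the sense the abstract suggests: the same `Node` tree could be walked by a second backend emitting React components instead of HTML strings, and an adaptive-rendering pass could adjust the flex dimensions to meet compactness constraints.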
Related papers
- ComUICoder: Component-based Reusable UI Code Generation for Complex Websites via Semantic Segmentation and Element-wise Feedback [38.10354940578983]
We introduce ComUICoder, a semantic-aware code generation tool for complex websites. ComUICoder significantly improves overall generation quality and code reusability on complex multipage websites.
arXiv Detail & Related papers (2026-02-22T17:17:16Z) - UISearch: Graph-Based Embeddings for Multimodal Enterprise UI Screenshots Retrieval [1.3563834727527375]
We present a novel graph-based representation that converts UI screenshots into attributed graphs encoding hierarchical relationships and spatial arrangements. A contrastive graph autoencoder learns embeddings preserving multi-level similarity across visual, structural, and semantic properties. We implement this representation in UISearch, a multi-modal search framework that combines structural embeddings with semantic search through a composable query language.
arXiv Detail & Related papers (2025-11-24T18:20:08Z) - ConsistCompose: Unified Multimodal Layout Control for Image Composition [56.909072845166264]
We present ConsistCompose, a unified framework that embeds layout coordinates directly into language prompts. We show that ConsistCompose substantially improves spatial accuracy over layout-controlled baselines.
arXiv Detail & Related papers (2025-11-23T08:14:53Z) - DisCo-Layout: Disentangling and Coordinating Semantic and Physical Refinement in a Multi-Agent Framework for 3D Indoor Layout Synthesis [76.7196710324494]
3D indoor layout synthesis is crucial for creating virtual environments. DisCo-Layout is a novel framework that disentangles and coordinates physical and semantic refinement.
arXiv Detail & Related papers (2025-10-02T16:30:37Z) - ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents [40.697759330690815]
ScreenCoder is a modular multi-agent framework that decomposes the task into three interpretable stages: grounding, planning, and generation. By assigning these distinct responsibilities to specialized agents, the framework achieves significantly higher robustness and fidelity than end-to-end approaches, with state-of-the-art performance in layout accuracy, structural coherence, and code correctness.
arXiv Detail & Related papers (2025-07-30T16:41:21Z) - CAL-RAG: Retrieval-Augmented Multi-Agent Generation for Content-Aware Layout Design [6.830055289299306]
CAL-RAG is a retrieval-augmented, agentic framework for content-aware layout generation. The framework is implemented using LangGraph and evaluated on a benchmark rich in semantic variability. Results demonstrate that combining retrieval augmentation with agentic multi-step reasoning yields a scalable, interpretable, and high-fidelity solution.
arXiv Detail & Related papers (2025-06-27T06:09:56Z) - MLLM-Based UI2Code Automation Guided by UI Layout Information [17.177322441575196]
We propose a novel MLLM-based framework that generates UI code from real-world webpage images and comprises three key modules. For evaluation, we build Snap2Code, a new benchmark dataset of 350 real-world websites.
arXiv Detail & Related papers (2025-06-12T06:04:16Z) - GLDesigner: Leveraging Multi-Modal LLMs as Designer for Enhanced Aesthetic Text Glyph Layouts [53.568057283934714]
We propose a Vision-Language Model (VLM)-based framework that generates content-aware text logo layouts. We introduce two model techniques that reduce the computational cost of processing multiple glyph images simultaneously. To support instruction tuning of our model, we construct two extensive text logo datasets that are five times larger than existing public datasets.
arXiv Detail & Related papers (2024-11-18T10:04:10Z) - Harnessing Webpage UIs for Text-Rich Visual Understanding [112.01029887404296]
We propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs).
These instructions are then paired with UI screenshots to train multimodal models.
We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts.
arXiv Detail & Related papers (2024-10-17T17:48:54Z) - LayoutNUWA: Revealing the Hidden Layout Expertise of Large Language Models [84.16541551923221]
We propose a model that treats layout generation as a code generation task to enhance semantic information.
We develop a Code Instruct Tuning (CIT) approach comprising three interconnected modules.
We attain significant state-of-the-art performance on multiple datasets.
arXiv Detail & Related papers (2023-09-18T06:35:10Z) - VINS: Visual Search for Mobile User Interface Design [66.28088601689069]
This paper introduces VINS, a visual search framework that takes a UI image as input and retrieves visually similar design examples.
The framework achieves a mean Average Precision of 76.39% for UI detection and high performance in querying similar UI designs.
arXiv Detail & Related papers (2021-02-10T01:46:33Z)