Bridging Design and Development with Automated Declarative UI Code Generation
- URL: http://arxiv.org/abs/2409.11667v1
- Date: Wed, 18 Sep 2024 03:04:12 GMT
- Title: Bridging Design and Development with Automated Declarative UI Code Generation
- Authors: Ting Zhou, Yanjie Zhao, Xinyi Hou, Xiaoyu Sun, Kai Chen, Haoyu Wang
- Abstract summary: Declarative UI frameworks have gained widespread adoption in mobile app development, offering benefits such as improved code readability and easier maintenance.
Recent advancements in multimodal large language models (MLLMs) have shown promise in directly generating mobile app code from user interface (UI) designs.
We propose DeclarUI, an automated approach that synergizes computer vision (CV), MLLMs, and iterative compiler-driven optimization to generate and refine declarative UI code from designs.
- Score: 18.940075474582564
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Declarative UI frameworks have gained widespread adoption in mobile app development, offering benefits such as improved code readability and easier maintenance. Despite these advantages, the process of translating UI designs into functional code remains challenging and time-consuming. Recent advancements in multimodal large language models (MLLMs) have shown promise in directly generating mobile app code from user interface (UI) designs. However, the direct application of MLLMs to this task is limited by challenges in accurately recognizing UI components and comprehensively capturing interaction logic. To address these challenges, we propose DeclarUI, an automated approach that synergizes computer vision (CV), MLLMs, and iterative compiler-driven optimization to generate and refine declarative UI code from designs. DeclarUI enhances visual fidelity, functional completeness, and code quality through precise component segmentation, Page Transition Graphs (PTGs) for modeling complex inter-page relationships, and iterative optimization. In our evaluation, DeclarUI outperforms baselines on React Native, a widely adopted declarative UI framework, achieving a 96.8% PTG coverage rate and a 98% compilation success rate. Notably, DeclarUI demonstrates significant improvements over state-of-the-art MLLMs, with a 123% increase in PTG coverage rate, up to 55% enhancement in visual similarity scores, and a 29% boost in compilation success rate. We further demonstrate DeclarUI's generalizability through successful applications to Flutter and ArkUI frameworks.
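To make the pipeline above concrete, the TypeScript sketch below illustrates the two artifacts the abstract emphasizes: a Page Transition Graph (PTG) modeling inter-page relationships, and an iterative compile-and-repair loop driven by compiler feedback. All type and function names here are illustrative assumptions, not DeclarUI's actual interfaces; the stubs stand in for the MLLM call and the React Native toolchain.

```typescript
// Illustrative sketch only: the names below (PageTransitionGraph, generateCode,
// repairWithErrors, ...) are assumptions for exposition, not DeclarUI's actual API.
// It models a Page Transition Graph (PTG) for inter-page navigation plus an
// iterative compile-and-repair loop that feeds compiler diagnostics back to the MLLM.

interface PageNode {
  id: string;             // e.g. "LoginScreen"
  components: string[];   // UI components found by the CV segmentation stage
}

interface PageTransition {
  from: string;           // source page id
  to: string;             // target page id
  trigger: string;        // e.g. "onPress: LoginButton"
}

interface PageTransitionGraph {
  pages: PageNode[];
  transitions: PageTransition[];
}

interface CompileResult {
  success: boolean;
  errors: string[];
}

// Placeholder stand-ins for the MLLM call and the React Native toolchain.
async function generateCode(design: string, ptg: PageTransitionGraph): Promise<string> {
  return `// code generated from ${design}, covering ${ptg.pages.length} pages`;
}
async function compile(code: string): Promise<CompileResult> {
  return { success: code.length > 0, errors: [] };
}
async function repairWithErrors(code: string, errors: string[]): Promise<string> {
  return code + `\n// repaired against: ${errors.join("; ")}`;
}

// Iterative compiler-driven optimization: regenerate until the code compiles
// or the iteration budget is exhausted.
async function generateDeclarativeUI(
  design: string,
  ptg: PageTransitionGraph,
  maxIterations = 5,
): Promise<string> {
  let code = await generateCode(design, ptg);
  for (let i = 0; i < maxIterations; i++) {
    const result = await compile(code);
    if (result.success) return code;
    code = await repairWithErrors(code, result.errors); // feed diagnostics back to the MLLM
  }
  return code; // best effort after exhausting the budget
}
```

A caller would build the PTG from the segmented designs and then invoke generateDeclarativeUI; per the abstract, the analogous loop is applied to React Native and generalizes to Flutter and ArkUI targets.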
Related papers
- ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents [35.10813247827737]
We introduce a modular multi-agent framework that performs user interface-to-code generation in three interpretable stages.
The framework improves robustness, interpretability, and fidelity over end-to-end black-box methods.
Our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness.
arXiv Detail & Related papers (2025-07-30T16:41:21Z) - DesignCoder: Hierarchy-Aware and Self-Correcting UI Code Generation with Large Language Models [17.348284143568282]
DesignCoder is a novel hierarchy-aware and self-correcting automated code generation framework.
We introduce UI Grouping Chains, which enhance MLLMs' capability to understand and predict complex nested UI hierarchies.
We also incorporate a self-correction mechanism to improve the model's ability to identify and rectify errors in the generated code.
arXiv Detail & Related papers (2025-06-16T16:20:43Z) - Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition [16.68658893305642]
Handwritten Mathematical Expression Recognition (HMER) remains a persistent challenge in Optical Character Recognition (OCR).
We introduce Uni-MuMER, which fully fine-tunes a vision-language model for the HMER task without modifying its architecture.
Our method integrates three data-driven tasks: Tree-Aware Chain-of-Thought (Tree-CoT) for structured spatial reasoning, Error-Driven Learning (EDL) for reducing confusion among visually similar characters, and Symbol Counting (SC) for improving recognition consistency in long expressions.
arXiv Detail & Related papers (2025-05-29T15:41:00Z) - Advancing vision-language models in front-end development via data synthesis [30.287628180320137]
We propose a reflective agentic workflow that synthesizes high-quality image-text data to capture the diverse characteristics of front-end development.
This workflow automates the extraction of self-contained code snippets from real-world projects, renders the corresponding visual outputs, and generates detailed descriptions that link design elements to functional code.
We build a large vision-language model, Flame, trained on the synthesized datasets and demonstrate its effectiveness in generating React code via the pass@k metric.
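For context, pass@k is the standard unbiased estimator used in the code-generation literature: with n sampled solutions per task, of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over tasks. The TypeScript snippet below computes it in the usual numerically stable product form; it is a generic illustration, not code from the Flame paper.

```typescript
// Standard unbiased pass@k estimator: pass@k = 1 - C(n - c, k) / C(n, k),
// where n = samples drawn per task and c = samples that pass the tests.
function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1.0; // fewer than k failing samples: any k-subset must contain a pass
  let failAll = 1.0;
  // Numerically stable form: C(n-c, k) / C(n, k) = prod_{i=n-c+1..n} (1 - k / i).
  for (let i = n - c + 1; i <= n; i++) {
    failAll *= 1 - k / i;
  }
  return 1 - failAll;
}

// Example: 20 samples per task, 5 of them passing, estimate pass@10.
console.log(passAtK(20, 5, 10).toFixed(3));
```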
arXiv Detail & Related papers (2025-03-03T14:54:01Z) - Comparative Analysis of Large Language Models for Context-Aware Code Completion using SAFIM Framework [5.312946761836463]
Large Language Models (LLMs) have revolutionized code completion, transforming it into a more intelligent and context-aware feature.
This study evaluates the performance of several chat-based LLMs, including Gemini 1.5 Flash, Gemini 1.5 Pro, GPT-4o, GPT-4o-mini, and GPT-4 Turbo.
arXiv Detail & Related papers (2025-02-21T06:32:31Z) - ContextFormer: Redefining Efficiency in Semantic Segmentation [48.81126061219231]
Convolutional methods, although capturing local dependencies well, struggle with long-range relationships.
Vision Transformers (ViTs) excel in global context capture but are hindered by high computational demands.
We propose ContextFormer, a hybrid framework leveraging the strengths of CNNs and ViTs in the bottleneck to balance efficiency, accuracy, and robustness for real-time semantic segmentation.
arXiv Detail & Related papers (2025-01-31T16:11:04Z) - Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining [67.87810796668981]
Iris introduces Information-Sensitive Cropping (ISC) and Self-Refining Dual Learning (SRDL).
Iris achieves state-of-the-art performance across multiple benchmarks with only 850K GUI annotations.
These improvements translate to significant gains in both web and OS agent downstream tasks.
arXiv Detail & Related papers (2024-12-13T18:40:10Z) - ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity.
We develop a vision-language-action model for the digital world, namely ShowUI, which features the following innovations.
ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z) - Explainable Behavior Cloning: Teaching Large Language Model Agents through Learning by Demonstration [15.786934863494785]
We propose an Explainable Behavior Cloning LLM Agent (EBC-LLMAgent), a novel approach that combines large language models (LLMs) with behavior cloning by learning from demonstrations.
EBC-LLMAgent consists of three core modules: Demonstration Developing, Code Generation, and UI Mapping, which work synergistically to capture user demonstrations, generate executable code, and establish accurate correspondence between code and UI elements.
Extensive experiments on five popular mobile applications from diverse domains demonstrate the superior performance of EBC-LLMAgent, achieving high success rates in task completion, efficient generalization to unseen scenarios, and the generation of meaningful...
arXiv Detail & Related papers (2024-10-30T11:14:33Z) - EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings.
EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z) - A Rule-Based Approach for UI Migration from Android to iOS [11.229343760409044]
We propose a novel approach called GUIMIGRATOR, which enables the cross-platform migration of existing Android app UIs to iOS.
GUIMIGRATOR extracts and parses Android UI layouts, views, and resources to construct a UI skeleton tree.
GUIMIGRATOR generates the final UI code files utilizing target code templates, which are then compiled and validated on the iOS development platform.
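As a rough illustration of what such a rule-based UI migration involves, the TypeScript sketch below builds a tiny UI skeleton tree and expands per-view templates into SwiftUI-flavored output. The node shape, mapping table, and templates are simplified assumptions for exposition, not GUIMIGRATOR's actual rules or target code.

```typescript
// Illustrative sketch of a rule-based Android-to-iOS migration; the types,
// mapping table, and templates below are assumptions for exposition only.

interface SkeletonNode {
  androidView: string;                 // e.g. "TextView", "LinearLayout"
  attributes: Record<string, string>;  // parsed from the Android layout XML
  children: SkeletonNode[];
}

// Rule table: Android view type -> SwiftUI-style counterpart (simplified).
const viewMapping: Record<string, string> = {
  TextView: "Text",
  Button: "Button",
  ImageView: "Image",
  LinearLayout: "VStack",
};

// Walk the skeleton tree and expand a target-code template per node.
// Heavily simplified: real templates would also emit actions, modifiers,
// resources, and layout constraints.
function emitTargetCode(node: SkeletonNode, indent = ""): string {
  const target = viewMapping[node.androidView] ?? "AnyView";
  const label = node.attributes["text"] ?? "";
  if (node.children.length === 0) {
    return `${indent}${target}("${label}")`;
  }
  const body = node.children.map(c => emitTargetCode(c, indent + "  ")).join("\n");
  return `${indent}${target} {\n${body}\n${indent}}`;
}

// Example: a LinearLayout containing a label and a button.
const tree: SkeletonNode = {
  androidView: "LinearLayout",
  attributes: {},
  children: [
    { androidView: "TextView", attributes: { text: "Welcome" }, children: [] },
    { androidView: "Button", attributes: { text: "Sign in" }, children: [] },
  ],
};
console.log(emitTargetCode(tree));
```

A real rule set would also have to translate layout attributes, resources, and orientation (a horizontal LinearLayout, for instance, would not map to a VStack).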
arXiv Detail & Related papers (2024-09-25T06:19:54Z) - Inference Optimization of Foundation Models on AI Accelerators [68.24450520773688]
Powerful foundation models, including large language models (LLMs), with Transformer architectures have ushered in a new era of Generative AI.
As the number of model parameters reaches hundreds of billions, their deployment incurs prohibitive inference costs and high latency in real-world scenarios.
This tutorial offers a comprehensive discussion on complementary inference optimization techniques using AI accelerators.
arXiv Detail & Related papers (2024-07-12T09:24:34Z) - CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation [61.68049335444254]
Multimodal large language models (MLLMs) have shown remarkable potential as human-like autonomous language agents to interact with real-world environments.
We propose a Comprehensive Cognitive LLM Agent, CoCo-Agent, with two novel approaches: comprehensive environment perception (CEP) and conditional action prediction (CAP).
With our technical design, our agent achieves new state-of-the-art performance on AITW and META-GUI benchmarks, showing promising abilities in realistic scenarios.
arXiv Detail & Related papers (2024-02-19T08:29:03Z) - What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning [115.19451843294154]
Visual instruction tuning is an essential approach to improving the zero-shot generalization capability of Multi-modal Large Language Models (MLLMs).
We propose a systematic approach to automatically creating high-quality complex visual reasoning instructions.
Our dataset consistently enhances the performance of all the compared MLLMs, e.g., improving the performance of MiniGPT-4 and BLIP-2 on MME-Cognition by 32.6% and 28.8%, respectively.
arXiv Detail & Related papers (2023-11-02T15:36:12Z) - EGFE: End-to-end Grouping of Fragmented Elements in UI Designs with Multimodal Learning [10.885275494978478]
Grouping fragmented elements can greatly improve the readability and maintainability of the generated code.
Current methods employ a two-stage strategy that introduces hand-crafted rules to group fragmented elements.
We propose EGFE, a novel method for automatic End-to-end Grouping of Fragmented Elements via UI sequence prediction.
arXiv Detail & Related papers (2023-09-18T15:28:12Z) - SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks [81.9962823875981]
We introduce SwiftSage, a novel agent framework inspired by the dual-process theory of human cognition.
The framework comprises two primary modules: the Swift module, representing fast and intuitive thinking, and the Sage module, emulating deliberate thought processes.
In 30 tasks from the ScienceWorld benchmark, SwiftSage significantly outperforms other methods such as SayCan, ReAct, and Reflexion.
arXiv Detail & Related papers (2023-05-27T07:04:15Z)
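The SwiftSage entry above describes a dual-process design with a fast Swift module and a deliberate Sage module. The TypeScript sketch below shows one plausible hand-off rule, escalating to the slow module after repeated failures; the interfaces and the threshold are assumptions for exposition, not the framework's actual logic.

```typescript
// Illustrative sketch of a dual-process agent step in the spirit of SwiftSage;
// the Module interface and the escalation rule are assumptions for exposition.

interface Module {
  propose(observation: string): string; // returns the next action to take
}

// Swift: fast, cheap policy used by default.
// Sage: slower, deliberate planner, invoked when the fast path runs into trouble.
function step(
  swift: Module,
  sage: Module,
  observation: string,
  recentFailures: number,
  failureThreshold = 2, // assumed escalation threshold
): string {
  const module = recentFailures >= failureThreshold ? sage : swift;
  return module.propose(observation);
}

// Toy example modules and two calls: the first stays on the fast path,
// the second escalates to the deliberate module.
const swift: Module = { propose: obs => `quick action for: ${obs}` };
const sage: Module = { propose: obs => `re-plan carefully given: ${obs}` };
console.log(step(swift, sage, "door is locked", 0));
console.log(step(swift, sage, "door is locked", 3));
```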
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.