Design2Code: How Far Are We From Automating Front-End Engineering?
- URL: http://arxiv.org/abs/2403.03163v1
- Date: Tue, 5 Mar 2024 17:56:27 GMT
- Title: Design2Code: How Far Are We From Automating Front-End Engineering?
- Authors: Chenglei Si, Yanzhe Zhang, Zhengyuan Yang, Ruibo Liu, Diyi Yang
- Abstract summary: We formalize this as a Design2Code task and conduct comprehensive benchmarking.
Specifically, we manually curate a benchmark of 484 diverse real-world webpages as test cases.
We develop a suite of multimodal prompting methods and show their effectiveness on GPT-4V and Gemini Pro Vision.
Both human evaluation and automatic metrics show that GPT-4V performs the best on this task compared to other models.
- Score: 83.06100360864502
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generative AI has made rapid advancements in recent years, achieving
unprecedented capabilities in multimodal understanding and code generation.
This can enable a new paradigm of front-end development, in which multimodal
LLMs might directly convert visual designs into code implementations. In this
work, we formalize this as a Design2Code task and conduct comprehensive
benchmarking. Specifically, we manually curate a benchmark of 484 diverse
real-world webpages as test cases and develop a set of automatic evaluation
metrics to assess how well current multimodal LLMs can generate the code
implementations that directly render into the given reference webpages, given
the screenshots as input. We also complement automatic metrics with
comprehensive human evaluations. We develop a suite of multimodal prompting
methods and show their effectiveness on GPT-4V and Gemini Pro Vision. We
further finetune an open-source Design2Code-18B model that successfully matches
the performance of Gemini Pro Vision. Both human evaluation and automatic
metrics show that GPT-4V performs the best on this task compared to other
models. Moreover, annotators think GPT-4V generated webpages can replace the
original reference webpages in 49% of cases in terms of visual appearance and
content; and perhaps surprisingly, in 64% of cases GPT-4V generated webpages
are considered better than the original reference webpages. Our fine-grained
break-down metrics indicate that open-source models mostly lag in recalling
visual elements from the input webpages and in generating correct layout
designs, while aspects like text content and coloring can be drastically
improved with proper finetuning.
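The automatic metrics described above compare rendered outputs against reference screenshots along fine-grained axes (visual element recall, layout, text content, coloring). As a rough illustration of the screenshot-comparison idea only, not the paper's actual metric suite, a minimal pixel-level similarity between two same-sized screenshots could look like:

```python
def pixel_similarity(img_a, img_b):
    """img_a, img_b: same-sized 2-D grids of (r, g, b) tuples.

    Returns 1 - mean absolute per-channel difference, normalized so that
    1.0 means identical screenshots and 0.0 means maximally different.
    """
    total_diff, channels = 0, 0
    for row_a, row_b in zip(img_a, img_b):
        for px_a, px_b in zip(row_a, row_b):
            total_diff += sum(abs(a - b) for a, b in zip(px_a, px_b))
            channels += 3
    return 1.0 - total_diff / (255.0 * channels)

# Two toy 2x2 "screenshots": an all-white reference and an all-black rendering.
white = [[(255, 255, 255)] * 2 for _ in range(2)]
black = [[(0, 0, 0)] * 2 for _ in range(2)]
print(pixel_similarity(white, white))  # 1.0
print(pixel_similarity(white, black))  # 0.0
```

The paper's fine-grained break-down goes well beyond this kind of raw pixel match, scoring matched visual blocks, text, position, and color separately.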
Related papers
- Automatically Generating UI Code from Screenshot: A Divide-and-Conquer-Based Approach [51.522121376987634]
We propose DCGen, a divide-and-conquer-based approach to automate the translation of webpage design to UI code.
DCGen starts by dividing screenshots into manageable segments, generating descriptions for each segment, and then reassembling them into complete UI code for the entire screenshot.
We conduct extensive testing with a dataset of real-world websites and various MLLMs and demonstrate that DCGen achieves up to a 14% improvement in visual similarity over competing methods.
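The divide-and-conquer loop described above can be sketched as follows; `generate_segment_html` is a hypothetical stand-in for the actual MLLM call, and the per-segment description step DCGen performs is omitted for brevity:

```python
def divide_and_generate(screenshot_rows, n_segments, generate_segment_html):
    """Split a screenshot (a list of pixel rows) into horizontal segments,
    generate UI code for each segment, and reassemble the pieces into a
    single document. A toy sketch of the divide-and-conquer idea."""
    step = max(1, len(screenshot_rows) // n_segments)
    parts = [
        generate_segment_html(screenshot_rows[top:top + step])
        for top in range(0, len(screenshot_rows), step)
    ]
    return "<body>\n" + "\n".join(parts) + "\n</body>"

# Stub model call: report the segment's height instead of real HTML.
stub = lambda seg: f"<div><!-- segment of {len(seg)} rows --></div>"
print(divide_and_generate([[0]] * 6, n_segments=3, generate_segment_html=stub))
```

A real implementation would split on visual boundaries rather than fixed heights and would merge the generated fragments more carefully.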
arXiv Detail & Related papers (2024-06-24T07:58:36Z)
- GPT-4V(ision) is a Generalist Web Agent, if Grounded [20.940613419944015]
We show that GPT-4V can successfully complete 51.1% of the tasks on live websites if we manually ground its textual plans into actions on the websites.
We propose SEEACT, a web agent that harnesses the power of LMMs for integrated visual understanding and acting on the web.
arXiv Detail & Related papers (2024-01-03T08:33:09Z)
- From Concept to Manufacturing: Evaluating Vision-Language Models for Engineering Design [5.268919870502001]
This paper presents a comprehensive evaluation of vision-language models (VLMs) across a spectrum of engineering design tasks.
Specifically in this paper, we assess the capabilities of two VLMs, GPT-4V and LLaVA 1.6 34B, in design tasks such as sketch similarity analysis, CAD generation, topology optimization, manufacturability assessment, and engineering textbook problems.
arXiv Detail & Related papers (2023-11-21T15:20:48Z)
- The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) [121.42924593374127]
We analyze the latest model, GPT-4V, to deepen the understanding of LMMs.
GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs makes it a powerful multimodal generalist system.
GPT-4V's unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods.
arXiv Detail & Related papers (2023-09-29T17:34:51Z)
- Visual Programming for Text-to-Image Generation and Evaluation [73.12069620086311]
We propose two novel interpretable/explainable visual programming frameworks for text-to-image (T2I) generation and evaluation.
First, we introduce VPGen, an interpretable step-by-step T2I generation framework that decomposes T2I generation into three steps: object/count generation, layout generation, and image generation.
Second, we introduce VPEval, an interpretable and explainable evaluation framework for T2I generation based on visual programming.
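VPGen's three-step decomposition can be pictured as a pipeline of model calls; the stage functions below are hypothetical stand-ins for the actual models, shown only to make the staged structure concrete:

```python
def vpgen_pipeline(prompt, gen_objects, gen_layout, gen_image):
    """Three-step interpretable T2I generation: (1) object/count
    generation, (2) layout generation, (3) layout-conditioned image
    generation. Each gen_* argument stands in for a model call."""
    objects = gen_objects(prompt)   # step 1: which objects, how many
    layout = gen_layout(objects)    # step 2: a bounding box per object
    return gen_image(prompt, layout)  # step 3: render from the layout

# Toy stand-ins that just record the intermediate structure.
objs = lambda p: ["dog", "dog"]  # pretend the prompt asked for two dogs
boxes = lambda o: [(name, (i * 50, 0, 50, 50)) for i, name in enumerate(o)]
img = lambda p, layout: f"image with {len(layout)} placed objects"
print(vpgen_pipeline("two dogs", objs, boxes, img))
```

The value of the decomposition is that each intermediate output (object list, layout) is inspectable, which is what makes the framework interpretable.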
arXiv Detail & Related papers (2023-05-24T16:42:17Z)
- Learning UI-to-Code Reverse Generator Using Visual Critic Without Rendering [18.74127660489501]
We propose a novel vision-code transformer (ViCT) composed of a vision encoder processing the screenshots and a language decoder to generate the code.
They can be initialized with pre-trained models, but aligning the two modalities requires end-to-end finetuning.
ViCT can achieve comparable performance as when using a larger decoder such as LLaMA.
arXiv Detail & Related papers (2023-05-24T02:17:32Z)
- Multimodal Web Navigation with Instruction-Finetuned Foundation Models [99.14209521903854]
We study data-driven offline training for web agents with vision-language foundation models.
We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages.
We empirically demonstrate this recipe improves the agent's ability of grounded multimodal perception, HTML comprehension, and multi-step reasoning.
arXiv Detail & Related papers (2023-05-19T17:44:34Z)
- Visual Instruction Tuning [79.70923292053097]
We present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data.
By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant.
When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%.
arXiv Detail & Related papers (2023-04-17T17:59:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.