Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping
- URL: http://arxiv.org/abs/2410.16232v1
- Date: Mon, 21 Oct 2024 17:39:49 GMT
- Title: Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping
- Authors: Ryan Li, Yanzhe Zhang, Diyi Yang
- Abstract summary: We introduce Sketch2Code, a benchmark that evaluates state-of-the-art Vision Language Models (VLMs) on automating the conversion of rudimentary sketches into webpage prototypes.
We analyze ten commercial and open-source models, showing that Sketch2Code is challenging for existing VLMs.
A user study with UI/UX experts reveals a significant preference for proactive question-asking over passive feedback reception.
- Score: 55.98643055756135
- Abstract: Sketches are a natural and accessible medium for UI designers to conceptualize early-stage ideas. However, existing research on UI/UX automation often requires high-fidelity inputs like Figma designs or detailed screenshots, limiting accessibility and impeding efficient design iteration. To bridge this gap, we introduce Sketch2Code, a benchmark that evaluates state-of-the-art Vision Language Models (VLMs) on automating the conversion of rudimentary sketches into webpage prototypes. Beyond end-to-end benchmarking, Sketch2Code supports interactive agent evaluation that mimics real-world design workflows, where a VLM-based agent iteratively refines its generations by communicating with a simulated user, either passively receiving feedback instructions or proactively asking clarification questions. We comprehensively analyze ten commercial and open-source models, showing that Sketch2Code is challenging for existing VLMs; even the most capable models struggle to accurately interpret sketches and formulate effective questions that lead to steady improvement. Nevertheless, a user study with UI/UX experts reveals a significant preference for proactive question-asking over passive feedback reception, highlighting the need to develop more effective paradigms for multi-turn conversational agents.
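The interactive setting described in the abstract can be pictured as a simple multi-turn loop: the agent renders the sketch into an HTML prototype, then either passively receives feedback from a simulated user or proactively asks a clarification question before regenerating. The snippet below is a minimal illustrative sketch of that loop only; the function names, data structure, and two-mode interface are assumptions made for exposition, not the benchmark's actual API.

```python
# Minimal sketch of a multi-turn sketch-to-code refinement loop, assuming
# hypothetical helpers (generate_html, give_feedback, ask_question,
# answer_question) stand in for VLM and simulated-user calls.

from dataclasses import dataclass, field


@dataclass
class DialogueState:
    sketch: str                      # identifier of the input sketch
    history: list = field(default_factory=list)
    current_html: str = ""


def generate_html(state: DialogueState) -> str:
    """Placeholder for a VLM call that turns the sketch plus dialogue
    history into an HTML prototype."""
    return f"<html><!-- prototype for {state.sketch}, turn {len(state.history)} --></html>"


def give_feedback(state: DialogueState) -> str:
    """Placeholder for the simulated user's feedback on the latest prototype."""
    return "Make the header span the full width and move the sidebar to the left."


def ask_question(state: DialogueState) -> str:
    """Placeholder for the agent proactively asking a clarification question."""
    return "Should the three boxes under the hero image be equal-width columns?"


def answer_question(question: str) -> str:
    """Placeholder for the simulated user answering the agent's question."""
    return "Yes, three equal-width columns."


def run_episode(sketch: str, mode: str = "feedback", turns: int = 3) -> DialogueState:
    """Run one episode in either passive 'feedback' or proactive 'question' mode."""
    state = DialogueState(sketch=sketch)
    for _ in range(turns):
        state.current_html = generate_html(state)
        if mode == "feedback":
            # Passive mode: the simulated user critiques the current prototype.
            state.history.append(("user_feedback", give_feedback(state)))
        else:
            # Proactive mode: the agent asks, the simulated user answers.
            question = ask_question(state)
            state.history.append(("agent_question", question))
            state.history.append(("user_answer", answer_question(question)))
    return state


if __name__ == "__main__":
    final_state = run_episode("example_sketch.png", mode="question")
    print(len(final_state.history), "dialogue events recorded")
```

In this toy setup, the two modes differ only in who drives each turn; in the benchmark itself, the quality of the regenerated prototype after each turn is what is evaluated.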
Related papers
- NODE-Adapter: Neural Ordinary Differential Equations for Better Vision-Language Reasoning [30.440574052935407]
Existing methods encounter three major challenges in vision-language reasoning.
We propose a novel method called NODE-Adapter, which utilizes Neural Ordinary Differential Equations for better vision-language reasoning.
Our experimental results, which cover few-shot classification, domain generalization, and visual reasoning on human-object interaction, demonstrate that the proposed method significantly outperforms existing state-of-the-art approaches.
arXiv Detail & Related papers (2024-07-11T17:04:19Z)
- Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing text in a source language into an image containing the translation in a target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves competitive performance compared to cascaded models with only 70.9% of parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)
- Evaluating how interactive visualizations can assist in finding samples where and how computer vision models make mistakes [1.76602679361245]
We present two interactive visualizations in the context of Sprite, a system for creating Computer Vision (CV) models.
We study how these visualizations help Sprite's users identify (evaluate) and select (plan) images where a model is struggling, which can lead to improved performance.
arXiv Detail & Related papers (2023-05-19T14:43:00Z)
- Evaluation of Sketch-Based and Semantic-Based Modalities for Mockup Generation [15.838427479984926]
Design mockups are essential instruments for visualizing and testing design ideas.
We present and evaluate two different modalities for generating mockups based on hand-drawn sketches.
Our results show that sketch-based generation was more intuitive and expressive, while semantic-based generative AI obtained better results in terms of quality and fidelity.
arXiv Detail & Related papers (2023-03-22T16:47:36Z)
- fAIlureNotes: Supporting Designers in Understanding the Limits of AI Models for Computer Vision Tasks [32.53515595703429]
fAIlureNotes is a designer-centered failure exploration and analysis tool.
It supports designers in evaluating models and identifying failures across diverse user groups and scenarios.
arXiv Detail & Related papers (2023-02-22T23:41:36Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
- Distilled Dual-Encoder Model for Vision-Language Understanding [50.42062182895373]
We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks.
We show that applying the cross-modal attention distillation for both pre-training and fine-tuning stages achieves further improvements.
arXiv Detail & Related papers (2021-12-16T09:21:18Z)
- VisQA: X-raying Vision and Language Reasoning in Transformers [10.439369423744708]
Recent research has shown that state-of-the-art models tend to produce answers exploiting biases and shortcuts in the training data.
We present VisQA, a visual analytics tool that explores this question of reasoning vs. bias exploitation.
arXiv Detail & Related papers (2021-04-02T08:08:25Z)