Visual Prompting with Iterative Refinement for Design Critique Generation
- URL: http://arxiv.org/abs/2412.16829v1
- Date: Sun, 22 Dec 2024 02:35:57 GMT
- Title: Visual Prompting with Iterative Refinement for Design Critique Generation
- Authors: Peitong Duan, Chin-Yi Chen, Bjoern Hartmann, Yang Li
- Abstract summary: We propose an iterative visual prompting approach for UI critique.
It generates a list of design comments, along with corresponding bounding boxes that map each comment to a specific region in a screenshot.
We evaluated our approach using Gemini-1.5-pro and GPT-4o, and found that human experts generally preferred the design critiques generated by our pipeline.
- Score: 7.666790719374632
- License:
- Abstract: Feedback is crucial for every design process, such as user interface (UI) design, and automating design critiques can significantly improve the efficiency of the design workflow. Although existing multimodal large language models (LLMs) excel in many tasks, they often struggle with generating high-quality design critiques -- a complex task that requires producing detailed design comments that are visually grounded in a given design's image. Building on recent advancements in iterative refinement of text output and visual prompting methods, we propose an iterative visual prompting approach for UI critique that takes an input UI screenshot and design guidelines and generates a list of design comments, along with corresponding bounding boxes that map each comment to a specific region in the screenshot. The entire process is driven completely by LLMs, which iteratively refine both the text output and bounding boxes using few-shot samples tailored for each step. We evaluated our approach using Gemini-1.5-pro and GPT-4o, and found that human experts generally preferred the design critiques generated by our pipeline over those by the baseline, with the pipeline reducing the gap from human performance by 50% for one rating metric. To assess the generalizability of our approach to other multimodal tasks, we applied our pipeline to open-vocabulary object and attribute detection, and experiments showed that our method also outperformed the baseline.
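The abstract describes a pipeline in which an LLM drafts design comments with bounding boxes and then iteratively refines both. Below is a minimal sketch of such a loop, based only on the abstract: the `Critique` dataclass, the prompts, and the `call_multimodal_llm` hook are hypothetical stand-ins for whatever client (e.g. Gemini-1.5-pro or GPT-4o) is used, not the authors' actual API or prompt design.

```python
# Hypothetical sketch of an iterative visual-prompting loop for UI critique.
from dataclasses import dataclass
from typing import List


@dataclass
class Critique:
    comment: str   # design comment tied to a guideline
    bbox: tuple    # (x, y, width, height) in screenshot pixel coordinates


def call_multimodal_llm(prompt: str, image_bytes: bytes) -> List[Critique]:
    """Placeholder for a multimodal LLM call; swap in a real client."""
    raise NotImplementedError


def critique_ui(screenshot: bytes, guidelines: str,
                few_shot: str, rounds: int = 3) -> List[Critique]:
    # Step 1: draft design comments with rough bounding boxes,
    # conditioned on the guidelines and few-shot examples.
    critiques = call_multimodal_llm(
        f"Guidelines:\n{guidelines}\n\nExamples:\n{few_shot}\n"
        "List design issues in this UI screenshot; give each a bounding box.",
        screenshot,
    )
    # Steps 2..n: the LLM refines both the comment text and the boxes,
    # conditioned on its previous output and step-specific examples.
    for _ in range(rounds):
        critiques = call_multimodal_llm(
            "Refine these critiques: tighten the wording and adjust each "
            f"bounding box to cover only the relevant region.\n{critiques}",
            screenshot,
        )
    return critiques
```

The separation into a drafting step and repeated refinement steps mirrors the abstract's description of iterative refinement with few-shot samples tailored to each step; the number of rounds and the prompt wording here are illustrative only.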
Related papers
- VisPath: Automated Visualization Code Synthesis via Multi-Path Reasoning and Feedback-Driven Optimization [13.964412839566293]
VisPath is a multi-stage framework specially designed to handle underspecified queries.
It first utilizes the initial query to generate diverse reformulated queries via Chain-of-Thought (CoT) prompting.
The refined queries are used to produce candidate visualization scripts, which are then executed to generate multiple images.
arXiv Detail & Related papers (2025-02-16T14:09:42Z) - GLDesigner: Leveraging Multi-Modal LLMs as Designer for Enhanced Aesthetic Text Glyph Layouts [53.568057283934714]
We propose a VLM-based framework that generates content-aware text logo layouts.
We introduce two model techniques to reduce the computation for processing multiple glyph images simultaneously.
To support instruction-tuning of our model, we construct two extensive text logo datasets, which are 5x larger than the existing public dataset.
arXiv Detail & Related papers (2024-11-18T10:04:10Z) - Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping [55.98643055756135]
We introduce Sketch2Code, a benchmark that evaluates state-of-the-art Vision Language Models (VLMs) on automating the conversion of rudimentary sketches into webpage prototypes.
We analyze ten commercial and open-source models, showing that Sketch2Code is challenging for existing VLMs.
A user study with UI/UX experts reveals a significant preference for proactive question-asking over passive feedback reception.
arXiv Detail & Related papers (2024-10-21T17:39:49Z) - PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM [58.67882997399021]
Our research introduces a unified framework for automated graphic layout generation.
Our data-driven method employs structured text (JSON format) and visual instruction tuning to generate layouts.
We develop an automated text-to-poster system that generates editable posters based on users' design intentions.
arXiv Detail & Related papers (2024-06-05T03:05:52Z) - DesignProbe: A Graphic Design Benchmark for Multimodal Large Language Models [35.10231741092462]
A well-executed graphic design typically achieves harmony at two levels, from the fine-grained design elements (color, font, and layout) to the overall design.
With the rapid development of Multimodal Large Language Models (MLLMs), we establish the DesignProbe, a benchmark to investigate the capability of MLLMs in design.
arXiv Detail & Related papers (2024-04-23T07:31:19Z) - Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering [74.99736967448423]
We construct Design2Code - the first real-world benchmark for this task.
We manually curate 484 diverse real-world webpages as test cases and develop a set of automatic evaluation metrics.
Our fine-grained break-down metrics indicate that models mostly lag in recalling visual elements from the input webpages and generating correct layout designs.
arXiv Detail & Related papers (2024-03-05T17:56:27Z) - De-fine: Decomposing and Refining Visual Programs with Auto-Feedback [75.62712247421146]
De-fine is a training-free framework that decomposes complex tasks into simpler subtasks and refines programs through auto-feedback.
Our experiments across various visual tasks show that De-fine creates more robust programs.
arXiv Detail & Related papers (2023-11-21T06:24:09Z) - Investigating Positive and Negative Qualities of Human-in-the-Loop Optimization for Designing Interaction Techniques [55.492211642128446]
Designers reportedly struggle with design optimization tasks where they are asked to find a combination of design parameters that maximizes a given set of objectives.
Model-based computational design algorithms assist designers by generating design examples during the design process.
Black box methods for assistance, on the other hand, can work with any design problem.
arXiv Detail & Related papers (2022-04-15T20:40:43Z) - Creating User Interface Mock-ups from High-Level Text Descriptions with Deep-Learning Models [19.63933191791183]
We introduce three deep-learning techniques to create low-fidelity UI mock-ups from a natural language phrase.
We quantitatively and qualitatively compare and contrast each method's ability in suggesting coherent, diverse and relevant UI design mock-ups.
arXiv Detail & Related papers (2021-10-14T23:48:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.