Visual Prompting with Iterative Refinement for Design Critique Generation
- URL: http://arxiv.org/abs/2412.16829v1
- Date: Sun, 22 Dec 2024 02:35:57 GMT
- Title: Visual Prompting with Iterative Refinement for Design Critique Generation
- Authors: Peitong Duan, Chin-Yi Chen, Bjoern Hartmann, Yang Li
- Abstract summary: We propose an iterative visual prompting approach for UI critique.
It generates a list of design comments, along with corresponding bounding boxes that map each comment to a specific region in a screenshot.
We evaluated our approach using Gemini-1.5-pro and GPT-4o, and found that human experts generally preferred the design critiques generated by our pipeline.
- Score: 7.666790719374632
- License:
- Abstract: Feedback is crucial for every design process, such as user interface (UI) design, and automating design critiques can significantly improve the efficiency of the design workflow. Although existing multimodal large language models (LLMs) excel in many tasks, they often struggle with generating high-quality design critiques -- a complex task that requires producing detailed design comments that are visually grounded in a given design's image. Building on recent advancements in iterative refinement of text output and visual prompting methods, we propose an iterative visual prompting approach for UI critique that takes an input UI screenshot and design guidelines and generates a list of design comments, along with corresponding bounding boxes that map each comment to a specific region in the screenshot. The entire process is driven completely by LLMs, which iteratively refine both the text output and bounding boxes using few-shot samples tailored for each step. We evaluated our approach using Gemini-1.5-pro and GPT-4o, and found that human experts generally preferred the design critiques generated by our pipeline over those by the baseline, with the pipeline reducing the gap from human performance by 50% for one rating metric. To assess the generalizability of our approach to other multimodal tasks, we applied our pipeline to open-vocabulary object and attribute detection, and experiments showed that our method also outperformed the baseline.
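The abstract describes a pipeline in which an LLM drafts design comments with bounding boxes and then iteratively refines both. Below is a minimal sketch of such a loop, based only on the abstract: the `Critique` dataclass, the prompts, and the `call_multimodal_llm` hook are hypothetical stand-ins for whatever client (e.g. Gemini-1.5-pro or GPT-4o) is used, not the authors' actual API or prompt design.

```python
# Hypothetical sketch of an iterative visual-prompting loop for UI critique.
from dataclasses import dataclass
from typing import List


@dataclass
class Critique:
    comment: str   # design comment tied to a guideline
    bbox: tuple    # (x, y, width, height) in screenshot pixel coordinates


def call_multimodal_llm(prompt: str, image_bytes: bytes) -> List[Critique]:
    """Placeholder for a multimodal LLM call; swap in a real client."""
    raise NotImplementedError


def critique_ui(screenshot: bytes, guidelines: str,
                few_shot: str, rounds: int = 3) -> List[Critique]:
    # Step 1: draft design comments with rough bounding boxes,
    # conditioned on the guidelines and few-shot examples.
    critiques = call_multimodal_llm(
        f"Guidelines:\n{guidelines}\n\nExamples:\n{few_shot}\n"
        "List design issues in this UI screenshot; give each a bounding box.",
        screenshot,
    )
    # Steps 2..n: the LLM refines both the comment text and the boxes,
    # conditioned on its previous output and step-specific examples.
    for _ in range(rounds):
        critiques = call_multimodal_llm(
            "Refine these critiques: tighten the wording and adjust each "
            f"bounding box to cover only the relevant region.\n{critiques}",
            screenshot,
        )
    return critiques
```

The separation into a drafting step and repeated refinement steps mirrors the abstract's description of iterative refinement with few-shot samples tailored to each step; the number of rounds and the prompt wording here are illustrative only.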
Related papers
- VisPath: Automated Visualization Code Synthesis via Multi-Path Reasoning and Feedback-Driven Optimization [13.964412839566293]
VisPath is a multi-stage framework specially designed to handle underspecified queries.
It first utilizes the initial query to generate diverse reformulated queries via Chain-of-Thought (CoT) prompting.
The refined queries are used to produce candidate visualization scripts, which are then executed to generate multiple images.
arXiv Detail & Related papers (2025-02-16T14:09:42Z) - GLDesigner: Leveraging Multi-Modal LLMs as Designer for Enhanced Aesthetic Text Glyph Layouts [53.568057283934714]
We propose a VLM-based framework that generates content-aware text logo layouts.
We introduce two model techniques to reduce the computation for processing multiple glyph images simultaneously.
To support instruction-tuning of our model, we construct two extensive text logo datasets, which are 5x larger than the existing public dataset.
arXiv Detail & Related papers (2024-11-18T10:04:10Z) - Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping [55.98643055756135]
We introduce Sketch2Code, a benchmark that evaluates state-of-the-art Vision Language Models (VLMs) on automating the conversion of rudimentary sketches into webpage prototypes.
We analyze ten commercial and open-source models, showing that Sketch2Code is challenging for existing VLMs.
A user study with UI/UX experts reveals a significant preference for proactive question-asking over passive feedback reception.
arXiv Detail & Related papers (2024-10-21T17:39:49Z) - PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM [58.67882997399021]
Our research introduces a unified framework for automated graphic layout generation.
Our data-driven method employs structured text (JSON format) and visual instruction tuning to generate layouts.
We develop an automated text-to-poster system that generates editable posters based on users' design intentions.
arXiv Detail & Related papers (2024-06-05T03:05:52Z) - DesignProbe: A Graphic Design Benchmark for Multimodal Large Language Models [35.10231741092462]
A well-executed graphic design typically achieves harmony at two levels, from the fine-grained design elements (color, font, and layout) to the overall design.
With the rapid development of Multimodal Large Language Models (MLLMs), we establish the DesignProbe, a benchmark to investigate the capability of MLLMs in design.
arXiv Detail & Related papers (2024-04-23T07:31:19Z) - Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering [74.99736967448423]
We construct Design2Code - the first real-world benchmark for this task.
We manually curate 484 diverse real-world webpages as test cases and develop a set of automatic evaluation metrics.
Our fine-grained break-down metrics indicate that models mostly lag in recalling visual elements from the input webpages and generating correct layout designs.
arXiv Detail & Related papers (2024-03-05T17:56:27Z) - De-fine: Decomposing and Refining Visual Programs with Auto-Feedback [75.62712247421146]
De-fine is a training-free framework that decomposes complex tasks into simpler subtasks and refines programs through auto-feedback.
Our experiments across various visual tasks show that De-fine creates more robust programs.
arXiv Detail & Related papers (2023-11-21T06:24:09Z) - Investigating Positive and Negative Qualities of Human-in-the-Loop Optimization for Designing Interaction Techniques [55.492211642128446]
Designers reportedly struggle with design optimization tasks where they are asked to find a combination of design parameters that maximizes a given set of objectives.
Model-based computational design algorithms assist designers by generating design examples during the design process.
Black box methods for assistance, on the other hand, can work with any design problem.
arXiv Detail & Related papers (2022-04-15T20:40:43Z) - Creating User Interface Mock-ups from High-Level Text Descriptions with Deep-Learning Models [19.63933191791183]
We introduce three deep-learning techniques to create low-fidelity UI mock-ups from a natural language phrase.
We quantitatively and qualitatively compare and contrast each method's ability in suggesting coherent, diverse and relevant UI design mock-ups.
arXiv Detail & Related papers (2021-10-14T23:48:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.