UICrit: Enhancing Automated Design Evaluation with a UICritique Dataset
- URL: http://arxiv.org/abs/2407.08850v3
- Date: Tue, 13 Aug 2024 23:41:43 GMT
- Title: UICrit: Enhancing Automated Design Evaluation with a UICritique Dataset
- Authors: Peitong Duan, Chin-yi Chen, Gang Li, Bjoern Hartmann, Yang Li
- Abstract summary: We present a targeted dataset of 3,059 design critiques and quality ratings for 983 mobile UIs.
We apply this dataset to achieve a 55% performance gain in LLM-generated UI feedback.
We discuss future applications of this dataset, including training a reward model for generative UI techniques.
- Score: 10.427243347670965
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated UI evaluation can be beneficial for the design process; for example, to compare different UI designs, or conduct automated heuristic evaluation. LLM-based UI evaluation, in particular, holds the promise of generalizability to a wide variety of UI types and evaluation tasks. However, current LLM-based techniques do not yet match the performance of human evaluators. We hypothesize that automatic evaluation can be improved by collecting a targeted UI feedback dataset and then using this dataset to enhance the performance of general-purpose LLMs. We present a targeted dataset of 3,059 design critiques and quality ratings for 983 mobile UIs, collected from seven experienced designers. We carried out an in-depth analysis to characterize the dataset's features. We then applied this dataset to achieve a 55% performance gain in LLM-generated UI feedback via various few-shot and visual prompting techniques. We also discuss future applications of this dataset, including training a reward model for generative UI techniques, and fine-tuning a tool-agnostic multi-modal LLM that automates UI evaluation.
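The abstract mentions few-shot prompting with the collected critiques but does not give the prompt format. The following is a minimal sketch of how few-shot examples might be assembled into a text prompt. The `CritiqueExample` record, the `build_few_shot_prompt` helper, the text stand-in for a screenshot, and the sample data are all hypothetical illustrations, not the authors' method or the dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass
class CritiqueExample:
    """Hypothetical record: one UI with designer critiques and a rating."""
    ui_description: str      # assumed text stand-in for a UI screenshot
    critiques: list[str]     # free-text design critiques
    quality_rating: int      # assumed numeric quality score


def build_few_shot_prompt(examples, target_ui, k=2):
    """Format the first k examples as demonstrations, then append the target UI."""
    parts = ["You are an experienced UI designer. Critique the final UI."]
    for i, ex in enumerate(examples[:k], start=1):
        critique_lines = "\n".join(f"- {c}" for c in ex.critiques)
        parts.append(
            f"Example {i}\n"
            f"UI: {ex.ui_description}\n"
            f"Rating: {ex.quality_rating}\n"
            f"Critiques:\n{critique_lines}"
        )
    # The model is expected to continue after the trailing "Critiques:" line.
    parts.append(f"Target\nUI: {target_ui}\nCritiques:")
    return "\n\n".join(parts)


# Invented sample data, for illustration only.
examples = [
    CritiqueExample("Login screen with small gray text", ["Low text contrast"], 4),
    CritiqueExample("Settings list with crowded rows", ["Insufficient padding"], 5),
    CritiqueExample("Checkout page with two primary buttons", ["Unclear hierarchy"], 3),
]
prompt = build_few_shot_prompt(examples, "Profile page with overlapping icons", k=2)
print(prompt)
```

In practice the demonstrations would likely be chosen by similarity to the target UI rather than taken from the front of the list; that selection step is left out of this sketch.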
Related papers
- AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? [55.14033256706175]
Large Vision-Language Models (LVLMs) have become essential for advancing the integration of visual and linguistic information.
We introduce AutoBench-V, an automated framework for serving evaluation on demand.
Through an extensive evaluation of seven popular LVLMs across five demanded user inputs, the framework shows effectiveness and reliability.
arXiv Detail & Related papers (2024-10-28T17:55:08Z)
- RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
- UIClip: A Data-driven Model for Assessing User Interface Design [20.66914084220734]
We develop a machine-learned model, UIClip, for assessing the design quality and visual relevance of a user interface.
We show how UIClip can facilitate downstream applications that rely on instantaneous assessment of UI design quality.
arXiv Detail & Related papers (2024-04-18T20:43:08Z)
- Vi(E)va LLM! A Conceptual Stack for Evaluating and Interpreting Generative AI-based Visualizations [1.709620026135923]
Large language models (LLM) have become an interesting option for supporting generative tasks related to visualization.
This paper addresses the problem of modeling the evaluation of a generated visualization through an LLM.
We propose a theoretical evaluation stack, EvaLLM, that decomposes the evaluation effort into its atomic components.
arXiv Detail & Related papers (2024-02-03T14:28:55Z)
- Large Language Models as Automated Aligners for benchmarking Vision-Language Models [48.4367174400306]
Vision-Language Models (VLMs) have reached a new level of sophistication, showing notable competence in executing intricate cognition and reasoning tasks.
Existing evaluation benchmarks, primarily relying on rigid, hand-crafted datasets, face significant limitations in assessing the alignment of these increasingly anthropomorphic models with human intelligence.
In this work, we address these limitations via Auto-Bench, which explores LLMs as proficient aligners, measuring the alignment between VLMs and human intelligence and values through automatic data curation and assessment.
arXiv Detail & Related papers (2023-11-24T16:12:05Z)
- What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning [115.19451843294154]
Visual instruction tuning is an essential approach to improving the zero-shot generalization capability of Multi-modal Large Language Models (MLLMs).
We propose a systematic approach to automatically creating high-quality complex visual reasoning instructions.
Our dataset consistently enhances the performance of all the compared MLLMs, e.g., improving the performance of MiniGPT-4 and BLIP-2 on MME-Cognition by 32.6% and 28.8%, respectively.
arXiv Detail & Related papers (2023-11-02T15:36:12Z)
- On the Evaluation and Refinement of Vision-Language Instruction Tuning Datasets [71.54954966652286]
We try to evaluate the Vision-Language Instruction-Tuning (VLIT) datasets.
We build a new dataset, REVO-LION, by collecting samples with higher SQ from each dataset.
Remarkably, even with only half of the complete data, the model trained on REVO-LION can achieve performance comparable to simply combining all VLIT datasets.
arXiv Detail & Related papers (2023-10-10T13:01:38Z)
- ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations [13.939350184164017]
Multimodal Vision-Language Models (VLMs) enable powerful applications from their fused understanding of images and language.
We adapt a recipe for generating paired text-image training data for VLMs to the UI domain by combining existing pixel-based methods with a Large Language Model (LLM).
We generate a dataset of 335K conversational examples paired with UIs that cover Q&A, UI descriptions, and planning, and use it to fine-tune a conversational VLM for UI tasks.
arXiv Detail & Related papers (2023-10-07T16:32:34Z)
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
- Towards Better Semantic Understanding of Mobile Interfaces [7.756895821262432]
We release a human-annotated dataset with approximately 500k unique annotations aimed at increasing the understanding of the functionality of UI elements.
This dataset augments images and view hierarchies from RICO, a large dataset of mobile UIs.
We also release models using image-only and multimodal inputs; we experiment with various architectures and study the benefits of using multimodal inputs on the new dataset.
arXiv Detail & Related papers (2022-10-06T03:48:54Z)