Related papers: UICrit: Enhancing Automated Design Evaluation with a UICritique Dataset

UICrit: Enhancing Automated Design Evaluation with a UICritique Dataset

URL: http://arxiv.org/abs/2407.08850v3
Date: Tue, 13 Aug 2024 23:41:43 GMT
Title: UICrit: Enhancing Automated Design Evaluation with a UICritique Dataset
Authors: Peitong Duan, Chin-yi Chen, Gang Li, Bjoern Hartmann, Yang Li,
Abstract summary: We present a targeted dataset of 3,059 design critiques and quality ratings for 983 mobile UIs. We apply this dataset to achieve 55% performance gain in LLM-generated UI feedback. We discuss future applications of this dataset, including training a reward model for generative UI techniques.
Score: 10.427243347670965
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Automated UI evaluation can be beneficial for the design process; for example, to compare different UI designs, or conduct automated heuristic evaluation. LLM-based UI evaluation, in particular, holds the promise of generalizability to a wide variety of UI types and evaluation tasks. However, current LLM-based techniques do not yet match the performance of human evaluators. We hypothesize that automatic evaluation can be improved by collecting a targeted UI feedback dataset and then using this dataset to enhance the performance of general-purpose LLMs. We present a targeted dataset of 3,059 design critiques and quality ratings for 983 mobile UIs, collected from seven experienced designers. We carried out an in-depth analysis to characterize the dataset's features. We then applied this dataset to achieve a 55% performance gain in LLM-generated UI feedback via various few-shot and visual prompting techniques. We also discuss future applications of this dataset, including training a reward model for generative UI techniques, and fine-tuning a tool-agnostic multi-modal LLM that automates UI evaluation.

Related papers

Leveraging Large Language Models for Classifying App Users' Feedback [0.7366405857677226]
We evaluate the capabilities of four advanced LLMs, including GPT-3.5-Turbo, GPT-4, Flan-T5, and Llama3-70b, to enhance user feedback classification.<n>Our findings indicate that LLMs when guided by well-crafted prompts, can effectively classify user feedback into coarse-grained categories.
arXiv Detail & Related papers (2025-07-11T01:33:54Z)
From Rankings to Insights: Evaluation Should Shift Focus from Leaderboard to Feedback [36.68929551237421]
We introduce bftextFeedbacker, an evaluation framework that provides comprehensive and fine-grained results.<n>Our project homepage and dataset are available at https://liudan193.io/Feedbacker.
arXiv Detail & Related papers (2025-05-10T16:52:40Z)
UVE: Are MLLMs Unified Evaluators for AI-Generated Videos? [20.199060287444162]
This work investigates the feasibility of using multimodal large language models (MLLMs) as a unified evaluator for AI-generated videos (AIGVs) UVE-Bench collects videos generated by state-of-the-art VGMs and provides pairwise human preference annotations across 15 evaluation aspects. Our results suggest that while advanced MLLMs still lag behind human evaluators, they demonstrate promising ability in unified AIGV evaluation.
arXiv Detail & Related papers (2025-03-13T01:52:27Z)
PanguIR Technical Report for NTCIR-18 AEOLLM Task [12.061652026366591]
Large language models (LLMs) are increasingly critical and challenging to evaluate. Manual evaluation, while comprehensive, is often costly and resource-intensive. automatic evaluation offers greater scalability but is constrained by the limitations of its evaluation criteria.
arXiv Detail & Related papers (2025-03-04T07:40:02Z)
AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs [54.58905728115257]
We propose the methodname pipeline for automatically annotating UI elements with detailed functionality descriptions at scale. Specifically, we leverage large language models (LLMs) to infer element functionality by comparing the UI content changes before and after simulated interactions with specific UI elements. We construct an methodname-704k dataset using the proposed pipeline, featuring multi-resolution, multi-device screenshots, diverse data domains, and detailed functionality annotations that have never been provided by previous datasets.
arXiv Detail & Related papers (2025-02-04T03:39:59Z)
Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference [63.03859517284341]
An automatic evaluation framework aims to rank LLMs based on their alignment with human preferences. An automatic LLM bencher consists of four components: the input set, the evaluation model, the evaluation type and the aggregation method.
arXiv Detail & Related papers (2024-12-31T17:46:51Z)
AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? [55.14033256706175]
Large Vision-Language Models (LVLMs) have become essential for advancing the integration of visual and linguistic information. We introduce AutoBench-V, an automated framework for serving evaluation on demand. Through an extensive evaluation of seven popular LVLMs across five demanded user inputs, the framework shows effectiveness and reliability.
arXiv Detail & Related papers (2024-10-28T17:55:08Z)
RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation. Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
UIClip: A Data-driven Model for Assessing User Interface Design [20.66914084220734]
We develop a machine-learned model, UIClip, for assessing the design quality and visual relevance of a user interface. We show how UIClip can facilitate downstream applications that rely on instantaneous assessment of UI design quality.
arXiv Detail & Related papers (2024-04-18T20:43:08Z)
Vi(E)va LLM! A Conceptual Stack for Evaluating and Interpreting Generative AI-based Visualizations [1.709620026135923]
Large language models (LLM) have become an interesting option for supporting generative tasks related to visualization. This paper copes with the problem of modeling the evaluation of a generated visualization through an LLM. We propose a theoretical evaluation stack, EvaLLM, that decomposes the evaluation effort in its atomic components.
arXiv Detail & Related papers (2024-02-03T14:28:55Z)
Large Language Models as Automated Aligners for benchmarking Vision-Language Models [48.4367174400306]
Vision-Language Models (VLMs) have reached a new level of sophistication, showing notable competence in executing intricate cognition and reasoning tasks. Existing evaluation benchmarks, primarily relying on rigid, hand-crafted datasets, face significant limitations in assessing the alignment of these increasingly anthropomorphic models with human intelligence. In this work, we address the limitations via Auto-Bench, which delves into exploring LLMs as proficient curation, measuring the alignment betweenVLMs and human intelligence and value through automatic data curation and assessment.
arXiv Detail & Related papers (2023-11-24T16:12:05Z)
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning [115.19451843294154]
Visual instruction tuning is an essential approach to improving the zero-shot generalization capability of Multi-modal Large Language Models (MLLMs) We propose a systematic approach to automatically creating high-quality complex visual reasoning instructions. Our dataset consistently enhances the performance of all the compared MLLMs, e.g., improving the performance of MiniGPT-4 and BLIP-2 on MME-Cognition by 32.6% and 28.8%, respectively.
arXiv Detail & Related papers (2023-11-02T15:36:12Z)
On the Evaluation and Refinement of Vision-Language Instruction Tuning Datasets [71.54954966652286]
We try to evaluate the Vision-Language Instruction-Tuning (VLIT) datasets. We build a new dataset, REVO-LION, by collecting samples with higher SQ from each dataset. Remarkably, even with only half of the complete data, the model trained on REVO-LION can achieve the performance comparable to simply adding all VLIT datasets up.
arXiv Detail & Related papers (2023-10-10T13:01:38Z)
ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations [13.939350184164017]
Multimodal Vision-Language Models (VLMs) enable powerful applications from their fused understanding of images and language. We adapt a recipe for generating paired text-image training data for VLMs to the UI domain by combining existing pixel-based methods with a Large Language Model (LLM) We generate a dataset of 335K conversational examples paired with UIs that cover Q&A, UI descriptions, and planning, and use it to fine-tune a conversational VLM for UI tasks.
arXiv Detail & Related papers (2023-10-07T16:32:34Z)
MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation. Within each loop, the MLLM-DataEngine first analyze the weakness of the model based on the evaluation results. For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data. For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
Towards Better Semantic Understanding of Mobile Interfaces [7.756895821262432]
We release a human-annotated dataset with approximately 500k unique annotations aimed at increasing the understanding of the functionality of UI elements. This dataset augments images and view hierarchies from RICO, a large dataset of mobile UIs. We also release models using image-only and multimodal inputs; we experiment with various architectures and study the benefits of using multimodal inputs on the new dataset.
arXiv Detail & Related papers (2022-10-06T03:48:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.