Related papers: VINS: Visual Search for Mobile User Interface Design

URL: http://arxiv.org/abs/2102.05216v1
Date: Wed, 10 Feb 2021 01:46:33 GMT
Title: VINS: Visual Search for Mobile User Interface Design
Authors: Sara Bunian, Kai Li, Chaima Jemmali, Casper Harteveld, Yun Fu, Magy Seif El-Nasr
Abstract summary: This paper introduces VINS, a visual search framework that takes a UI image as input and retrieves visually similar design examples. The framework achieves a mean Average Precision of 76.39% for UI detection and high performance in querying similar UI designs.
Score: 66.28088601689069
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Searching for relevant mobile user interface (UI) design examples can help interface designers gain inspiration and compare design alternatives. However, finding such design examples is challenging, especially as current search systems rely only on text-based queries and do not take the UI structure and content into account. This paper introduces VINS, a visual search framework that takes a UI image (wireframe or high-fidelity) as input and retrieves visually similar design examples. We first survey interface designers to better understand their example-finding process. We then develop a large-scale UI dataset that provides an accurate specification of the interface's view hierarchy (i.e., all the UI components and their specific locations). Using this dataset, we propose an object-detection-based image retrieval framework that models the UI context and hierarchical structure. The framework achieves a mean Average Precision of 76.39% for UI detection and high performance in querying similar UI designs.

Related papers

AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs [54.58905728115257]
We propose the AutoGUI pipeline for automatically annotating UI elements with detailed functionality descriptions at scale. Specifically, we leverage large language models (LLMs) to infer element functionality by comparing the UI content changes before and after simulated interactions with specific UI elements. We construct an AutoGUI-704k dataset using the proposed pipeline, featuring multi-resolution, multi-device screenshots, diverse data domains, and detailed functionality annotations that have never been provided by previous datasets.
arXiv Detail & Related papers (2025-02-04T03:39:59Z)
Falcon-UI: Understanding GUI Before Following User Instructions [57.67308498231232]
We introduce an instruction-free GUI navigation dataset, termed the Insight-UI dataset, to enhance model comprehension of GUI environments. The Insight-UI dataset is automatically generated from the Common Crawl corpus, simulating various platforms. We develop the GUI agent model Falcon-UI, which is initially pretrained on the Insight-UI dataset and subsequently fine-tuned on Android and Web GUI datasets.
arXiv Detail & Related papers (2024-12-12T15:29:36Z)
ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. We develop a vision-language-action model for the digital world, namely ShowUI. ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z)
Harnessing Webpage UIs for Text-Rich Visual Understanding [112.01029887404296]
We propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs). These instructions are then paired with UI screenshots to train multimodal models. We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts.
arXiv Detail & Related papers (2024-10-17T17:48:54Z)
Tell Me What's Next: Textual Foresight for Generic UI Representations [65.10591722192609]
We propose Textual Foresight, a novel pretraining objective for learning UI screen representations. Textual Foresight generates global text descriptions of future UI states given a current UI and local action taken. We train with our newly constructed mobile app dataset, OpenApp, which results in the first public dataset for app UI representation learning.
arXiv Detail & Related papers (2024-06-12T02:43:19Z)
UIClip: A Data-driven Model for Assessing User Interface Design [20.66914084220734]
We develop a machine-learned model, UIClip, for assessing the design quality and visual relevance of a user interface. We show how UIClip can facilitate downstream applications that rely on instantaneous assessment of UI design quality.
arXiv Detail & Related papers (2024-04-18T20:43:08Z)
OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition [79.852642726105]
We propose a unified paradigm for parsing visually-situated text across diverse scenarios. Specifically, we devise a universal model, called OmniParser, which can simultaneously handle three typical visually-situated text parsing tasks. In OmniParser, all tasks share a unified encoder-decoder architecture, a unified objective (point-conditioned text generation), and a unified input representation.
arXiv Detail & Related papers (2024-03-28T03:51:14Z)
UI Layers Group Detector: Grouping UI Layers via Text Fusion and Box Attention [7.614630088064978]
We propose a vision-based method that automatically detects images (i.e., basic shapes and visual elements) and text layers that present the same semantic meanings. We construct a large-scale UI dataset for training and testing, and present a data augmentation approach to boost the detection performance.
arXiv Detail & Related papers (2022-12-07T03:50:20Z)
Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus [9.401663915424008]
We propose a vision-language model that only takes the screenshot of the UI and a region of interest on the screen as the input. Our experiments show that our model obtains SoTA results on several representative UI tasks and outperforms previous methods.
arXiv Detail & Related papers (2022-09-29T16:45:43Z)
ReverseORC: Reverse Engineering of Resizable User Interface Layouts with OR-Constraints [47.164878414034234]
ReverseORC is a novel reverse engineering (RE) approach to discover diverse layout types and their dynamic resizing behaviours. It can create specifications that replicate even some non-standard layout managers with complex dynamic layout behaviours. It can be used to detect and fix problems in legacy UIs, extend UIs with enhanced layout behaviours, and support the creation of flexible UI layouts.
arXiv Detail & Related papers (2022-02-23T13:57:25Z)
UIBert: Learning Generic Multimodal Representations for UI Understanding [12.931540149350633]
We introduce a transformer-based joint image-text model trained through novel pre-training tasks on large-scale unlabeled UI data. Our key intuition is that the heterogeneous features in a UI are self-aligned, i.e., the image and text features of UI components are predictive of each other. We propose five pre-training tasks utilizing this self-alignment among different features of a UI component and across various components in the same UI. We evaluate our method on nine real-world downstream UI tasks where UIBert outperforms strong multimodal baselines by up to 9.26% accuracy.
arXiv Detail & Related papers (2021-07-29T03:51:36Z)
Magic Layouts: Structural Prior for Component Detection in User Interface Designs [28.394160581239174]
We present Magic Layouts, a method for parsing screenshots or hand-drawn sketches of user interface (UI) layouts. Our core contribution is to extend existing detectors to exploit a learned structural prior for UI designs. We demonstrate the method within the context of an interactive application for rapidly acquiring digital prototypes of user experience (UX) designs.
arXiv Detail & Related papers (2021-06-14T17:20:36Z)

This list is automatically generated from the titles and abstracts of the papers on this site.