VINS: Visual Search for Mobile User Interface Design
- URL: http://arxiv.org/abs/2102.05216v1
- Date: Wed, 10 Feb 2021 01:46:33 GMT
- Title: VINS: Visual Search for Mobile User Interface Design
- Authors: Sara Bunian, Kai Li, Chaima Jemmali, Casper Harteveld, Yun Fu, Magy
Seif El-Nasr
- Abstract summary: This paper introduces VINS, a visual search framework that takes as input a UI image and retrieves visually similar design examples.
The framework achieves a mean Average Precision of 76.39% for UI detection and high performance in querying similar UI designs.
- Score: 66.28088601689069
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Searching for relevant mobile user interface (UI) design examples can aid
interface designers in gaining inspiration and comparing design alternatives.
However, finding such design examples is challenging, especially as current
search systems rely only on text-based queries and do not take the UI
structure and content into account. This paper introduces VINS, a visual search
framework that takes as input a UI image (wireframe or high-fidelity) and
retrieves visually similar design examples. We first survey interface designers
to better understand their example finding process. We then develop a
large-scale UI dataset that provides an accurate specification of the
interface's view hierarchy (i.e., all the UI components and their specific
location). By utilizing this dataset, we propose an object-detection based
image retrieval framework that models the UI context and hierarchical
structure. The framework achieves a mean Average Precision of 76.39% for
UI detection and high performance in querying similar UI designs.
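As a rough, hypothetical illustration of how such an object-detection-based retrieval pipeline fits together (detect UI components, summarize them into a layout embedding, rank a gallery by similarity), the Python sketch below uses an off-the-shelf COCO-pretrained detector as a stand-in for the paper's UI-specific detector; all function names, the simple embedding scheme, and the similarity measure are illustrative assumptions, not the authors' implementation of the hierarchical/contextual modelling described in the abstract.

# Minimal sketch, not the VINS code: detect components, embed the layout,
# and retrieve nearest neighbors by cosine similarity.
import numpy as np
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# A generic pretrained detector stands in for the paper's UI-specific one.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()
NUM_CLASSES = 91  # COCO label space of the stand-in detector

def detect_components(image_path, score_thresh=0.5):
    """Return (boxes, labels) for components detected in a UI screenshot."""
    image = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        out = detector([image])[0]
    keep = out["scores"] > score_thresh
    return out["boxes"][keep].numpy(), out["labels"][keep].numpy()

def layout_embedding(boxes, labels):
    """Encode detections as a per-class histogram plus coarse spatial statistics."""
    hist = np.bincount(labels, minlength=NUM_CLASSES).astype(np.float32)
    if len(boxes):
        centers = (boxes[:, :2] + boxes[:, 2:]) / 2.0
        spatial = np.concatenate([centers.mean(axis=0), centers.std(axis=0)])
    else:
        spatial = np.zeros(4, dtype=np.float32)
    vec = np.concatenate([hist, spatial])
    return vec / (np.linalg.norm(vec) + 1e-8)

def query_similar(gallery_paths, query_path, top_k=5):
    """Rank gallery screenshots by cosine similarity to the query's layout embedding."""
    gallery = np.stack([layout_embedding(*detect_components(p)) for p in gallery_paths])
    q = layout_embedding(*detect_components(query_path))
    scores = gallery @ q
    ranked = np.argsort(-scores)[:top_k]
    return [(gallery_paths[i], float(scores[i])) for i in ranked]

Calling query_similar with a list of gallery screenshot paths and a query screenshot path returns the top-k most similar designs with their scores; the file paths are placeholders for whatever UI collection is being searched.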
Related papers
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity.
We develop ShowUI, a vision-language-action model for the digital world, featuring several innovations.
ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z) - Harnessing Webpage UIs for Text-Rich Visual Understanding [112.01029887404296]
We propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs).
These instructions are then paired with UI screenshots to train multimodal models.
We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts.
arXiv Detail & Related papers (2024-10-17T17:48:54Z) - Tell Me What's Next: Textual Foresight for Generic UI Representations [65.10591722192609]
We propose Textual Foresight, a novel pretraining objective for learning UI screen representations.
Textual Foresight generates global text descriptions of future UI states given a current UI and a local action taken.
We train with our newly constructed mobile app dataset, OpenApp, which results in the first public dataset for app UI representation learning.
arXiv Detail & Related papers (2024-06-12T02:43:19Z) - UIClip: A Data-driven Model for Assessing User Interface Design [20.66914084220734]
We develop a machine-learned model, UIClip, for assessing the design quality and visual relevance of a user interface.
We show how UIClip can facilitate downstream applications that rely on instantaneous assessment of UI design quality.
arXiv Detail & Related papers (2024-04-18T20:43:08Z) - OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition [79.852642726105]
We propose a unified paradigm for parsing visually-situated text across diverse scenarios.
Specifically, we devise a universal model, called OmniParser, which can simultaneously handle three typical visually-situated text parsing tasks.
In OmniParser, all tasks share the unified encoder-decoder architecture, the unified objective of point-conditioned text generation, and the unified input representation.
arXiv Detail & Related papers (2024-03-28T03:51:14Z) - UI Layers Group Detector: Grouping UI Layers via Text Fusion and Box
Attention [7.614630088064978]
We propose a vision-based method that automatically detects images (i.e., basic shapes and visual elements) and text layers that present the same semantic meanings.
We construct a large-scale UI dataset for training and testing, and present a data augmentation approach to boost the detection performance.
arXiv Detail & Related papers (2022-12-07T03:50:20Z) - Spotlight: Mobile UI Understanding using Vision-Language Models with a
Focus [9.401663915424008]
We propose a vision-language model that only takes the screenshot of the UI and a region of interest on the screen as the input.
Our experiments show that our model obtains SoTA results on several representative UI tasks and outperforms previous methods.
arXiv Detail & Related papers (2022-09-29T16:45:43Z) - ReverseORC: Reverse Engineering of Resizable User Interface Layouts with
OR-Constraints [47.164878414034234]
ReverseORC is a novel reverse engineering (RE) approach to discover diverse layout types and their dynamic resizing behaviours.
It can create specifications that replicate even some non-standard layout managers with complex dynamic layout behaviours.
It can be used to detect and fix problems in legacy UIs, extend UIs with enhanced layout behaviours, and support the creation of flexible UI layouts.
arXiv Detail & Related papers (2022-02-23T13:57:25Z) - UIBert: Learning Generic Multimodal Representations for UI Understanding [12.931540149350633]
We introduce a transformer-based joint image-text model trained through novel pre-training tasks on large-scale unlabeled UI data.
Our key intuition is that the heterogeneous features in a UI are self-aligned, i.e., the image and text features of UI components are predictive of each other.
We propose five pretraining tasks utilizing this self-alignment among different features of a UI component and across various components in the same UI (a minimal sketch of this alignment idea appears after this list).
We evaluate our method on nine real-world downstream UI tasks where UIBert outperforms strong multimodal baselines by up to 9.26% accuracy.
arXiv Detail & Related papers (2021-07-29T03:51:36Z) - Magic Layouts: Structural Prior for Component Detection in User
Interface Designs [28.394160581239174]
We present Magic Layouts; a method for parsing screenshots or hand-drawn sketches of user interface (UI) layouts.
Our core contribution is to extend existing detectors to exploit a learned structural prior for UI designs.
We demonstrate this within the context of an interactive application for rapidly acquiring digital prototypes of user experience (UX) designs.
arXiv Detail & Related papers (2021-06-14T17:20:36Z)
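For the self-alignment intuition mentioned in the UIBert entry above, the following is a minimal, hypothetical sketch (not the authors' pretraining code) of a contrastive objective that treats the image and text features of the same UI component as a positive pair and all other pairings in the batch as negatives; the encoders, feature dimension, and temperature are assumptions.

# Minimal sketch of a CLIP-style contrastive alignment between the image and
# text features of UI components; illustrative only.
import torch
import torch.nn.functional as F

def self_alignment_loss(image_feats, text_feats, temperature=0.07):
    """image_feats, text_feats: (N, D) embeddings of the same N UI components."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature  # (N, N) similarity matrix
    targets = torch.arange(image_feats.size(0), device=logits.device)
    # Component i's own text (per row) and own image (per column) are the positives.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random features standing in for real component encoders.
img, txt = torch.randn(8, 256), torch.randn(8, 256)
loss = self_alignment_loss(img, txt)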
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.