Understanding Visual Saliency in Mobile User Interfaces
- URL: http://arxiv.org/abs/2101.09176v1
- Date: Fri, 22 Jan 2021 15:45:13 GMT
- Title: Understanding Visual Saliency in Mobile User Interfaces
- Authors: Luis A. Leiva, Yunfei Xue, Avya Bansal, Hamed R. Tavakoli, Tuğçe Köroğlu, Niraj R. Dayama, Antti Oulasvirta
- Abstract summary: We present findings from a controlled study with 30 participants and 193 mobile UIs.
Results speak to the role of expectations in guiding where users look.
We release the first annotated dataset for investigating visual saliency in mobile UIs.
- Score: 31.278845008743698
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For graphical user interface (UI) design, it is important to understand what
attracts visual attention. While previous work on saliency has focused on
desktop and web-based UIs, mobile app UIs differ from these in several
respects. We present findings from a controlled study with 30 participants and
193 mobile UIs. The results speak to the role of expectations in guiding where
users look. A strong bias toward the top-left corner of the display, text, and
images was evident, while bottom-up features such as color or size affected
saliency less. Classic, parameter-free saliency models showed a weak fit with
the data, and data-driven models improved significantly when trained
specifically on this dataset (e.g., NSS rose from 0.66 to 0.84). We also
release the first annotated dataset for investigating visual saliency in mobile
UIs.
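For context on the NSS figures above: Normalized Scanpath Saliency averages the z-scored values of a predicted saliency map at the pixels users actually fixated, so higher is better. A minimal sketch in Python; the array names and shapes are illustrative and not taken from the paper's released code or dataset format:

```python
import numpy as np

def nss(saliency_map: np.ndarray, fixation_map: np.ndarray) -> float:
    """Normalized Scanpath Saliency: mean z-scored saliency at fixated pixels.

    saliency_map: 2D array of predicted saliency values.
    fixation_map: 2D binary array, 1 where a fixation landed.
    """
    s = saliency_map.astype(np.float64)
    s = (s - s.mean()) / (s.std() + 1e-8)            # z-score the prediction
    return float(s[fixation_map.astype(bool)].mean())  # average at fixations

# Illustrative use (hypothetical arrays): score = nss(predicted_map, fixations)
# The abstract reports NSS rising from 0.66 to 0.84 after retraining a
# data-driven model on this dataset.
```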
Related papers
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity.
We develop a vision-language-action model for the digital world, named ShowUI, which features several innovations.
ShowUI, a lightweight 2B-parameter model trained on 256K samples, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z)
- Tell Me What's Next: Textual Foresight for Generic UI Representations [65.10591722192609]
We propose Textual Foresight, a novel pretraining objective for learning UI screen representations.
Textual Foresight generates global text descriptions of future UI states given a current UI and local action taken.
We train with our newly constructed mobile app dataset, OpenApp, which results in the first public dataset for app UI representation learning.
arXiv Detail & Related papers (2024-06-12T02:43:19Z)
- Towards Better Semantic Understanding of Mobile Interfaces [7.756895821262432]
We release a human-annotated dataset with approximately 500k unique annotations aimed at increasing the understanding of the functionality of UI elements.
This dataset augments images and view hierarchies from RICO, a large dataset of mobile UIs.
We also release models using image-only and multimodal inputs; we experiment with various architectures and study the benefits of using multimodal inputs on the new dataset.
arXiv Detail & Related papers (2022-10-06T03:48:54Z)
- Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus [9.401663915424008]
We propose a vision-language model that only takes the screenshot of the UI and a region of interest on the screen as the input.
Our experiments show that our model obtains SoTA results on several representative UI tasks and outperforms previous methods.
arXiv Detail & Related papers (2022-09-29T16:45:43Z)
- Predicting and Explaining Mobile UI Tappability with Vision Modeling and Saliency Analysis [15.509241935245585]
We use a deep learning based approach to predict whether a selected element in a mobile UI screenshot will be perceived by users as tappable.
We additionally use ML interpretability techniques to help explain the output of our model.
arXiv Detail & Related papers (2022-04-05T18:51:32Z)
- Learning to Denoise Raw Mobile UI Layouts for Improving Datasets at Scale [7.6774030932546315]
We propose a deep learning pipeline for denoising user interface (UI) layouts.
Our pipeline annotates the raw layout by removing incorrect nodes and assigning a semantically meaningful type to each node.
Our deep models achieve high accuracy with F1 scores of 82.7% for detecting layout objects that do not have a valid visual representation.
arXiv Detail & Related papers (2022-01-11T17:52:40Z)
- Perceptual Score: What Data Modalities Does Your Model Perceive? [73.75255606437808]
We introduce the perceptual score, a metric that assesses the degree to which a model relies on the different subsets of the input features.
We find that recent, more accurate multi-modal models for visual question-answering tend to perceive the visual data less than their predecessors.
Using the perceptual score also helps to analyze model biases by decomposing the score into data subset contributions (a rough sketch of the permutation idea appears after this list).
arXiv Detail & Related papers (2021-10-27T12:19:56Z)
- VINS: Visual Search for Mobile User Interface Design [66.28088601689069]
This paper introduces VINS, a visual search framework that takes a UI image as input and retrieves visually similar design examples.
The framework achieves a mean Average Precision of 76.39% for UI detection and high performance in querying similar UI designs.
arXiv Detail & Related papers (2021-02-10T01:46:33Z)
- ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces [12.52699475631247]
We introduce a new pre-trained UI representation model called ActionBert.
Our methodology is designed to leverage visual, linguistic and domain-specific features in user interaction traces to pre-train generic feature representations of UIs and their components.
Experiments show that the proposed ActionBert model outperforms multi-modal baselines across all downstream tasks by up to 15.5%.
arXiv Detail & Related papers (2020-12-22T20:49:52Z)
- User-Guided Domain Adaptation for Rapid Annotation from User Interactions: A Study on Pathological Liver Segmentation [49.96706092808873]
Mask-based annotation of medical images, especially for 3D data, is a bottleneck in developing reliable machine learning models.
We propose the user-guided domain adaptation (UGDA) framework, which uses prediction-based adversarial domain adaptation (PADA) to model the combined distribution of user interactions (UIs) and mask predictions.
We show that UGDA can retain state-of-the-art performance even when seeing only a fraction of the available UIs.
arXiv Detail & Related papers (2020-09-05T04:24:58Z)
- Towards End-to-end Video-based Eye-Tracking [50.0630362419371]
Estimating eye-gaze from images alone is a challenging task due to unobservable person-specific factors.
We propose a novel dataset and accompanying method which aims to explicitly learn these semantic and temporal relationships.
We demonstrate that fusing information from visual stimuli and eye images can achieve performance comparable to figures reported in the literature.
arXiv Detail & Related papers (2020-07-26T12:39:15Z)
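As referenced in the Perceptual Score entry above, the core idea can be sketched as a permutation test: shuffle one modality relative to the rest of the input and measure the relative accuracy drop. The sketch below is an assumption-laden illustration; `model.predict`, the array names, and the normalization are placeholders and not the paper's exact definition or API:

```python
import numpy as np

def visual_perceptual_score(model, images, questions, labels, seed=0):
    """Rough permutation-based estimate of how much a VQA-style model
    relies on its visual input: the relative accuracy drop when images
    are shuffled with respect to their questions. (Placeholder API.)"""
    rng = np.random.default_rng(seed)

    def accuracy(imgs):
        preds = model.predict(imgs, questions)   # assumed model interface
        return float(np.mean(preds == labels))

    acc_full = accuracy(images)
    acc_perm = accuracy(images[rng.permutation(len(images))])
    # Higher values indicate stronger reliance on the visual modality.
    return (acc_full - acc_perm) / max(acc_full, 1e-8)
```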
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.