Multimodal Icon Annotation For Mobile Applications
- URL: http://arxiv.org/abs/2107.04452v1
- Date: Fri, 9 Jul 2021 13:57:37 GMT
- Title: Multimodal Icon Annotation For Mobile Applications
- Authors: Xiaoxue Zang, Ying Xu, Jindong Chen
- Abstract summary: We propose a novel deep learning based multi-modal approach that combines the benefits of both pixel and view hierarchy features.
To demonstrate the utility of our approach, we create a high quality UI dataset by manually annotating the 29 most commonly used icons in Rico.
- Score: 11.342641993269693
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Annotating user interfaces (UIs), which involves localizing and classifying meaningful UI elements on a screen, is a critical step for many mobile applications such as screen readers and voice control of devices. Annotating object icons, such as menu, search, and arrow backward, is especially challenging due to the lack of explicit labels on screens, their similarity to pictures, and their diverse shapes. Existing studies use either view hierarchy or pixel based methods to tackle the task. Pixel based approaches are more popular because view hierarchy features on mobile platforms are often incomplete or inaccurate; however, they leave out the instructive information in the view hierarchy, such as resource-ids and content descriptions. We propose a novel deep learning based multi-modal approach that combines the benefits of both pixel and view hierarchy features and leverages state-of-the-art object detection techniques. To demonstrate the utility of our approach, we create a high quality UI dataset by manually annotating the 29 most commonly used icons in Rico, a large scale mobile design dataset consisting of 72k UI screenshots. The experimental results indicate the effectiveness of our multi-modal approach. Our model outperforms not only a widely used object classification baseline but also pixel based object detection models. Our study sheds light on how to combine view hierarchy with pixel features for annotating UI elements.
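A minimal sketch of the core idea, fusing pixel features of a UI element with an embedding of its view hierarchy text attributes (resource-id, content description), is shown below. This is a hypothetical PyTorch example, not the paper's architecture: the class, the dimensions, the ResNet-18 backbone, and the token-hashing helper are assumptions for illustration, and the paper instead builds the fusion into an object detection pipeline over full screenshots.

# Hypothetical sketch (not the paper's model): classify a cropped UI element
# into one of the 29 icon classes by fusing pixel features with text features
# derived from its view hierarchy attributes (resource-id, content description).
import torch
import torch.nn as nn
import torchvision.models as models

class MultimodalIconClassifier(nn.Module):
    def __init__(self, num_classes=29, vocab_size=10000, text_dim=128):
        super().__init__()
        # Pixel branch: a standard CNN backbone over the cropped element image.
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()  # expose the 512-d pooled features
        self.pixel_encoder = backbone
        # View hierarchy branch: average embeddings of hashed tokens taken
        # from the element's resource-id / content-description strings.
        self.text_embedding = nn.EmbeddingBag(vocab_size, text_dim, mode="mean")
        # Fusion head over the concatenated pixel + text features.
        self.classifier = nn.Sequential(
            nn.Linear(512 + text_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, crops, token_ids, offsets):
        pixel_feat = self.pixel_encoder(crops)               # (B, 512)
        text_feat = self.text_embedding(token_ids, offsets)  # (B, text_dim)
        fused = torch.cat([pixel_feat, text_feat], dim=1)
        return self.classifier(fused)                        # (B, num_classes) logits

def hash_tokens(attribute_strings, vocab_size=10000):
    # Hash whitespace/underscore/slash-split tokens from view hierarchy strings
    # into a fixed vocabulary (an assumption; any tokenizer could be used).
    flat, offsets = [], []
    for s in attribute_strings:
        offsets.append(len(flat))
        tokens = s.replace("_", " ").replace("/", " ").lower().split()
        flat.extend(hash(t) % vocab_size for t in tokens)
    return torch.tensor(flat, dtype=torch.long), torch.tensor(offsets, dtype=torch.long)

The point this illustrates is complementarity: the text branch can supply signals the pixels lack (for example a resource-id like "ic_search"), while the pixel branch covers elements whose view hierarchy entries are missing or wrong.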
Related papers
- OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding [112.87441334765693]
OMG-LLaVA is a new framework combining powerful pixel-level vision understanding with reasoning abilities.
It can accept various visual and text prompts for flexible user interaction.
OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model.
arXiv Detail & Related papers (2024-06-27T17:59:01Z)
- Deep Models for Multi-View 3D Object Recognition: A Review [16.500711021549947]
Multi-view 3D representations for object recognition have thus far demonstrated the most promising results for achieving state-of-the-art performance.
This review paper comprehensively covers recent progress in multi-view 3D object recognition methods for 3D classification and retrieval tasks.
arXiv Detail & Related papers (2024-04-23T16:54:31Z)
- Computer User Interface Understanding. A New Dataset and a Learning Framework [2.4473568032515147]
We introduce the harder task of computer UI understanding.
We present a dataset of videos in which a user performs a sequence of actions, where each frame shows the desktop contents at that point in time.
We also present a framework that is composed of a synthetic sample generation pipeline to augment the dataset with relevant characteristics.
arXiv Detail & Related papers (2024-03-15T10:26:52Z)
- Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL).
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z)
- Towards Better Semantic Understanding of Mobile Interfaces [7.756895821262432]
We release a human-annotated dataset with approximately 500k unique annotations aimed at increasing the understanding of the functionality of UI elements.
This dataset augments images and view hierarchies from RICO, a large dataset of mobile UIs.
We also release models using image-only and multimodal inputs; we experiment with various architectures and study the benefits of using multimodal inputs on the new dataset.
arXiv Detail & Related papers (2022-10-06T03:48:54Z)
- Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus [9.401663915424008]
We propose a vision-language model that only takes the screenshot of the UI and a region of interest on the screen as the input.
Our experiments show that our model obtains SoTA results on several representative UI tasks and outperforms previous methods.
arXiv Detail & Related papers (2022-09-29T16:45:43Z)
- Multi-level Second-order Few-shot Learning [111.0648869396828]
We propose a Multi-level Second-order (MlSo) few-shot learning network for supervised or unsupervised few-shot image classification and few-shot action recognition.
We leverage so-called power-normalized second-order base learner streams combined with features that express multiple levels of visual abstraction.
We demonstrate respectable results on standard datasets such as Omniglot, mini-ImageNet, tiered-ImageNet, Open MIC, fine-grained datasets such as CUB Birds, Stanford Dogs and Cars, and action recognition datasets such as HMDB51, UCF101, and mini-MIT.
arXiv Detail & Related papers (2022-01-15T19:49:00Z)
- Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics.
We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention.
We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z)
- An Automatic Image Content Retrieval Method for better Mobile Device Display User Experiences [91.3755431537592]
A new mobile application for image content retrieval and classification for mobile device display is proposed.
The application was run on thousands of pictures and showed encouraging results towards a better user visual experience with mobile displays.
arXiv Detail & Related papers (2021-08-26T23:44:34Z)
- A Simple and Effective Use of Object-Centric Images for Long-Tailed Object Detection [56.82077636126353]
We take advantage of object-centric images to improve object detection in scene-centric images.
We present a simple yet surprisingly effective framework to do so.
Our approach can improve the object detection (and instance segmentation) accuracy of rare objects by a relative 50% (and 33%).
arXiv Detail & Related papers (2021-02-17T17:27:21Z)
- ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces [12.52699475631247]
We introduce a new pre-trained UI representation model called ActionBert.
Our methodology is designed to leverage visual, linguistic and domain-specific features in user interaction traces to pre-train generic feature representations of UIs and their components.
Experiments show that the proposed ActionBert model outperforms multi-modal baselines across all downstream tasks by up to 15.5%.
arXiv Detail & Related papers (2020-12-22T20:49:52Z)