Unblind Your Apps: Predicting Natural-Language Labels for Mobile GUI
Components by Deep Learning
- URL: http://arxiv.org/abs/2003.00380v2
- Date: Thu, 2 Jul 2020 11:38:28 GMT
- Title: Unblind Your Apps: Predicting Natural-Language Labels for Mobile GUI
Components by Deep Learning
- Authors: Jieshan Chen, Chunyang Chen, Zhenchang Xing, Xiwei Xu, Liming Zhu,
Guoqiang Li, and Jinshui Wang
- Abstract summary: More than 77% apps have issues of missing labels, according to our analysis of 10,408 Android apps.
We develop a deep-learning based model, called LabelDroid, to automatically predict the labels of image-based buttons.
The experimental results show that our model can make accurate predictions and the generated labels are of higher quality than that from real Android developers.
- Score: 21.56849865328527
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: According to the World Health Organization(WHO), it is estimated that
approximately 1.3 billion people live with some forms of vision impairment
globally, of whom 36 million are blind. Due to their disability, engaging these
minority into the society is a challenging problem. The recent rise of smart
mobile phones provides a new solution by enabling blind users' convenient
access to the information and service for understanding the world. Users with
vision impairment can adopt the screen reader embedded in the mobile operating
systems to read the content of each screen within the app, and use gestures to
interact with the phone. However, the prerequisite of using screen readers is
that developers have to add natural-language labels to the image-based
components when they are developing the app. Unfortunately, more than 77% apps
have issues of missing labels, according to our analysis of 10,408 Android
apps. Most of these issues are caused by developers' lack of awareness and
knowledge in considering the minority. And even if developers want to add the
labels to UI components, they may not come up with concise and clear
description as most of them are of no visual issues. To overcome these
challenges, we develop a deep-learning based model, called LabelDroid, to
automatically predict the labels of image-based buttons by learning from
large-scale commercial apps in Google Play. The experimental results show that
our model can make accurate predictions and the generated labels are of higher
quality than that from real Android developers.
Related papers
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity.
We develop a vision-language-action model in digital world, namely ShowUI, which features the following innovations.
ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z) - AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents [50.39555842254652]
We introduce the Android Multi-annotation EXpo (AMEX) to advance research on AI agents in mobile scenarios.
AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, which are annotated at multiple levels.
AMEX includes three levels of annotations: GUI interactive element grounding, GUI screen and element functionality descriptions, and complex natural language instructions.
arXiv Detail & Related papers (2024-07-03T17:59:58Z) - Vision-driven Automated Mobile GUI Testing via Multimodal Large Language Model [27.97964877860671]
This paper proposes a vision-driven automated GUI testing approach to detect non-crash functional bugs with Multimodal Large Language Models.
It begins by extracting GUI text information and aligning it with screenshots to form a vision prompt, enabling MLLM to understand GUI context.
VisionDroid identifies 29 new bugs on Google Play, of which 19 have been confirmed and fixed.
arXiv Detail & Related papers (2024-07-03T11:58:09Z) - Tell Me What's Next: Textual Foresight for Generic UI Representations [65.10591722192609]
We propose Textual Foresight, a novel pretraining objective for learning UI screen representations.
Textual Foresight generates global text descriptions of future UI states given a current UI and local action taken.
We train with our newly constructed mobile app dataset, OpenApp, which results in the first public dataset for app UI representation learning.
arXiv Detail & Related papers (2024-06-12T02:43:19Z) - Improve accessibility for Low Vision and Blind people using Machine Learning and Computer Vision [0.0]
This project explores how machine learning and computer vision could be utilized to improve accessibility for people with visual impairments.
This project will concentrate on building a mobile application that helps blind people to orient in space by receiving audio and haptic feedback.
arXiv Detail & Related papers (2024-03-24T21:19:17Z) - Towards Automated Accessibility Report Generation for Mobile Apps [14.908672785900832]
We propose a system to generate whole app accessibility reports.
It combines varied data collection methods (e.g., app crawling, manual recording) with an existing accessibility scanner.
arXiv Detail & Related papers (2023-09-29T19:05:11Z) - Automated and Context-Aware Repair of Color-Related Accessibility Issues
for Android Apps [28.880881834251227]
We propose Iris, an automated and context-aware repair method to fix color-related accessibility issues for apps.
By leveraging a novel context-aware technique, Iris resolves the optimal colors and a vital phase of attribute-to-repair localization.
Our experiments unveiled that Iris can achieve a 91.38% repair success rate with high effectiveness and efficiency.
arXiv Detail & Related papers (2023-08-17T15:03:11Z) - A Pairwise Dataset for GUI Conversion and Retrieval between Android
Phones and Tablets [24.208087862974033]
Papt dataset is a pairwise dataset for GUI conversion and retrieval between Android phones and tablets.
dataset contains 10,035 phone-tablet GUI page pairs from 5,593 phone-tablet app pairs.
arXiv Detail & Related papers (2023-07-25T03:25:56Z) - VidLanKD: Improving Language Understanding via Video-Distilled Knowledge
Transfer [76.3906723777229]
We present VidLanKD, a video-language knowledge distillation method for improving language understanding.
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models.
arXiv Detail & Related papers (2021-07-06T15:41:32Z) - Fast and Accurate Quantized Camera Scene Detection on Smartphones,
Mobile AI 2021 Challenge: Report [65.91472671013302]
We introduce the first Mobile AI challenge, where the target is to develop quantized deep learning-based camera scene classification solutions.
The proposed solutions are fully compatible with all major mobile AI accelerators and can demonstrate more than 100-200 FPS on the majority of recent smartphone platforms.
arXiv Detail & Related papers (2021-05-17T13:55:38Z) - Skeleton Based Sign Language Recognition Using Whole-body Keypoints [71.97020373520922]
Sign language is used by deaf or speech impaired people to communicate.
Skeleton-based recognition is becoming popular that it can be further ensembled with RGB-D based method to achieve state-of-the-art performance.
Inspired by the recent development of whole-body pose estimation citejin 2020whole, we propose recognizing sign language based on the whole-body key points and features.
arXiv Detail & Related papers (2021-03-16T03:38:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.