Unblind Your Apps: Predicting Natural-Language Labels for Mobile GUI
Components by Deep Learning
- URL: http://arxiv.org/abs/2003.00380v2
- Date: Thu, 2 Jul 2020 11:38:28 GMT
- Title: Unblind Your Apps: Predicting Natural-Language Labels for Mobile GUI
Components by Deep Learning
- Authors: Jieshan Chen, Chunyang Chen, Zhenchang Xing, Xiwei Xu, Liming Zhu,
Guoqiang Li, and Jinshui Wang
- Abstract summary: More than 77% of apps have issues with missing labels, according to our analysis of 10,408 Android apps.
We develop a deep-learning based model, called LabelDroid, to automatically predict the labels of image-based buttons.
The experimental results show that our model can make accurate predictions and that the generated labels are of higher quality than those written by real Android developers.
- Score: 21.56849865328527
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: According to the World Health Organization (WHO), approximately
1.3 billion people live with some form of vision impairment globally, of whom
36 million are blind. Because of this disability, engaging this group in
society is a challenging problem. The recent rise of smartphones provides a
new solution by enabling blind users' convenient access to information and
services for understanding the world. Users with vision impairment can adopt
the screen reader embedded in the mobile operating system to read the content
of each screen within an app, and use gestures to interact with the phone.
However, the prerequisite for using screen readers is that developers add
natural-language labels to image-based components when developing the app.
Unfortunately, more than 77% of apps have issues with missing labels,
according to our analysis of 10,408 Android apps. Most of these issues are
caused by developers' lack of awareness of and knowledge about the needs of
this user group. And even if developers want to add labels to UI components,
they may not come up with concise and clear descriptions, as most of them
have no visual impairments themselves. To overcome these challenges, we
develop a deep-learning based model, called LabelDroid, to automatically
predict the labels of image-based buttons by learning from large-scale
commercial apps in Google Play. The experimental results show that our model
can make accurate predictions and that the generated labels are of higher
quality than those written by real Android developers.
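For concreteness, the "natural-language label" referred to here is the contentDescription that Android screen readers such as TalkBack announce for image-based widgets. The minimal Kotlin sketch below (the layout, view id, and label text are hypothetical, not taken from the paper) shows where such a label is set; this is exactly the text LabelDroid aims to generate when developers leave it missing.

```kotlin
import android.os.Bundle
import android.widget.ImageButton
import androidx.appcompat.app.AppCompatActivity

// Minimal sketch, assuming a hypothetical activity_player layout with an
// image-only "play" button. Without a contentDescription, TalkBack can only
// announce something generic like "unlabelled button"; with one, it reads the
// label aloud.
class PlayerActivity : AppCompatActivity() {

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_player)                    // hypothetical layout

        val playButton = findViewById<ImageButton>(R.id.btn_play)   // hypothetical id

        // The natural-language label a model like LabelDroid would predict from
        // the button's image; it can also be set statically in XML via
        // android:contentDescription="Play music".
        playButton.contentDescription = "Play music"
    }
}
```

In the study's terms, an image-based button whose contentDescription is null or empty counts as a missing-label issue; the analysis above found at least one such issue in more than 77% of the examined apps.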
Related papers
- AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents [50.39555842254652]
We introduce the Android Multi-annotation EXpo (AMEX) to advance research on AI agents in mobile scenarios.
AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, which are annotated at multiple levels.
AMEX includes three levels of annotations: GUI interactive element grounding, GUI screen and element functionality descriptions, and complex natural language instructions.
arXiv Detail & Related papers (2024-07-03T17:59:58Z)
- Vision-driven Automated Mobile GUI Testing via Multimodal Large Language Model [27.97964877860671]
This paper proposes VisionDroid, a vision-driven automated GUI testing approach that detects non-crash functional bugs with Multimodal Large Language Models (MLLMs).
It begins by extracting GUI text information and aligning it with screenshots to form a vision prompt, enabling the MLLM to understand the GUI context.
VisionDroid identifies 29 new bugs on Google Play, of which 19 have been confirmed and fixed.
arXiv Detail & Related papers (2024-07-03T11:58:09Z)
- Tell Me What's Next: Textual Foresight for Generic UI Representations [65.10591722192609]
We propose Textual Foresight, a novel pretraining objective for learning UI screen representations.
Textual Foresight generates global text descriptions of future UI states given a current UI and local action taken.
We train on our newly constructed mobile app dataset, OpenApp, which is the first public dataset for app UI representation learning.
arXiv Detail & Related papers (2024-06-12T02:43:19Z)
- Improve accessibility for Low Vision and Blind people using Machine Learning and Computer Vision [0.0]
This project explores how machine learning and computer vision can be utilized to improve accessibility for people with visual impairments.
It concentrates on building a mobile application that helps blind people orient themselves in space through audio and haptic feedback.
arXiv Detail & Related papers (2024-03-24T21:19:17Z)
- Towards Automated Accessibility Report Generation for Mobile Apps [14.908672785900832]
We propose a system to generate whole-app accessibility reports.
It combines varied data collection methods (e.g., app crawling, manual recording) with an existing accessibility scanner.
arXiv Detail & Related papers (2023-09-29T19:05:11Z)
- Automated and Context-Aware Repair of Color-Related Accessibility Issues for Android Apps [28.880881834251227]
We propose Iris, an automated and context-aware repair method that fixes color-related accessibility issues in apps.
It leverages a novel context-aware technique to determine optimal replacement colors, together with a vital attribute-to-repair localization phase (a sketch of the contrast check such repairs target follows this entry).
Our experiments show that Iris achieves a 91.38% repair success rate with high effectiveness and efficiency.
arXiv Detail & Related papers (2023-08-17T15:03:11Z)
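As a concrete illustration of what counts as a color-related accessibility issue, the sketch below implements the standard WCAG 2.x contrast-ratio check that accessibility scanners use to flag low-contrast text. It is not Iris's repair algorithm (whose color selection and attribute-to-repair localization are more involved); it only shows the criterion such repairs target.

```kotlin
import kotlin.math.pow

// Minimal sketch of the WCAG 2.x contrast check; not Iris's repair pipeline.
// Relative luminance of a packed ARGB color, per the WCAG definition.
fun relativeLuminance(color: Int): Double {
    fun channel(value: Int): Double {
        val c = value / 255.0
        return if (c <= 0.03928) c / 12.92 else ((c + 0.055) / 1.055).pow(2.4)
    }
    val r = channel((color shr 16) and 0xFF)
    val g = channel((color shr 8) and 0xFF)
    val b = channel(color and 0xFF)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b
}

// Contrast ratio between two colors; WCAG AA requires at least 4.5:1 for
// normal-size text, so anything below that is a candidate for repair.
fun contrastRatio(foreground: Int, background: Int): Double {
    val lighter = maxOf(relativeLuminance(foreground), relativeLuminance(background))
    val darker = minOf(relativeLuminance(foreground), relativeLuminance(background))
    return (lighter + 0.05) / (darker + 0.05)
}

fun main() {
    // Light gray text on a white background: roughly 2.3:1, which fails AA.
    val ratio = contrastRatio(0xFFAAAAAA.toInt(), 0xFFFFFFFF.toInt())
    println("Contrast %.2f:1, passes AA: %b".format(ratio, ratio >= 4.5))
}
```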
- A Pairwise Dataset for GUI Conversion and Retrieval between Android Phones and Tablets [24.208087862974033]
The Papt dataset is a pairwise dataset for GUI conversion and retrieval between Android phones and tablets.
It contains 10,035 phone-tablet GUI page pairs from 5,593 phone-tablet app pairs.
arXiv Detail & Related papers (2023-07-25T03:25:56Z)
- VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer [76.3906723777229]
We present VidLanKD, a video-language knowledge distillation method for improving language understanding.
We train a multi-modal teacher model on a video-text dataset and then transfer its knowledge to a student language model using a text dataset (a minimal distillation-loss sketch follows this entry).
In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models.
arXiv Detail & Related papers (2021-07-06T15:41:32Z)
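For readers unfamiliar with the mechanism, the sketch below shows the generic soft-label knowledge-distillation loss (temperature-scaled KL divergence between teacher and student distributions). It is a minimal, framework-free illustration of the transfer step described above, not VidLanKD's actual multi-modal training objective.

```kotlin
import kotlin.math.exp
import kotlin.math.ln

// Minimal sketch of the classic soft-label distillation loss; VidLanKD's real
// objective distills from a video-text teacher and is richer than this.
fun softmax(logits: DoubleArray, temperature: Double): DoubleArray {
    val scaled = logits.map { it / temperature }
    val maxLogit = scaled.maxOrNull() ?: 0.0
    val exps = scaled.map { exp(it - maxLogit) }
    val sum = exps.sum()
    return exps.map { it / sum }.toDoubleArray()
}

// Temperature-scaled KL(teacher || student), the usual distillation term that
// pushes the student's predictive distribution toward the teacher's.
fun distillationLoss(
    teacherLogits: DoubleArray,
    studentLogits: DoubleArray,
    temperature: Double = 2.0
): Double {
    val teacher = softmax(teacherLogits, temperature)
    val student = softmax(studentLogits, temperature)
    var kl = 0.0
    for (i in teacher.indices) kl += teacher[i] * (ln(teacher[i]) - ln(student[i]))
    return temperature * temperature * kl
}

fun main() {
    val teacherLogits = doubleArrayOf(3.0, 1.0, 0.2)  // toy 3-class example
    val studentLogits = doubleArrayOf(2.0, 1.5, 0.5)
    println("Distillation loss: %.4f".format(distillationLoss(teacherLogits, studentLogits)))
}
```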
- Fast and Accurate Quantized Camera Scene Detection on Smartphones, Mobile AI 2021 Challenge: Report [65.91472671013302]
We introduce the first Mobile AI challenge, where the target is to develop quantized deep learning-based camera scene classification solutions.
The proposed solutions are fully compatible with all major mobile AI accelerators and can demonstrate more than 100-200 FPS on the majority of recent smartphone platforms.
arXiv Detail & Related papers (2021-05-17T13:55:38Z)
- Skeleton Based Sign Language Recognition Using Whole-body Keypoints [71.97020373520922]
Sign language is used by deaf or speech-impaired people to communicate.
Skeleton-based recognition is becoming popular because it can be further ensembled with RGB-D based methods to achieve state-of-the-art performance.
Inspired by the recent development of whole-body pose estimation [jin2020whole], we propose recognizing sign language based on whole-body keypoints and features.
arXiv Detail & Related papers (2021-03-16T03:38:17Z) - Emerging App Issue Identification via Online Joint Sentiment-Topic
Tracing [66.57888248681303]
We propose a novel emerging issue detection approach named MERIT.
Based on the AOBST model, we infer the topics negatively reflected in user reviews for one app version.
Experiments on popular apps from Google Play and Apple's App Store demonstrate the effectiveness of MERIT.
arXiv Detail & Related papers (2020-08-23T06:34:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.