GhostUI: Unveiling Hidden Interactions in Mobile UI
- URL: http://arxiv.org/abs/2601.19258v1
- Date: Tue, 27 Jan 2026 06:40:29 GMT
- Title: GhostUI: Unveiling Hidden Interactions in Mobile UI
- Authors: Minkyu Kweon, Seokhyeon Park, Soohyun Lee, You Been Lee, Jeongmin Rhee, Jinwook Seo
- Abstract summary: GhostUI is a new dataset designed to enable the detection of hidden interactions in mobile applications. GhostUI provides before-and-after screenshots, simplified view hierarchies, gesture metadata, and task descriptions.
- Score: 12.023496228003337
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern mobile applications rely on hidden interactions--gestures without visual cues like long presses and swipes--to provide functionality without cluttering interfaces. While experienced users may discover these interactions through prior use or onboarding tutorials, their implicit nature makes them difficult for most users to uncover. Similarly, mobile agents--systems designed to automate tasks on mobile user interfaces, powered by vision language models (VLMs)--struggle to detect veiled interactions or determine actions for completing tasks. To address this challenge, we present GhostUI, a new dataset designed to enable the detection of hidden interactions in mobile applications. GhostUI provides before-and-after screenshots, simplified view hierarchies, gesture metadata, and task descriptions, allowing VLMs to better recognize concealed gestures and anticipate post-interaction states. Quantitative evaluations with VLMs show that models fine-tuned on GhostUI outperform baseline VLMs, particularly in predicting hidden interactions and inferring post-interaction screens, underscoring GhostUI's potential as a foundation for advancing mobile task automation.
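The abstract names the four field groups each GhostUI example bundles: before-and-after screenshots, a simplified view hierarchy, gesture metadata, and a task description. A minimal Python sketch of one such record, and of how it might be framed as a VLM fine-tuning example, is given below; the GhostUISample type, its field names, and to_training_prompt are illustrative assumptions, not the dataset's published schema.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative sketch only: field names, types, and the gesture vocabulary
# are assumptions inferred from the abstract, not GhostUI's actual schema.
@dataclass
class GhostUISample:
    before_screenshot: str          # path to the screen captured before the gesture
    after_screenshot: str           # path to the resulting (post-interaction) screen
    view_hierarchy: dict            # simplified view hierarchy of the "before" screen
    gesture_type: str               # e.g. "long_press" or "swipe" (the hidden interaction)
    gesture_target: Optional[dict]  # coordinates or element the gesture acts on
    task_description: str           # natural-language description of the task

# Hypothetical usage: frame one sample as a fine-tuning example that asks the
# model to predict the hidden gesture and anticipate the post-interaction screen.
def to_training_prompt(sample: GhostUISample) -> dict:
    return {
        "images": [sample.before_screenshot],
        "prompt": (
            f"Task: {sample.task_description}\n"
            f"View hierarchy: {sample.view_hierarchy}\n"
            "Which hidden gesture completes the task, and what will the screen show afterwards?"
        ),
        "target": {
            "gesture": sample.gesture_type,
            "post_interaction_screen": sample.after_screenshot,
        },
    }
```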
Related papers
- PAL-UI: Planning with Active Look-back for Vision-Based GUI Agents [151.86841216364294]
We propose PAL-UI (Planning with Active Look-back), a novel framework that enables GUI agents to adaptively retrieve past observations when required. PAL-UI combines a dual-level summarization agent, capturing both observation-level cues and action-level outcomes, with a dedicated retrieval tool.
arXiv Detail & Related papers (2025-10-01T01:48:39Z) - Generative Interfaces for Language Models [70.25765232527762]
We propose a paradigm in which large language models (LLMs) respond to user queries by proactively generating user interfaces (UIs). Our framework leverages structured interface-specific representations and iterative refinements to translate user queries into task-specific UIs. Results show that generative interfaces consistently outperform conversational ones, with up to a 72% improvement in human preference.
arXiv Detail & Related papers (2025-08-26T17:43:20Z) - Dynamic Scoring with Enhanced Semantics for Training-Free Human-Object Interaction Detection [51.52749744031413]
Human-Object Interaction (HOI) detection aims to identify humans and objects within images and interpret their interactions. Existing HOI methods rely heavily on large datasets with manual annotations to learn interactions from visual cues. We propose a novel training-free HOI detection framework for Dynamic Scoring with enhanced semantics.
arXiv Detail & Related papers (2025-07-23T12:30:19Z) - Creating General User Models from Computer Use [53.59999173952482]
This paper presents an architecture for a general user model (GUM) that learns about you by observing any interaction you have with your computer. The GUM takes as input any unstructured observation of a user (e.g., device screenshots) and constructs confidence-weighted propositions that capture user knowledge and preferences.
arXiv Detail & Related papers (2025-05-16T04:00:31Z) - Advancing Mobile UI Testing by Learning Screen Usage Semantics [0.42303492200814446]
This research seeks to enhance automated UI testing techniques by learning the screen usage semantics of mobile apps. It also improves the usability of a mobile app's interface by identifying and mitigating UI design issues.
arXiv Detail & Related papers (2025-05-15T01:40:43Z) - Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks [85.48034185086169]
Mobile-Agent-E is a hierarchical multi-agent framework capable of self-evolution through past experience. Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches.
arXiv Detail & Related papers (2025-01-20T20:35:46Z) - MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding [37.15649883702765]
We propose MobileVLM, which includes two additional pre-training stages to enhance both intra- and inter-UI understanding.
To address the lack of mobile pre-training data, we built a large Chinese mobile dataset, Mobile3M, from scratch, which contains 3 million UI pages.
Experimental results show MobileVLM excels on both our test set and public mobile benchmarks, outperforming existing VLMs.
arXiv Detail & Related papers (2024-09-23T08:47:54Z) - Training a Vision Language Model as Smartphone Assistant [1.3654846342364308]
We present a visual language model (VLM) that can fulfill diverse tasks on mobile devices.
Our model functions by interacting solely with the user interface (UI).
Unlike previous methods, our model operates not only on a single screen image but on vision-language sentences created from sequences of past screenshots.
arXiv Detail & Related papers (2024-04-12T18:28:44Z) - Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction [28.53259866617677]
We introduce Mobile-Env, a comprehensive toolkit tailored for creating GUI benchmarks in the Android mobile environment.
We collect an open-world task set across various real-world apps and a fixed-world set, WikiHow, which captures a significant amount of dynamic online content.
Our findings reveal that even advanced models struggle with tasks that are relatively simple for humans.
arXiv Detail & Related papers (2023-05-14T12:31:03Z) - Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus [9.401663915424008]
We propose a vision-language model that only takes the screenshot of the UI and a region of interest on the screen as the input.
Our experiments show that our model obtains SoTA results on several representative UI tasks and outperforms previous methods.
arXiv Detail & Related papers (2022-09-29T16:45:43Z) - Enabling Conversational Interaction with Mobile UI using Large Language Models [15.907868408556885]
To perform diverse UI tasks with natural language, developers typically need to create separate datasets and models for each specific task.
This paper investigates the feasibility of enabling versatile conversational interactions with mobile UIs using a single language model.
arXiv Detail & Related papers (2022-09-18T20:58:39Z) - First Contact: Unsupervised Human-Machine Co-Adaptation via Mutual Information Maximization [112.40598205054994]
We formalize this idea as a completely unsupervised objective for optimizing interfaces.
We conduct an observational study on 540K examples of users operating various keyboard and eye gaze interfaces for typing, controlling simulated robots, and playing video games.
The results show that our mutual information scores are predictive of the ground-truth task completion metrics in a variety of domains.
arXiv Detail & Related papers (2022-05-24T21:57:18Z)