Related papers: UITrans: Seamless UI Translation from Android to HarmonyOS

Related papers

GUI-explorer: Autonomous Exploration and Mining of Transition-aware Knowledge for GUI Agent [66.34801160469067]
MLLMs suffer from two key issues: misinterpreting UI components and outdated knowledge.<n>We propose GUI-explorer, a training-free GUI agent that incorporates two fundamental mechanisms.<n>With a task success rate of 53.7% on SPA-Bench and 47.4% on AndroidWorld, GUI-explorer shows significant improvements over SOTA agents.
arXiv Detail & Related papers (2025-05-22T16:01:06Z)
VLM-Fuzz: Vision Language Model Assisted Recursive Depth-first Search Exploration for Effective UI Testing of Android Apps [6.122273281101832]
Testing Android apps effectively requires a systematic exploration of the app's possible states. We propose a novel, automated fuzzing approach called VLM-Fuzz for effective UI testing of Android apps.
arXiv Detail & Related papers (2025-04-16T00:19:31Z)
AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs [54.58905728115257]
We propose the methodname pipeline for automatically annotating UI elements with detailed functionality descriptions at scale. Specifically, we leverage large language models (LLMs) to infer element functionality by comparing the UI content changes before and after simulated interactions with specific UI elements. We construct an methodname-704k dataset using the proposed pipeline, featuring multi-resolution, multi-device screenshots, diverse data domains, and detailed functionality annotations that have never been provided by previous datasets.
arXiv Detail & Related papers (2025-02-04T03:39:59Z)
Falcon-UI: Understanding GUI Before Following User Instructions [57.67308498231232]
We introduce an instruction-free GUI navigation dataset, termed Insight-UI dataset, to enhance model comprehension of GUI environments.<n> Insight-UI dataset is automatically generated from the Common Crawl corpus, simulating various platforms.<n>We develop the GUI agent model Falcon-UI, which is initially pretrained on Insight-UI dataset and subsequently fine-tuned on Android and Web GUI datasets.
arXiv Detail & Related papers (2024-12-12T15:29:36Z)
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
Aguvis is a vision-based framework for autonomous GUI agents.<n>It standardizes cross-platform interactions and incorporates structured reasoning via inner monologue.<n>It achieves state-of-the-art performance across offline and real-world online benchmarks.
arXiv Detail & Related papers (2024-12-05T18:58:26Z)
Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms [48.00193601902457]
Ferret-UI 2 is a multimodal large language model (MLLM) designed for universal UI understanding across a wide range of platforms. Ferret-UI 2 introduces three key innovations: support for multiple platform types, high-resolution perception through adaptive scaling, and advanced task training data generation powered by GPT-4o with set-of-mark visual prompting.
arXiv Detail & Related papers (2024-10-24T17:58:31Z)
A Rule-Based Approach for UI Migration from Android to iOS [11.229343760409044]
We propose a novel approach called GUIMIGRATOR, which enables the cross platform migration of existing Android app UIs to iOS. GuiMIGRATOR extracts and parses Android UI layouts, views, and resources to construct a UI skeleton tree. GuiMIGRATOR generates the final UI code files utilizing target code templates, which are then compiled and validated in the iOS development platform.
arXiv Detail & Related papers (2024-09-25T06:19:54Z)
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents [50.39555842254652]
We introduce the Android Multi-annotation EXpo (AMEX) to advance research on AI agents in mobile scenarios. AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, which are annotated at multiple levels. AMEX includes three levels of annotations: GUI interactive element grounding, GUI screen and element functionality descriptions, and complex natural language instructions.
arXiv Detail & Related papers (2024-07-03T17:59:58Z)
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices [61.48043339441149]
GUI Odyssey consists of 7,735 episodes from 6 mobile devices, spanning 6 types of cross-app tasks, 201 apps, and 1.4K app combos. We developed OdysseyAgent, a multimodal cross-app navigation agent by fine-tuning the Qwen-VL model with a history resampling module.
arXiv Detail & Related papers (2024-06-12T17:44:26Z)
Tell Me What's Next: Textual Foresight for Generic UI Representations [65.10591722192609]
We propose Textual Foresight, a novel pretraining objective for learning UI screen representations. Textual Foresight generates global text descriptions of future UI states given a current UI and local action taken. We train with our newly constructed mobile app dataset, OpenApp, which results in the first public dataset for app UI representation learning.
arXiv Detail & Related papers (2024-06-12T02:43:19Z)
UI Semantic Group Detection: Grouping UI Elements with Similar Semantics in Mobile Graphical User Interface [10.80156450091773]
Existing studies on UI elements grouping mainly focus on a single UI-related software engineering task, and their groups vary in appearance and function. We propose our semantic component groups that pack adjacent text and non-text elements with similar semantics. To recognize semantic component groups on a UI page, we propose a robust, deep learning-based vision detector, UISCGD.
arXiv Detail & Related papers (2024-03-08T01:52:44Z)
Pairwise GUI Dataset Construction Between Android Phones and Tablets [24.208087862974033]
Papt dataset is a pairwise GUI dataset tailored for Android phones and tablets. We propose novel pairwise GUI collection approaches for constructing this dataset.
arXiv Detail & Related papers (2023-10-07T09:30:42Z)
Meta-Transformer: A Unified Framework for Multimodal Learning [105.77219833997962]
Multimodal learning aims to build models that process and relate information from multiple modalities. Despite years of development in this field, it still remains challenging to design a unified network for processing various modalities. We propose a framework, named Meta-Transformer, that leverages a $textbffrozen$ encoder to perform multimodal perception.
arXiv Detail & Related papers (2023-07-20T12:10:29Z)
Android in the Wild: A Large-Scale Dataset for Android Device Control [4.973591165982018]
We present a dataset for device-control research, Android in the Wild (AITW) The dataset contains human demonstrations of device interactions, including the screens and actions, and corresponding natural language instructions. It consists of 715k episodes spanning 30k unique instructions, four versions of Android (v10-13),and eight device types (Pixel 2 XL to Pixel 6) with varying screen resolutions.
arXiv Detail & Related papers (2023-07-19T15:57:24Z)
ReverseORC: Reverse Engineering of Resizable User Interface Layouts with OR-Constraints [47.164878414034234]
ReverseORC is a novel reverse engineering (RE) approach to discover diverse layout types and their dynamic resizing behaviours. It can create specifications that replicate even some non-standard layout managers with complex dynamic layout behaviours. It can be used to detect and fix problems in legacy UIs, extend UIs with enhanced layout behaviours, and support the creation of flexible UI layouts.
arXiv Detail & Related papers (2022-02-23T13:57:25Z)
Learning to Denoise Raw Mobile UI Layouts for Improving Datasets at Scale [7.6774030932546315]
We propose a deep learning pipeline for denoising user interface ( UI) layouts. Our pipeline annotates the raw layout by removing incorrect nodes and assigning a semantically meaningful type to each node. Our deep models achieve high accuracy with F1 scores of 82.7% for detecting layout objects that do not have a valid visual representation.
arXiv Detail & Related papers (2022-01-11T17:52:40Z)
UIBert: Learning Generic Multimodal Representations for UI Understanding [12.931540149350633]
We introduce a transformer-based joint image-text model trained through novel pre-training tasks on large-scale unlabeled UI data. Our key intuition is that the heterogeneous features in a UI are self-aligned, i.e., the image and text features of UI components, are predictive of each other. We propose five pretraining tasks utilizing this self-alignment among different features of a UI component and across various components in the same UI. We evaluate our method on nine real-world downstream UI tasks where UIBert outperforms strong multimodal baselines by up to 9.26% accuracy.
arXiv Detail & Related papers (2021-07-29T03:51:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.