Reinforced UI Instruction Grounding: Towards a Generic UI Task
Automation API
- URL: http://arxiv.org/abs/2310.04716v1
- Date: Sat, 7 Oct 2023 07:22:41 GMT
- Title: Reinforced UI Instruction Grounding: Towards a Generic UI Task
Automation API
- Authors: Zhizheng Zhang, Wenxuan Xie, Xiaoyi Zhang, Yan Lu
- Abstract summary: We build a multimodal model to ground natural language instructions in given UI screenshots as a generic UI task automation executor.
To facilitate the exploitation of image-to-text pretrained knowledge, we follow the pixel-to-sequence paradigm.
Our proposed reinforced UI instruction grounding model outperforms the state-of-the-art methods by a clear margin.
- Score: 17.991044940694778
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent popularity of Large Language Models (LLMs) has opened countless
possibilities in automating numerous AI tasks by connecting LLMs to various
domain-specific models or APIs, where LLMs serve as dispatchers while
domain-specific models or APIs are action executors. Despite the vast number
of domain-specific models/APIs, they still struggle to comprehensively cover the
highly diverse automation demands arising in interactions between humans and User
Interfaces (UIs). In this work, we build a multimodal model to ground natural
language instructions in given UI screenshots as a generic UI task automation
executor. This metadata-free grounding model, consisting of a visual encoder
and a language decoder, is first pretrained on well-studied document
understanding tasks and then learns to decode spatial information from UI
screenshots in a promptable way. To facilitate the exploitation of
image-to-text pretrained knowledge, we follow the pixel-to-sequence paradigm to
predict geometric coordinates in a sequence of tokens using a language decoder.
We further propose an innovative Reinforcement Learning (RL) based algorithm to
supervise the tokens in such a sequence jointly with visually semantic metrics,
which effectively strengthens the spatial decoding capability of the
pixel-to-sequence paradigm. Extensive experiments demonstrate that our proposed
reinforced UI instruction grounding model outperforms the state-of-the-art
methods by a clear margin and shows the potential as a generic UI task
automation API.
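
The abstract describes two technical ideas: decoding geometric coordinates as a sequence of discrete tokens (the pixel-to-sequence paradigm) and supervising that sequence jointly with an RL objective driven by visually semantic metrics. The sketch below is a minimal illustration of those ideas, not the authors' implementation: the quantization granularity (1000 bins), the token layout, and the use of IoU as the visually semantic reward are assumptions made here for illustration; the paper's actual tokenization, reward, and policy-gradient details may differ.

```python
# Hedged sketch (not the authors' code): coordinates as discrete tokens, plus an
# IoU-style "visually semantic" reward that a REINFORCE-like objective could use
# to supervise the whole decoded sequence jointly. Bin count, token layout, and
# reward choice are illustrative assumptions.

NUM_BINS = 1000  # assumed quantization granularity for normalized coordinates


def coords_to_tokens(box, img_w, img_h, num_bins=NUM_BINS):
    """Quantize a box (x1, y1, x2, y2) in pixels into discrete location tokens."""
    x1, y1, x2, y2 = box
    norm = [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h]
    return [min(int(v * num_bins), num_bins - 1) for v in norm]


def tokens_to_coords(tokens, img_w, img_h, num_bins=NUM_BINS):
    """Map predicted location tokens back to pixel coordinates (bin centers)."""
    x1, y1, x2, y2 = [(t + 0.5) / num_bins for t in tokens]
    return (x1 * img_w, y1 * img_h, x2 * img_w, y2 * img_h)


def iou(box_a, box_b):
    """Intersection-over-union: one plausible visually semantic metric."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def sequence_reward(pred_tokens, gt_box, img_w, img_h):
    """Reward for a decoded coordinate sequence, shared across its tokens so the
    sequence is supervised jointly rather than token by token."""
    pred_box = tokens_to_coords(pred_tokens, img_w, img_h)
    return iou(pred_box, gt_box)


# Example: a 1920x1080 screenshot with a ground-truth target at (100, 200, 300, 260).
gt = (100, 200, 300, 260)
pred_tokens = coords_to_tokens((110, 205, 290, 255), 1920, 1080)  # stand-in decoder output
print(sequence_reward(pred_tokens, gt, 1920, 1080))  # IoU-based reward in [0, 1]
```

In this framing, the language decoder emits location tokens like any other vocabulary items, and the reward computed on the decoded geometry (rather than per-token cross-entropy alone) is what ties the sequence supervision to visual semantics.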
Related papers
- Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction.
To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism.
Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z)
- CAAP: Context-Aware Action Planning Prompting to Solve Computer Tasks with Front-End UI Only [21.054681757006385]
We propose an agent that perceives its environment solely through screenshot images.
By leveraging the reasoning capability of Large Language Models, we eliminate the need for large-scale human demonstration data.
The agent achieves an average success rate of 94.5% on MiniWoB++ and an average task score of 62.3 on WebShop.
arXiv Detail & Related papers (2024-06-11T05:21:20Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder, and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- MuLan: Multimodal-LLM Agent for Progressive and Interactive Multi-Object Diffusion [81.7514869897233]
We develop a training-free Multimodal-LLM agent (MuLan) that, like a human painter, can progressively generate multi-object images.
MuLan harnesses a large language model (LLM) to decompose a prompt into a sequence of sub-tasks, each generating only one object with stable diffusion.
MuLan also adopts a vision-language model (VLM) to provide feedback on the image generated in each sub-task and to control the diffusion model to re-generate the image if it violates the original prompt.
arXiv Detail & Related papers (2024-02-20T06:14:30Z)
- CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation [61.68049335444254]
Multimodal large language models (MLLMs) have shown remarkable potential as human-like autonomous language agents to interact with real-world environments.
We propose a Comprehensive Cognitive LLM Agent, CoCo-Agent, with two novel approaches: comprehensive environment perception (CEP) and conditional action prediction (CAP).
With our technical design, our agent achieves new state-of-the-art performance on AITW and META-GUI benchmarks, showing promising abilities in realistic scenarios.
arXiv Detail & Related papers (2024-02-19T08:29:03Z)
- ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations [13.939350184164017]
Multimodal Vision-Language Models (VLMs) enable powerful applications from their fused understanding of images and language.
We adapt a recipe for generating paired text-image training data for VLMs to the UI domain by combining existing pixel-based methods with a Large Language Model (LLM).
We generate a dataset of 335K conversational examples paired with UIs that cover Q&A, UI descriptions, and planning, and use it to fine-tune a conversational VLM for UI tasks.
arXiv Detail & Related papers (2023-10-07T16:32:34Z)
- You Only Look at Screens: Multimodal Chain-of-Action Agents [37.118034745972956]
Auto-GUI is a multimodal solution that directly interacts with the interface.
We propose a chain-of-action technique to help the agent decide what action to execute.
We evaluate our approach on a new device-control benchmark, AITW, with 30K unique instructions.
arXiv Detail & Related papers (2023-09-20T16:12:32Z)
- AutoML-GPT: Automatic Machine Learning with GPT [74.30699827690596]
We propose developing task-oriented prompts and automatically utilizing large language models (LLMs) to automate the training pipeline.
We present AutoML-GPT, which employs GPT as the bridge to diverse AI models and dynamically trains models with optimized hyperparameters.
This approach achieves remarkable results in computer vision, natural language processing, and other challenging areas.
arXiv Detail & Related papers (2023-05-04T02:09:43Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
- LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling [102.42424022921243]
Masked Language Modeling (MLM) is used as the common interface for all pre-training and downstream tasks.
Experiments show that this unified framework achieves competitive performance on 14 VidL benchmarks.
arXiv Detail & Related papers (2022-06-14T20:43:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.