Android in the Zoo: Chain-of-Action-Thought for GUI Agents
- URL: http://arxiv.org/abs/2403.02713v2
- Date: Sat, 13 Jul 2024 02:12:30 GMT
- Title: Android in the Zoo: Chain-of-Action-Thought for GUI Agents
- Authors: Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, Duyu Tang
- Abstract summary: This work presents Chain-of-Action-Thought (dubbed CoAT), which takes into account the description of the previous actions, the current screen, and, more importantly, the action thinking about which actions should be performed and the outcomes that the chosen action leads to.
We demonstrate that, in a zero-shot setting with three off-the-shelf LMMs, CoAT significantly improves action prediction compared to previously proposed context modeling.
To further facilitate research in this line, we construct a dataset, Android-In-The-Zoo (AitZ), which contains 18,643 screen-action pairs together with chain-of-action-thought annotations.
- Score: 38.07337874116759
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models (LLMs) have led to a surge of autonomous GUI agents for smartphones, which complete tasks triggered by natural language by predicting a sequence of API actions. Even though such tasks rely heavily on past actions and visual observations, existing studies typically make little use of the semantic information carried by intermediate screenshots and screen operations. To address this, this work presents Chain-of-Action-Thought (dubbed CoAT), which takes into account the description of the previous actions, the current screen, and, more importantly, the action thinking about which actions should be performed and the outcomes that the chosen action leads to. We demonstrate that, in a zero-shot setting with three off-the-shelf LMMs, CoAT significantly improves action prediction compared to previously proposed context modeling. To further facilitate research in this line, we construct a dataset, Android-In-The-Zoo (AitZ), which contains 18,643 screen-action pairs together with chain-of-action-thought annotations. Experiments show that fine-tuning a 1B model (i.e., AUTO-UI-base) on our AitZ dataset achieves on-par performance with CogAgent-Chat-18B.
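As a rough illustration of the context modeling described in the abstract, the sketch below assembles the four CoAT ingredients (previous actions, current screen, action thinking, expected outcome) into a text prompt for an off-the-shelf LMM. The names `ScreenStep` and `build_coat_prompt`, as well as the prompt wording, are illustrative assumptions and not the paper's actual implementation.

```python
# Minimal sketch of a Chain-of-Action-Thought (CoAT) style prompt builder.
# All names and the prompt template are hypothetical, not from the paper.

from dataclasses import dataclass
from typing import List


@dataclass
class ScreenStep:
    """One step of an episode: a screenshot plus the action taken on it."""
    screenshot_path: str      # screenshot for this step (passed to the LMM separately)
    action_description: str   # natural-language description of the action taken


def build_coat_prompt(goal: str, history: List[ScreenStep],
                      screen_description: str) -> str:
    """Compose the CoAT ingredients into a single text prompt:
    prior actions, the current screen, action thinking, and expected outcome."""
    previous_actions = "\n".join(
        f"{i + 1}. {step.action_description}" for i, step in enumerate(history)
    ) or "None yet."

    return (
        f"Task goal: {goal}\n"
        f"Previous actions:\n{previous_actions}\n"
        f"Current screen: {screen_description}\n"
        "Action thinking: reason about which on-screen element to act on "
        "and what outcome the chosen action should lead to.\n"
        "Next action:"
    )


if __name__ == "__main__":
    history = [ScreenStep("step0.png", "Opened the Settings app")]
    print(build_coat_prompt(
        goal="Turn on Wi-Fi",
        history=history,
        screen_description="Settings home page with a 'Network & internet' entry",
    ))
```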
Related papers
- Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction.
To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism.
Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z)
- TreeSBA: Tree-Transformer for Self-Supervised Sequential Brick Assembly [51.29305265324916]
We propose a class-agnostic tree-transformer framework to predict the sequential assembly actions from input multi-view images.
A major challenge of the sequential brick assembly task is that the step-wise action labels are costly and tedious to obtain in practice.
We mitigate this problem by leveraging synthetic-to-real transfer learning.
arXiv Detail & Related papers (2024-07-22T14:05:27Z)
- Semantically Guided Representation Learning For Action Anticipation [9.836788915947924]
We propose the novel Semantically Guided Representation Learning (S-GEAR) framework.
S-GEAR learns visual action prototypes and leverages language models to structure their relationship, inducing semanticity.
We observe that S-GEAR effectively transfers the geometric associations between actions from language to visual prototypes.
arXiv Detail & Related papers (2024-07-02T14:44:01Z)
- CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation [61.68049335444254]
Multimodal large language models (MLLMs) have shown remarkable potential as human-like autonomous language agents to interact with real-world environments.
We propose a Comprehensive Cognitive LLM Agent, CoCo-Agent, with two novel approaches: comprehensive environment perception (CEP) and conditional action prediction (CAP).
With our technical design, our agent achieves new state-of-the-art performance on AITW and META-GUI benchmarks, showing promising abilities in realistic scenarios.
arXiv Detail & Related papers (2024-02-19T08:29:03Z)
- JOADAA: joint online action detection and action anticipation [2.7792814152937027]
Action anticipation involves forecasting future actions by connecting past events to future ones.
Online action detection is the task of predicting actions in a streaming manner.
By combining action anticipation and online action detection, our approach can cover the missing dependencies of future information.
arXiv Detail & Related papers (2023-09-12T11:17:25Z)
- Improving Vision-and-Language Navigation by Generating Future-View Image Semantics [96.8435716885159]
Vision-and-Language Navigation (VLN) is the task that requires an agent to navigate through the environment based on natural language instructions.
We propose three proxy tasks during the agent's in-domain pre-training: Masked Panorama Modeling (MPM), Masked Trajectory Modeling (MTM), and Action Prediction with Image Generation (APIG).
We then fine-tune the agent on the VLN task with an auxiliary loss that minimizes the difference between the view semantics generated by the agent and the ground truth view semantics of the next step.
arXiv Detail & Related papers (2023-04-11T00:36:02Z) - SimOn: A Simple Framework for Online Temporal Action Localization [51.27476730635852]
We propose a framework, termed SimOn, that learns to predict action instances using the popular Transformer architecture.
Experimental results on the THUMOS14 and ActivityNet1.3 datasets show that our model remarkably outperforms the previous methods.
arXiv Detail & Related papers (2022-11-08T04:50:54Z) - End-to-end Temporal Action Detection with Transformer [86.80289146697788]
Temporal action detection (TAD) aims to determine the semantic label and the boundaries of every action instance in an untrimmed video.
Here, we construct an end-to-end framework for TAD upon Transformer, termed TadTR.
Our method achieves state-of-the-art performance on HACS Segments and THUMOS14 and competitive performance on ActivityNet-1.3.
arXiv Detail & Related papers (2021-06-18T17:58:34Z)
- Look Wide and Interpret Twice: Improving Performance on Interactive Instruction-following Tasks [29.671268927569063]
Recent studies have tackled the problem using ALFRED, a well-designed dataset for the task.
This paper proposes a new method, which outperforms the previous methods by a large margin.
arXiv Detail & Related papers (2021-06-01T16:06:09Z)