SWIRL: A Staged Workflow for Interleaved Reinforcement Learning in Mobile GUI Control
- URL: http://arxiv.org/abs/2508.20018v1
- Date: Wed, 27 Aug 2025 16:27:19 GMT
- Title: SWIRL: A Staged Workflow for Interleaved Reinforcement Learning in Mobile GUI Control
- Authors: Quanfeng Lu, Zhantao Ma, Shuai Zhong, Jin Wang, Dahai Yu, Michael K. Ng, Ping Luo
- Abstract summary: We introduce SWIRL, a staged workflow for interleaved reinforcement learning designed for multi-agent systems. SWIRL reformulates MARL into a sequence of single-agent reinforcement learning tasks, updating one agent at a time while keeping the others fixed. In application to mobile GUI control, SWIRL instantiates a Navigator that converts language and screen context into structured plans, and an Interactor that grounds these plans into executable atomic actions.
- Score: 38.81034547191083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid advancement of large vision language models (LVLMs) and agent systems has heightened interest in mobile GUI agents that can reliably translate natural language into interface operations. Existing single-agent approaches, however, remain limited by structural constraints. Although multi-agent systems naturally decouple different competencies, recent progress in multi-agent reinforcement learning (MARL) has often been hindered by inefficiency and remains incompatible with current LVLM architectures. To address these challenges, we introduce SWIRL, a staged workflow for interleaved reinforcement learning designed for multi-agent systems. SWIRL reformulates MARL into a sequence of single-agent reinforcement learning tasks, updating one agent at a time while keeping the others fixed. This formulation enables stable training and promotes efficient coordination across agents. Theoretically, we provide a stepwise safety bound, a cross-round monotonic improvement theorem, and convergence guarantees on return, ensuring robust and principled optimization. In application to mobile GUI control, SWIRL instantiates a Navigator that converts language and screen context into structured plans, and an Interactor that grounds these plans into executable atomic actions. Extensive experiments demonstrate superior performance on both high-level and low-level GUI benchmarks. Beyond GUI tasks, SWIRL also demonstrates strong capability in multi-agent mathematical reasoning, underscoring its potential as a general framework for developing efficient and robust multi-agent systems.
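The interleaved update described in the abstract (update one agent at a time while keeping the others fixed) can be illustrated with a toy sketch. This is not the SWIRL implementation: the scalar "agents", the quadratic `joint_return`, the finite-difference gradient ascent, and every function name below are assumptions made purely for illustration, standing in for LVLM policies trained with reinforcement learning.

```python
from typing import Callable, List, Tuple


def joint_return(params: List[float]) -> float:
    """Hypothetical shared return for two agents (e.g. Navigator, Interactor).

    A concave quadratic with a coupling term, so each agent's best
    parameter depends on the other agent's current parameter.
    """
    navigator, interactor = params
    return (
        -(navigator - 1.0) ** 2
        - (interactor - 2.0) ** 2
        - 0.5 * (navigator - interactor) ** 2
    )


def update_one_agent(
    params: List[float],
    agent: int,
    objective: Callable[[List[float]], float],
    lr: float = 0.1,
    steps: int = 20,
    eps: float = 1e-5,
) -> List[float]:
    """Improve only params[agent]; all other agents stay frozen.

    Uses central finite-difference gradient ascent as a stand-in for a
    single-agent RL update (an assumption, not the paper's optimizer).
    """
    params = list(params)
    for _ in range(steps):
        up = list(params)
        up[agent] += eps
        down = list(params)
        down[agent] -= eps
        grad = (objective(up) - objective(down)) / (2 * eps)
        params[agent] += lr * grad
    return params


def interleaved_training(
    params: List[float],
    objective: Callable[[List[float]], float],
    rounds: int = 10,
) -> Tuple[List[float], List[float]]:
    """Run several rounds; each round updates the agents one after another."""
    history = [objective(params)]
    for _ in range(rounds):
        for agent in range(len(params)):
            params = update_one_agent(params, agent, objective)
            history.append(objective(params))
    return params, history


if __name__ == "__main__":
    final_params, returns = interleaved_training([0.0, 0.0], joint_return)
    print("final parameters:", final_params)
    print("first and last return:", returns[0], returns[-1])
```

In this toy setting the recorded return does not decrease from one single-agent stage to the next, which loosely mirrors the monotonic-improvement property the abstract claims; the paper's actual guarantees rest on its own assumptions and proofs.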
Related papers
- CGL: Advancing Continual GUI Learning via Reinforcement Fine-Tuning [67.78566256784404]
Although Supervised Fine-Tuning (SFT) facilitates fast adaptation, it often triggers knowledge overwriting. Reinforcement Learning (RL) demonstrates an inherent resilience that shields prior interaction logic from erasure. We propose a Continual GUI Learning framework that balances efficiency and skill retention.
arXiv Detail & Related papers (2026-03-03T13:02:20Z)
- AR-MOT: Autoregressive Multi-object Tracking [56.09738000988466]
We propose a novel autoregressive paradigm that formulates MOT as a sequence generation task within a large language model (LLM) framework. This design enables the model to output structured results through flexible sequence construction, without requiring any task-specific heads. To enhance region-level visual perception, we introduce an Object Tokenizer based on a pretrained detector.
arXiv Detail & Related papers (2026-01-05T09:17:28Z)
- Training One Model to Master Cross-Level Agentic Actions via Reinforcement Learning [42.1534425503333]
CrossAgent is a unified agentic model that masters heterogeneous action spaces and autonomously selects the most effective interface for each step of a trajectory. Experiments on over 800 tasks in the open-world Minecraft environment demonstrate that CrossAgent achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-12-10T14:52:29Z)
- PhysiAgent: An Embodied Agent Framework in Physical World [33.821400205384144]
Vision-Language-Action (VLA) models have achieved notable success but often struggle with limited generalizations. Current approaches often combine these models in rigid, sequential structures. We propose an embodied agent framework, PhysiAgent, tailored to operate effectively in physical environments.
arXiv Detail & Related papers (2025-09-29T09:39:32Z)
- Aime: Towards Fully-Autonomous Multi-Agent Framework [13.494469496862534]
Multi-Agent Systems (MAS) powered by Large Language Models (LLMs) are emerging as a powerful paradigm for solving complex, multifaceted problems. The potential of these systems is often constrained by the prevalent plan-and-execute framework, which suffers from critical limitations. This paper introduces Aime, a novel multi-agent framework designed to overcome these challenges through dynamic, reactive planning and execution.
arXiv Detail & Related papers (2025-07-16T07:38:28Z)
- AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning [82.42421823672954]
AgentCPM-GUI is built for robust and efficient on-device GUI interaction. Our training pipeline includes grounding-aware pre-training to enhance perception. AgentCPM-GUI achieves state-of-the-art performance on five public benchmarks.
arXiv Detail & Related papers (2025-06-02T07:30:29Z)
- AppAgentX: Evolving GUI Agents as Proficient Smartphone Users [34.70342284525283]
We propose a novel evolutionary framework for GUI agents that enhances operational efficiency while retaining intelligence and flexibility. Our approach incorporates a memory mechanism that records the agent's task execution history. Experimental results on multiple benchmark tasks demonstrate that our approach significantly outperforms existing methods in both efficiency and accuracy.
arXiv Detail & Related papers (2025-03-04T04:34:09Z)
- Cooperative Multi-Agent Planning with Adaptive Skill Synthesis [16.228784877899976]
We present a novel multi-agent architecture that integrates vision-language models (VLMs) with a dynamic skill library and structured communication for decentralized closed-loop decision-making. The skill library, bootstrapped from demonstrations, evolves via planner-guided tasks to enable adaptive strategies. We demonstrate its strong performance against state-of-the-art MARL baselines across both symmetric and asymmetric scenarios.
arXiv Detail & Related papers (2025-02-14T13:23:18Z)
- MALMM: Multi-Agent Large Language Models for Zero-Shot Robotics Manipulation [62.854649499866774]
Large Language Models (LLMs) have demonstrated remarkable planning abilities across various domains, including robotics manipulation and navigation. We propose a novel multi-agent LLM framework that distributes high-level planning and low-level control code generation across specialized LLM agents. We evaluate our approach on nine RLBench tasks, including long-horizon tasks, and demonstrate its ability to solve robotics manipulation in a zero-shot setting.
arXiv Detail & Related papers (2024-11-26T17:53:44Z)
- AppAgent v2: Advanced Agent for Flexible Mobile Interactions [57.98933460388985]
This work introduces a novel LLM-based multimodal agent framework for mobile devices. Our agent constructs a flexible action space that enhances adaptability across various applications. Our results demonstrate the framework's superior performance, confirming its effectiveness in real-world scenarios.
arXiv Detail & Related papers (2024-08-05T06:31:39Z)
- UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers [108.92194081987967]
We make the first attempt to explore a universal multi-agent reinforcement learning pipeline, designing one single architecture to fit tasks.
Unlike previous RNN-based models, we utilize a transformer-based model to generate a flexible policy.
The proposed model, named as Universal Policy Decoupling Transformer (UPDeT), further relaxes the action restriction and makes the multi-agent task's decision process more explainable.
arXiv Detail & Related papers (2021-01-20T07:24:24Z)