Scaling Synthetic Task Generation for Agents via Exploration
- URL: http://arxiv.org/abs/2509.25047v1
- Date: Mon, 29 Sep 2025 17:00:02 GMT
- Title: Scaling Synthetic Task Generation for Agents via Exploration
- Authors: Ram Ramrakhya, Andrew Szot, Omar Attia, Yuhao Yang, Anh Nguyen, Bogdan Mazoure, Zhe Gan, Harsh Agrawal, Alexander Toshev
- Abstract summary: Post-training Multimodal Large Language Models (MLLMs) to build interactive agents holds promise across domains such as computer-use, web navigation, and robotics. Existing approaches to task generation rely heavily on human annotation or on prompting MLLMs with limited downstream environment information. We present AutoPlay, a scalable pipeline for task generation that explicitly explores interactive environments to synthesize environment-grounded tasks.
- Score: 67.70129766322985
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Post-training Multimodal Large Language Models (MLLMs) to build interactive agents holds promise across domains such as computer-use, web navigation, and robotics. A key challenge in scaling such post-training is the lack of high-quality downstream agentic task datasets with tasks that are diverse, feasible, and verifiable. Existing approaches to task generation rely heavily on human annotation or on prompting MLLMs with limited downstream environment information, which is either costly or scales poorly because it yields tasks with limited coverage. To remedy this, we present AutoPlay, a scalable pipeline for task generation that explicitly explores interactive environments to discover possible interactions and current state information, and uses them to synthesize environment-grounded tasks. AutoPlay operates in two stages: (i) an exploration phase, where an MLLM explorer agent systematically uncovers novel environment states and functionalities, and (ii) a task generation phase, where a task generator leverages exploration trajectories and a set of task guideline prompts as context to synthesize diverse, executable, and verifiable tasks. We show AutoPlay generates 20k tasks across 20 Android applications and 10k tasks across 13 Ubuntu applications to train mobile-use and computer-use agents. AutoPlay-generated tasks enable large-scale task demonstration synthesis without human annotation by employing an MLLM task executor and verifier. This data enables training MLLM-based UI agents that improve success rates by up to $20.0\%$ on mobile-use and $10.9\%$ on computer-use scenarios. In addition, AutoPlay-generated tasks combined with MLLM verifier-based rewards enable scaling reinforcement learning training of UI agents, leading to an additional $5.7\%$ gain. These results establish AutoPlay as a scalable approach for post-training capable MLLM agents while reducing reliance on human annotation.
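To make the pipeline concrete, below is a minimal Python sketch of the three pieces the abstract names: an explorer that hunts for novel environment states, a task generator conditioned on exploration trajectories plus guideline prompts, and an MLLM verifier used as an RL reward signal. Every interface here (the environment API, `call_mllm`, the binary reward) is an assumption for illustration, not the authors' implementation.

```python
# Sketch only: all names (Step, Trajectory, call_mllm, the env interface)
# are hypothetical placeholders, not the authors' released code.
from dataclasses import dataclass, field

@dataclass
class Step:
    screenshot: str  # e.g. a file path or base64 image of the UI state
    action: str      # the action the explorer chose in that state

@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)

def call_mllm(prompt: str, image: str | None = None) -> str:
    """Placeholder for any multimodal LLM call (assumed interface)."""
    raise NotImplementedError

def explore(env, num_steps: int) -> Trajectory:
    """Stage (i): an MLLM explorer agent probes the environment,
    steering toward states and functionalities it has not seen before."""
    traj = Trajectory()
    obs = env.reset()  # hypothetical env API: obs has .screenshot / .state_id
    seen: set[str] = {obs.state_id}
    for _ in range(num_steps):
        prompt = (f"You are exploring a UI. {len(seen)} states visited so far. "
                  "Propose one action likely to reveal new functionality.")
        action = call_mllm(prompt, image=obs.screenshot)
        traj.steps.append(Step(obs.screenshot, action))
        obs = env.step(action)
        seen.add(obs.state_id)  # dedupe by a state identifier/hash
    return traj

def generate_tasks(traj: Trajectory, guidelines: list[str]) -> list[str]:
    """Stage (ii): a task generator conditions on the exploration
    trajectory plus task-guideline prompts to write grounded tasks."""
    context = "\n".join(f"step {i}: {s.action}" for i, s in enumerate(traj.steps))
    return [
        call_mllm(f"Guideline: {g}\nObserved interactions:\n{context}\n"
                  "Write one diverse, executable, verifiable task.")
        for g in guidelines
    ]

def reward(task: str, rollout: Trajectory) -> float:
    """MLLM-verifier-based reward for RL: 1.0 if the verifier judges the
    rollout to complete the task, else 0.0 (binary reward is an assumption)."""
    trace = "\n".join(s.action for s in rollout.steps)
    verdict = call_mllm(f"Task: {task}\nAgent actions:\n{trace}\n"
                        "Answer 'yes' if the task was completed.")
    return 1.0 if verdict.strip().lower().startswith("yes") else 0.0
```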
Related papers
- MapAgent: Trajectory-Constructed Memory-Augmented Planning for Mobile Task Automation [5.433829353194621]
MapAgent is a framework that leverages memory constructed from historical trajectories to augment current task planning. We introduce a coarse-to-fine task planning approach that retrieves relevant pages from the memory database based on similarity. Results in real-world scenarios demonstrate that MapAgent achieves superior performance to existing methods.
arXiv Detail & Related papers (2025-07-29T16:05:32Z) - Exploring GPT-4 for Robotic Agent Strategy with Real-Time State Feedback and a Reactive Behaviour Framework [0.0]
We explore the use of GPT-4 on a humanoid robot in simulation and the real world as a proof of concept of a novel large language model (LLM) driven behaviour method. The problem involves prompting the LLM with a goal, and the LLM outputs the sub-tasks to complete to achieve that goal. We propose a method that successfully addresses practical concerns around safety, transitions between tasks, time horizons of tasks, and state feedback.
arXiv Detail & Related papers (2025-03-30T21:53:28Z) - Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks. However, they still struggle with problems requiring multi-step decision-making and environmental feedback. We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z) - Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [73.81908518992161]
We introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering.
Spider2-V features real-world tasks in authentic computer environments and incorporates 20 enterprise-level professional applications.
These tasks evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems.
arXiv Detail & Related papers (2024-07-15T17:54:37Z) - CAAP: Context-Aware Action Planning Prompting to Solve Computer Tasks with Front-End UI Only [21.054681757006385]
We propose an agent that perceives its environment solely through screenshot images. By leveraging the reasoning capability of Large Language Models, we eliminate the need for large-scale human demonstration data. The agent achieves an average success rate of 94.5% on MiniWoB++ and an average task score of 62.3 on WebShop.
arXiv Detail & Related papers (2024-06-11T05:21:20Z) - Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z) - TaskBench: Benchmarking Large Language Models for Task Automation [82.2932794189585]
We introduce TaskBench, a framework to evaluate the capability of large language models (LLMs) in task automation.
Specifically, task decomposition, tool selection, and parameter prediction are assessed.
Our approach combines automated construction with rigorous human verification, ensuring high consistency with human evaluation.
arXiv Detail & Related papers (2023-11-30T18:02:44Z) - Generalizable Long-Horizon Manipulations with Large Language Models [91.740084601715]
This work introduces a framework harnessing the capabilities of Large Language Models (LLMs) to generate primitive task conditions for generalizable long-horizon manipulations.
We create a challenging robotic manipulation task suite based on Pybullet for long-horizon task evaluation.
arXiv Detail & Related papers (2023-10-03T17:59:46Z) - GenSim: Generating Robotic Simulation Tasks via Large Language Models [34.79613485106202]
GenSim aims to automatically generate rich simulation environments and expert demonstrations.
We use GPT-4 to expand the existing benchmark tenfold, to over 100 tasks.
With minimal sim-to-real adaptation, multitask policies pretrained on GPT-4-generated simulation tasks exhibit stronger transfer to unseen long-horizon tasks in the real world.
arXiv Detail & Related papers (2023-10-02T17:23:48Z)