Scalable Multi-Task Reinforcement Learning for Generalizable Spatial Intelligence in Visuomotor Agents
- URL: http://arxiv.org/abs/2507.23698v1
- Date: Thu, 31 Jul 2025 16:20:02 GMT
- Title: Scalable Multi-Task Reinforcement Learning for Generalizable Spatial Intelligence in Visuomotor Agents
- Authors: Shaofei Cai, Zhancun Mu, Haiwen Xia, Bowei Zhang, Anji Liu, Yitao Liang
- Abstract summary: We show that RL-finetuned visuomotor agents in Minecraft can achieve zero-shot generalization to unseen worlds. We propose automated task synthesis within the highly customizable Minecraft environment for large-scale multi-task RL training.
- Score: 12.945269075811112
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Reinforcement Learning (RL) has achieved remarkable success in language modeling, its triumph hasn't yet fully translated to visuomotor agents. A primary challenge in RL models is their tendency to overfit specific tasks or environments, thereby hindering the acquisition of generalizable behaviors across diverse settings. This paper provides a preliminary answer to this challenge by demonstrating that RL-finetuned visuomotor agents in Minecraft can achieve zero-shot generalization to unseen worlds. Specifically, we explore RL's potential to enhance generalizable spatial reasoning and interaction capabilities in 3D worlds. To address challenges in multi-task RL representation, we analyze and establish cross-view goal specification as a unified multi-task goal space for visuomotor policies. Furthermore, to overcome the significant bottleneck of manual task design, we propose automated task synthesis within the highly customizable Minecraft environment for large-scale multi-task RL training, and we construct an efficient distributed RL framework to support this. Experimental results show RL significantly boosts interaction success rates by $4\times$ and enables zero-shot generalization of spatial reasoning across diverse environments, including real-world settings. Our findings underscore the immense potential of RL training in 3D simulated environments, especially those amenable to large-scale task generation, for significantly advancing visuomotor agents' spatial reasoning.
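As a concrete illustration of the pipeline the abstract describes, the snippet below sketches what automated task synthesis with cross-view goal specification might look like. This is a minimal Python sketch under assumed names (`Task`, `synthesize_tasks`, the camera-pose fields are all hypothetical); the authors' actual system synthesizes tasks inside Minecraft itself and is not specified at this level of detail here.

```python
import random
from dataclasses import dataclass

@dataclass
class Task:
    """One synthesized task: interact with a target object, with the goal
    specified by an image rendered from a camera pose other than the agent's
    own view (cross-view goal specification). All fields are hypothetical."""
    world_seed: int
    target_object: str
    goal_view: tuple  # (yaw, pitch, distance) of the goal camera

def synthesize_tasks(n_tasks, object_vocab):
    """Hypothetical automated task synthesis: sample a world seed, a target
    object, and a random goal-camera pose for each task."""
    tasks = []
    for _ in range(n_tasks):
        tasks.append(Task(
            world_seed=random.randrange(2**31),
            target_object=random.choice(object_vocab),
            goal_view=(random.uniform(-180.0, 180.0),  # yaw
                       random.uniform(-30.0, 30.0),    # pitch
                       random.uniform(2.0, 8.0)),      # distance in blocks
        ))
    return tasks

if __name__ == "__main__":
    vocab = ["oak_log", "sheep", "chest", "furnace"]
    for task in synthesize_tasks(3, vocab):
        print(task)  # each task would be rendered and rolled out in Minecraft
```

The point of the cross-view goal is that every task, whatever its semantics, reduces to the same interface for the policy: a goal image plus the agent's first-person observations, which is what makes a unified multi-task goal space possible.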
Related papers
- Bigger, Regularized, Categorical: High-Capacity Value Functions are Efficient Multi-Task Learners [60.75160178669076]
We show that the use of high-capacity value models trained via cross-entropy and conditioned on learnable task embeddings addresses the problem of task interference in online reinforcement learning. We test our approach on 7 multi-task benchmarks with over 280 unique tasks, spanning high degree-of-freedom humanoid control and discrete vision-based RL.
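A hedged sketch of the mechanism this summary names: a categorical value head trained with cross-entropy, conditioned on a learnable task embedding. Layer sizes, the bin count, and the nearest-bin target below are illustrative assumptions, not the paper's exact recipe (two-hot or HL-Gauss targets are common refinements in this line of work).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CategoricalValueHead(nn.Module):
    """Sketch of a high-capacity value function: a categorical distribution
    over value bins, trained with cross-entropy, conditioned on a learnable
    task embedding. All sizes are illustrative."""
    def __init__(self, obs_dim, n_tasks, emb_dim=64, n_bins=51,
                 v_min=-10.0, v_max=10.0):
        super().__init__()
        self.task_emb = nn.Embedding(n_tasks, emb_dim)  # learnable per-task code
        self.net = nn.Sequential(
            nn.Linear(obs_dim + emb_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, n_bins),
        )
        self.register_buffer("bins", torch.linspace(v_min, v_max, n_bins))

    def forward(self, obs, task_id):
        x = torch.cat([obs, self.task_emb(task_id)], dim=-1)
        return self.net(x)  # logits over value bins

    def expected_value(self, obs, task_id):
        probs = self.forward(obs, task_id).softmax(dim=-1)
        return (probs * self.bins).sum(dim=-1)  # scalar value estimate

    def loss(self, obs, task_id, target_value):
        # Cross-entropy against the bin containing the TD target; nearest-bin
        # is the simplest variant of the categorical objective.
        logits = self.forward(obs, task_id)
        target_bin = torch.bucketize(target_value, self.bins)
        target_bin = target_bin.clamp(max=self.bins.numel() - 1)
        return F.cross_entropy(logits, target_bin)
```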
arXiv Detail & Related papers (2025-05-29T06:41:45Z) - RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning [125.96848846966087]
Training large language models (LLMs) as interactive agents presents unique challenges. While reinforcement learning has enabled progress in static tasks, multi-turn agent RL training remains underexplored. We propose StarPO, a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents.
arXiv Detail & Related papers (2025-04-24T17:57:08Z) - Towards Large-Scale In-Context Reinforcement Learning by Meta-Training in Randomized Worlds [35.652208216209985]
In-Context Reinforcement Learning (ICRL) enables agents to learn automatically and on-the-fly from their interactive experiences. We propose procedurally generated Markov Decision Processes, named AnyMDP. Our results demonstrate that, with a sufficiently large scale of AnyMDP tasks, the proposed model can generalize to tasks that were not considered in the training set.
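To make the idea of procedurally generated MDPs concrete, here is a minimal sketch that samples random tabular MDPs; `sample_random_mdp` and its shapes are assumptions for illustration, and the published AnyMDP construction is more structured than an i.i.d. Dirichlet draw.

```python
import numpy as np

def sample_random_mdp(n_states=10, n_actions=4, rng=None):
    """Sketch of procedural task generation in the spirit of AnyMDP: draw a
    random tabular MDP (transition kernel + reward table)."""
    if rng is None:
        rng = np.random.default_rng()
    # Dirichlet rows give a valid next-state distribution per (state, action):
    # P has shape (n_states, n_actions, n_states).
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    R = rng.normal(size=(n_states, n_actions))  # reward per (state, action)
    return P, R

# Meta-training stream: each episode comes from a freshly sampled MDP, so the
# agent must adapt in-context instead of memorizing one environment.
tasks = [sample_random_mdp(rng=np.random.default_rng(seed)) for seed in range(1000)]
```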
arXiv Detail & Related papers (2025-02-05T03:59:13Z) - GenRL: Multimodal-foundation world models for generalization in embodied agents [12.263162194821787]
Reinforcement learning (RL) is hard to scale up as it requires a complex reward design for each task.
Current foundation vision-language models (VLMs) require fine-tuning or other adaptations to be adopted in embodied contexts.
The lack of multimodal data in such domains is an obstacle to developing foundation models for embodied applications.
arXiv Detail & Related papers (2024-06-26T03:41:48Z) - Sample Efficient Myopic Exploration Through Multitask Reinforcement Learning with Diverse Tasks [53.44714413181162]
This paper shows that when an agent is trained on a sufficiently diverse set of tasks, a generic policy-sharing algorithm with a myopic exploration design can be sample-efficient.
To the best of our knowledge, this is the first theoretical demonstration of the "exploration benefits" of MTRL.
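The "myopic exploration" analyzed here is essentially dithering such as epsilon-greedy, with one generic learner shared across tasks. The sketch below is a toy rendering of that setting, assuming gym-style environments with small, discrete, hashable state and action spaces; none of the names come from the paper.

```python
import random
from collections import defaultdict

def train_shared_policy(tasks, episodes=1000, epsilon=0.1, alpha=0.1, gamma=0.99):
    """Toy rendering of the analyzed setting: one shared learner (tabular
    Q-learning keyed by (task_id, state, action)) trained across a diverse
    task set with only myopic epsilon-greedy exploration."""
    Q = defaultdict(float)
    for _ in range(episodes):
        tid = random.randrange(len(tasks))          # sample a task uniformly
        env = tasks[tid]
        s, _ = env.reset()
        done = False
        while not done:
            n_a = env.action_space.n
            if random.random() < epsilon:           # myopic: random dithering,
                a = random.randrange(n_a)           # no deep-exploration bonus
            else:
                a = max(range(n_a), key=lambda b: Q[(tid, s, b)])
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            bootstrap = 0.0 if done else gamma * max(Q[(tid, s2, b)] for b in range(n_a))
            Q[(tid, s, a)] += alpha * (r + bootstrap - Q[(tid, s, a)])
            s = s2
    return Q
```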
arXiv Detail & Related papers (2024-03-03T22:57:44Z) - Adaptive action supervision in reinforcement learning from real-world multi-agent demonstrations [10.174009792409928]
We propose a method for adaptive action supervision in RL from real-world demonstrations in multi-agent scenarios.
In the experiments, using chase-and-escape and football tasks with different dynamics between the unknown source and target environments, we show that our approach achieved a balance between reproducing the demonstrated behaviors and generalization ability compared with the baselines.
arXiv Detail & Related papers (2023-05-22T13:33:37Z) - Human-Timescale Adaptation in an Open-Ended Task Space [56.55530165036327]
We show that training an RL agent at scale leads to a general in-context learning algorithm that can adapt to open-ended novel embodied 3D problems as quickly as humans.
Our results lay the foundation for increasingly general and adaptive RL agents that perform well across ever-larger open-ended domains.
arXiv Detail & Related papers (2023-01-18T15:39:21Z) - Avalon: A Benchmark for RL Generalization Using Procedurally Generated
Worlds [0.0]
Avalon is a set of tasks in which embodied agents in procedural 3D worlds must survive by navigating terrain, hunting or gathering food, and avoiding hazards.
Avalon is unique among existing RL benchmarks in that the reward function, world dynamics, and action space are the same for every task.
Standard RL baselines make progress on most tasks but are still far from human performance, suggesting Avalon is challenging enough to advance the quest for generalizable RL.
arXiv Detail & Related papers (2022-10-24T17:34:50Z) - Learning Controllable 3D Level Generators [3.95471659767555]
We introduce several PCGRL tasks for the 3D domain, Minecraft (Mojang Studios, 2009). These tasks challenge RL-based generators with affordances often found in 3D environments, such as jumping, movement in multiple dimensions, and gravity.
We train an agent to optimize each of these tasks, probing the capabilities of prior PCGRL methods.
arXiv Detail & Related papers (2022-06-27T20:43:56Z) - Multitask Adaptation by Retrospective Exploration with Learned World
Models [77.34726150561087]
We propose a meta-learned addressing model, RAMa, that supplies the MBRL agent with training samples drawn from task-agnostic storage.
The model is trained to maximize the agent's expected performance by selecting from storage promising trajectories that solved prior tasks.
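A hypothetical sketch of the addressing idea: score trajectories in a task-agnostic buffer against the current task and return the top candidates to the model-based learner. The dot-product scorer stands in for RAMa's meta-learned addressing model; `retrieve_for_task` and the `embedding` field are assumptions.

```python
import numpy as np

def retrieve_for_task(task_vec, storage, top_k=32):
    """Hypothetical retrospective addressing: rank stored trajectories by a
    similarity score against the current task and hand the most promising
    ones to the MBRL learner. A meta-learned scorer would replace the
    dot product used here."""
    scores = np.array([float(np.dot(task_vec, traj["embedding"])) for traj in storage])
    best = np.argsort(scores)[::-1][:top_k]  # indices of highest-scoring trajectories
    return [storage[i] for i in best]
```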
arXiv Detail & Related papers (2021-10-25T20:02:57Z) - Reinforcement Learning for Sparse-Reward Object-Interaction Tasks in a
First-person Simulated 3D Environment [73.9469267445146]
First-person object-interaction tasks in high-fidelity, simulated 3D environments such as AI2Thor pose significant sample-efficiency challenges for reinforcement learning agents.
We show that one can learn object-interaction tasks from scratch without supervision by learning an attentive object-model as an auxiliary task.
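The sketch below illustrates one plausible form of an attentive object-model auxiliary task: attend over spatial feature cells to pool object-centric features, then predict the next step's attended features from the current ones plus the action. Dimensions, the single-query attention, and the MSE objective are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveObjectModel(nn.Module):
    """Sketch of an attentive object model: a learned query attends over
    spatial feature cells, and a small dynamics head predicts the next step's
    attended features from the current ones plus the action."""
    def __init__(self, feat_dim=64, n_actions=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, feat_dim))
        self.dynamics = nn.Linear(feat_dim + n_actions, feat_dim)

    def attend(self, feats):
        # feats: (B, n_cells, feat_dim) flattened convolutional features
        q = self.query.expand(feats.size(0), -1, -1)
        pooled, _ = self.attn(q, feats, feats)      # (B, 1, feat_dim)
        return pooled.squeeze(1)

    def forward(self, feats_t, action_onehot):
        obj_t = self.attend(feats_t)
        return self.dynamics(torch.cat([obj_t, action_onehot], dim=-1))

def object_model_loss(model, feats_t, action_onehot, feats_t1):
    """Auxiliary objective: predict the next frame's attended object features
    (target taken without gradient, as is common for auxiliary predictions)."""
    pred = model(feats_t, action_onehot)
    with torch.no_grad():
        target = model.attend(feats_t1)
    return F.mse_loss(pred, target)
```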
arXiv Detail & Related papers (2020-10-28T19:27:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and accepts no responsibility for any consequences of its use.