Related papers: M3PO: Massively Multi-Task Model-Based Policy Optimization

URL: http://arxiv.org/abs/2506.21782v1
Date: Thu, 26 Jun 2025 21:39:01 GMT
Title: M3PO: Massively Multi-Task Model-Based Policy Optimization
Authors: Aditya Narendra, Dmitry Makarov, Aleksandr Panov
Abstract summary: We introduce Massively Multi-Task Model-Based Policy Optimization (M3PO), a scalable model-based reinforcement learning framework. M3PO addresses sample inefficiency in single-task settings and poor generalization in multi-task domains. M3PO integrates an implicit world model, trained to predict task outcomes without observation reconstruction, with a hybrid exploration strategy.
Score: 46.42871544295734
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce Massively Multi-Task Model-Based Policy Optimization (M3PO), a scalable model-based reinforcement learning (MBRL) framework designed to address sample inefficiency in single-task settings and poor generalization in multi-task domains. Existing model-based approaches like DreamerV3 rely on pixel-level generative models that neglect control-centric representations, while model-free methods such as PPO suffer from high sample complexity and weak exploration. M3PO integrates an implicit world model, trained to predict task outcomes without observation reconstruction, with a hybrid exploration strategy that combines model-based planning and model-free uncertainty-driven bonuses. This eliminates the bias-variance trade-off in prior methods by using discrepancies between model-based and model-free value estimates to guide exploration, while maintaining stable policy updates through a trust-region optimizer. M3PO provides an efficient and robust alternative to existing model-based policy optimization approaches and achieves state-of-the-art performance across multiple benchmarks.
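The abstract says exploration is guided by discrepancies between model-based and model-free value estimates. A minimal sketch of that idea follows, assuming the discrepancy is taken as an absolute difference and added to the reward as a bonus. The function names, the `scale` parameter, and the additive form are illustrative assumptions, not the formulation used in the M3PO paper.

```python
from typing import List, Sequence


def exploration_bonus(
    v_model_based: Sequence[float],
    v_model_free: Sequence[float],
    scale: float = 1.0,
) -> List[float]:
    """Per-state bonus proportional to the disagreement between two value estimates.

    Assumption: disagreement is measured as an absolute difference. The paper
    may define the discrepancy differently.
    """
    if len(v_model_based) != len(v_model_free):
        raise ValueError("value estimates must have the same length")
    return [scale * abs(mb - mf) for mb, mf in zip(v_model_based, v_model_free)]


def augmented_rewards(
    rewards: Sequence[float],
    v_model_based: Sequence[float],
    v_model_free: Sequence[float],
    scale: float = 1.0,
) -> List[float]:
    """Add the disagreement bonus to the environment rewards (hypothetical helper)."""
    if len(rewards) != len(v_model_based):
        raise ValueError("rewards and value estimates must have the same length")
    bonus = exploration_bonus(v_model_based, v_model_free, scale)
    return [r + b for r, b in zip(rewards, bonus)]


if __name__ == "__main__":
    # States where the two estimators disagree receive a larger bonus.
    print(augmented_rewards([0.0, 1.0], [1.0, 2.0], [1.0, 0.5], scale=2.0))
```

In this sketch, states where the two estimators agree receive no bonus, so the extra reward concentrates on regions where the learned model and the model-free critic disagree.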

Related papers

Decision Flow Policy Optimization [53.825268058199825]
We show that generative models can effectively model complex multi-modal action distributions and achieve superior robotic control in continuous action spaces. Previous methods usually adopt the generative models as behavior models to fit state-conditioned action distributions from datasets. We propose Decision Flow, a unified framework that integrates multi-modal action distribution modeling and policy optimization.
arXiv Detail & Related papers (2025-05-26T03:42:20Z)
PPO-MI: Efficient Black-Box Model Inversion via Proximal Policy Optimization [0.0]
Model inversion attacks pose a significant privacy risk by attempting to reconstruct private training data from trained models. We propose PPO-MI, a novel reinforcement learning-based framework for black-box model inversion attacks. Our approach formulates the inversion task as a Markov Decision Process, where an agent navigates the latent space of a generative model to reconstruct private training samples.
arXiv Detail & Related papers (2025-02-20T08:57:45Z)
When to Update Your Model: Constrained Model-based Reinforcement Learning [50.74369835934703]
We propose a novel and general theoretical scheme for a non-decreasing performance guarantee of model-based RL (MBRL). Our follow-up derived bounds reveal the relationship between model shifts and performance improvement. A further example demonstrates that learning models from a dynamically-varying number of explorations benefits the eventual returns.
arXiv Detail & Related papers (2022-10-15T17:57:43Z)
Fully Decentralized Model-based Policy Optimization for Networked Systems [23.46407780093797]
This work aims to improve data efficiency of multi-agent control by model-based learning. We consider networked systems where agents are cooperative and communicate only locally with their neighbors. In our method, each agent learns a dynamic model to predict future states and broadcasts its predictions by communication, and then the policies are trained under the model rollouts.
arXiv Detail & Related papers (2022-07-13T23:52:14Z)
Evaluating model-based planning and planner amortization for continuous control [79.49319308600228]
We take a hybrid approach, combining model predictive control (MPC) with a learned model and model-free policy learning. We find that well-tuned model-free agents are strong baselines even for high DoF control problems. We show that it is possible to distil a model-based planner into a policy that amortizes the planning without any loss of performance.
arXiv Detail & Related papers (2021-10-07T12:00:40Z)
On the model-based stochastic value gradient for continuous reinforcement learning [50.085645237597056]
We show that simple model-based agents can outperform state-of-the-art model-free agents in terms of both sample-efficiency and final reward. Our findings suggest that model-based policy evaluation deserves closer attention.
arXiv Detail & Related papers (2020-08-28T17:58:29Z)
Control as Hybrid Inference [62.997667081978825]
We present an implementation of CHI which naturally mediates the balance between iterative and amortised inference. We verify the scalability of our algorithm on a continuous control benchmark, demonstrating that it outperforms strong model-free and model-based baselines.
arXiv Detail & Related papers (2020-07-11T19:44:09Z)
Variational Model-based Policy Optimization [34.80171122943031]
Model-based reinforcement learning (RL) algorithms allow us to combine model-generated data with those collected from interaction with the real system in order to alleviate the data efficiency problem in RL. We propose an objective function as a variational lower-bound of a log-likelihood to jointly learn and improve model and policy. Our experiments on a number of continuous control tasks show that despite being more complex, our model-based (E-step) algorithm, called variational model-based policy optimization (VMBPO), is more sample-efficient and
arXiv Detail & Related papers (2020-06-09T18:30:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.