Related papers: Vision-Based Generic Potential Function for Policy Alignment in Multi-Agent Reinforcement Learning

Vision-Based Generic Potential Function for Policy Alignment in Multi-Agent Reinforcement Learning

URL: http://arxiv.org/abs/2502.13430v1
Date: Wed, 19 Feb 2025 05:04:10 GMT
Title: Vision-Based Generic Potential Function for Policy Alignment in Multi-Agent Reinforcement Learning
Authors: Hao Ma, Shijie Wang, Zhiqiang Pu, Siyao Zhao, Xiaolin Ai,
Abstract summary: We propose a hierarchical vision-based reward shaping method to guide the policy of reinforcement learning to align with human common sense.<n>To help the policy adapt to uncertainty and changes in long-horizon tasks, the top layer features an adaptive skill selection module.<n>Our method achieves a higher win rate and effectively aligns the policy with human common sense.
Score: 14.68673479535835
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Guiding the policy of multi-agent reinforcement learning to align with human common sense is a difficult problem, largely due to the complexity of modeling common sense as a reward, especially in complex and long-horizon multi-agent tasks. Recent works have shown the effectiveness of reward shaping, such as potential-based rewards, to enhance policy alignment. The existing works, however, primarily rely on experts to design rule-based rewards, which are often labor-intensive and lack a high-level semantic understanding of common sense. To solve this problem, we propose a hierarchical vision-based reward shaping method. At the bottom layer, a visual-language model (VLM) serves as a generic potential function, guiding the policy to align with human common sense through its intrinsic semantic understanding. To help the policy adapts to uncertainty and changes in long-horizon tasks, the top layer features an adaptive skill selection module based on a visual large language model (vLLM). The module uses instructions, video replays, and training records to dynamically select suitable potential function from a pre-designed pool. Besides, our method is theoretically proven to preserve the optimal policy. Extensive experiments conducted in the Google Research Football environment demonstrate that our method not only achieves a higher win rate but also effectively aligns the policy with human common sense.

Related papers

From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement Learning [62.54484062185869]
We introduce StepAgent, which utilizes step-wise reward to optimize the agent's reinforcement learning process.<n>We propose implicit-reward and inverse reinforcement learning techniques to facilitate agent reflection and policy adjustment.
arXiv Detail & Related papers (2024-11-06T10:35:11Z)
Efficient Adaptation in Mixed-Motive Environments via Hierarchical Opponent Modeling and Planning [51.52387511006586]
We propose Hierarchical Opponent modeling and Planning (HOP), a novel multi-agent decision-making algorithm. HOP is hierarchically composed of two modules: an opponent modeling module that infers others' goals and learns corresponding goal-conditioned policies. HOP exhibits superior few-shot adaptation capabilities when interacting with various unseen agents, and excels in self-play scenarios.
arXiv Detail & Related papers (2024-06-12T08:48:06Z)
Foundational Policy Acquisition via Multitask Learning for Motor Skill Generation [0.9668407688201356]
We propose a multitask reinforcement learning algorithm for foundational policy acquisition to generate novel motor skills. Inspired by human sensorimotor adaptation mechanisms, we developed the learning pipeline to construct the encoder-decoder networks and network selection.
arXiv Detail & Related papers (2023-08-31T05:26:14Z)
Efficient Domain Coverage for Vehicles with Second-Order Dynamics via Multi-Agent Reinforcement Learning [9.939081691797858]
We present a reinforcement learning (RL) approach for the multi-agent efficient domain coverage problem involving agents with second-order dynamics. Our proposed network architecture includes the incorporation of LSTM and self-attention, which allows the trained policy to adapt to a variable number of agents.
arXiv Detail & Related papers (2022-11-11T01:59:12Z)
Option-Aware Adversarial Inverse Reinforcement Learning for Robotic Control [44.77500987121531]
Hierarchical Imitation Learning (HIL) has been proposed to recover highly-complex behaviors in long-horizon tasks from expert demonstrations. We develop a novel HIL algorithm based on Adversarial Inverse Reinforcement Learning. We also propose a Variational Autoencoder framework for learning with our objectives in an end-to-end fashion.
arXiv Detail & Related papers (2022-10-05T00:28:26Z)
Constructing a Good Behavior Basis for Transfer using Generalized Policy Updates [63.58053355357644]
We study the problem of learning a good set of policies, so that when combined together, they can solve a wide variety of unseen reinforcement learning tasks. We show theoretically that having access to a specific set of diverse policies, which we call a set of independent policies, can allow for instantaneously achieving high-level performance.
arXiv Detail & Related papers (2021-12-30T12:20:46Z)
Goal-Conditioned Reinforcement Learning with Imagined Subgoals [89.67840168694259]
We propose to incorporate imagined subgoals into policy learning to facilitate learning of complex tasks. Imagined subgoals are predicted by a separate high-level policy, which is trained simultaneously with the policy and its critic. We evaluate our approach on complex robotic navigation and manipulation tasks and show that it outperforms existing methods by a large margin.
arXiv Detail & Related papers (2021-07-01T15:30:59Z)
Generative Adversarial Reward Learning for Generalized Behavior Tendency Inference [71.11416263370823]
We propose a generative inverse reinforcement learning for user behavioral preference modelling. Our model can automatically learn the rewards from user's actions based on discriminative actor-critic network and Wasserstein GAN.
arXiv Detail & Related papers (2021-05-03T13:14:25Z)
Learn Goal-Conditioned Policy with Intrinsic Motivation for Deep Reinforcement Learning [9.014110264448371]
We propose a novel unsupervised learning approach named goal-conditioned policy with intrinsic motivation (GPIM) GPIM jointly learns both an abstract-level policy and a goal-conditioned policy. Experiments on various robotic tasks demonstrate the effectiveness and efficiency of our proposed GPIM method.
arXiv Detail & Related papers (2021-04-11T16:26:10Z)
Policy Supervectors: General Characterization of Agents by their Behaviour [18.488655590845163]
We propose policy supervectors for characterizing agents by the distribution of states they visit. Policy supervectors can characterize policies regardless of their design philosophy and scale to thousands of policies on a single workstation machine. We demonstrate method's applicability by studying the evolution of policies during reinforcement learning, evolutionary training and imitation learning.
arXiv Detail & Related papers (2020-12-02T14:43:16Z)
Continuous Action Reinforcement Learning from a Mixture of Interpretable Experts [35.80418547105711]
We propose a policy scheme that retains a complex function approxor for its internal value predictions but constrains the policy to have a concise, hierarchical, and human-readable structure. The main technical contribution of the paper is to address the challenges introduced by this non-differentiable state selection procedure.
arXiv Detail & Related papers (2020-06-10T16:02:08Z)
Reward-Conditioned Policies [100.64167842905069]
imitation learning requires near-optimal expert data. Can we learn effective policies via supervised learning without demonstrations? We show how such an approach can be derived as a principled method for policy search.
arXiv Detail & Related papers (2019-12-31T18:07:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.