Personalisation via Dynamic Policy Fusion
- URL: http://arxiv.org/abs/2409.20016v2
- Date: Thu, 3 Oct 2024 03:15:28 GMT
- Title: Personalisation via Dynamic Policy Fusion
- Authors: Ajsal Shereef Palattuparambil, Thommen George Karimpanal, Santu Rana
- Abstract summary: Deep reinforcement learning policies may not align with the personal preferences of human users.
We propose a more practical approach - to adapt the already trained policy to user-specific needs with the help of human feedback.
We empirically demonstrate in a number of environments that our proposed dynamic policy fusion approach consistently achieves the intended task.
- Score: 14.948610521764415
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Deep reinforcement learning (RL) policies, although optimal in terms of task rewards, may not align with the personal preferences of human users. To ensure this alignment, a naive solution would be to retrain the agent using a reward function that encodes the user's specific preferences. However, such a reward function is typically not readily available, and as such, retraining the agent from scratch can be prohibitively expensive. We propose a more practical approach - to adapt the already trained policy to user-specific needs with the help of human feedback. To this end, we infer the user's intent through trajectory-level feedback and combine it with the trained task policy via a theoretically grounded dynamic policy fusion approach. As our approach collects human feedback on the very same trajectories used to learn the task policy, it does not require any additional interactions with the environment, making it a zero-shot approach. We empirically demonstrate in a number of environments that our proposed dynamic policy fusion approach consistently achieves the intended task while simultaneously adhering to user-specific needs.
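The fusion step described in the abstract can be pictured as combining the trained task policy's action preferences with a separate intent model inferred from trajectory-level feedback. Below is a minimal, hypothetical sketch of one plausible reading of such a fusion: each policy is summarised by per-state action scores (e.g., Q-values or logits), and a softmax over their weighted sum gives the personalised action distribution. The function names, the fixed weighting coefficient, and the use of logits are illustrative assumptions, not the paper's exact formulation, which adjusts the combination dynamically.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over action scores."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

def fused_action_distribution(task_logits, intent_logits, beta=0.5, temperature=1.0):
    """Combine task-policy and intent-policy scores into one action distribution.

    beta controls how strongly the inferred user intent modulates the task
    policy; the paper adapts this trade-off dynamically, whereas here it is a
    fixed constant purely for illustration.
    """
    combined = np.asarray(task_logits) + beta * np.asarray(intent_logits)
    return softmax(combined, temperature)

# Toy example with 4 discrete actions: the task policy prefers action 0,
# while the inferred user intent nudges probability mass toward action 2.
task_logits = np.array([2.0, 0.5, 1.0, 0.1])    # e.g. Q-values of the trained task policy
intent_logits = np.array([0.0, 0.2, 1.5, 0.0])  # e.g. scores from the intent model

print(fused_action_distribution(task_logits, intent_logits, beta=0.8))
```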
Related papers
- Learning Control Policies for Variable Objectives from Offline Data [2.7174376960271154]
We introduce a conceptual extension for model-based policy search methods, called variable objective policy (VOP).
We demonstrate that by altering the objectives passed as input to the policy, users gain the freedom to adjust its behavior or re-balance optimization targets at runtime.
arXiv Detail & Related papers (2023-08-11T13:33:59Z)
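One way to read the variable objective policy entry above is a policy that takes the current objective weights as an extra input, so the same network can be steered at runtime. The sketch below uses a fixed random linear mapping purely to show that interface; the architecture and the training procedure are assumptions, not the cited paper's method.

```python
import numpy as np

rng = np.random.default_rng(1)
state_dim, objective_dim, n_actions = 4, 2, 3

# A fixed random linear policy over the concatenated (state, objective) input;
# in practice this mapping would be learned with model-based policy search.
W = rng.normal(size=(state_dim + objective_dim, n_actions))

def act(state, objective_weights):
    """Pick an action given the state and the current objective weighting."""
    x = np.concatenate([state, objective_weights])
    return int(np.argmax(x @ W))

state = rng.normal(size=state_dim)
print(act(state, np.array([1.0, 0.0])))  # prioritise the first objective
print(act(state, np.array([0.0, 1.0])))  # re-balance toward the second objective at runtime
```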
- Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z)
- Residual Q-Learning: Offline and Online Policy Customization without Value [53.47311900133564]
Imitation Learning (IL) is a widely used framework for learning imitative behavior from demonstrations.
We formulate a new problem setting called policy customization.
We propose a novel framework, Residual Q-learning, which can solve the formulated MDP by leveraging the prior policy.
arXiv Detail & Related papers (2023-06-15T22:01:19Z)
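As a rough illustration of the residual idea in the entry above: if the prior task policy comes with Q-values, a customised policy can act greedily on the sum of those prior Q-values and a learned residual term encoding the added preferences. This is a hypothetical sketch under that assumption; the actual Residual Q-Learning updates are in the cited paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3

# Q-values of the pre-trained prior policy (assumed given).
q_prior = rng.normal(size=(n_states, n_actions))

# Residual Q-values capturing the extra, user-specific reward
# (random here for illustration; in practice they would be learned).
q_residual = rng.normal(scale=0.5, size=(n_states, n_actions))

def customized_action(state):
    """Act greedily on the sum of prior and residual Q-values."""
    return int(np.argmax(q_prior[state] + q_residual[state]))

print([customized_action(s) for s in range(n_states)])
```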
- To the Noise and Back: Diffusion for Shared Autonomy [2.341116149201203]
We present a new approach to shared autonomy that employs a modulation of the forward and reverse diffusion process of diffusion models.
Our framework learns a distribution over a space of desired behaviors.
It then employs a diffusion model to translate the user's actions to a sample from this distribution.
arXiv Detail & Related papers (2023-02-23T18:58:36Z)
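The shared-autonomy entry above partially corrupts the user's action with the forward diffusion process and then denoises it toward a learned distribution of desired behaviours. The toy sketch below assumes that distribution is a known one-dimensional Gaussian and uses simple Langevin steps as a stand-in for the learned reverse diffusion model, so it only illustrates the idea, not the cited method.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed distribution over desired behaviours (1-D action for illustration):
# a Gaussian with known mean and scale standing in for the learned model.
target_mean, target_std = 0.5, 0.1

def score(x):
    """Score (gradient of log-density) of the assumed Gaussian target."""
    return (target_mean - x) / target_std**2

def shared_autonomy_action(user_action, noise_level=0.3, steps=50, step_size=1e-3):
    """Partially noise the user's action, then denoise it toward the target.

    The forward step mirrors diffusion-style corruption; the reverse step here
    is plain Langevin dynamics on the assumed target, standing in for a
    learned reverse diffusion model.
    """
    x = np.sqrt(1 - noise_level) * user_action + np.sqrt(noise_level) * rng.normal()
    for _ in range(steps):
        x = x + step_size * score(x) + np.sqrt(2 * step_size) * rng.normal()
    return x

print(shared_autonomy_action(user_action=0.9))
```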
- Eliciting User Preferences for Personalized Multi-Objective Decision Making through Comparative Feedback [76.7007545844273]
We propose a multi-objective decision making framework that accommodates different user preferences over objectives.
Our model consists of a Markov decision process with a vector-valued reward function, with each user having an unknown preference vector.
We suggest an algorithm that finds a nearly optimal policy for the user using a small number of comparison queries.
arXiv Detail & Related papers (2023-02-07T23:58:19Z)
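For the multi-objective entry above, the core object is a vector-valued reward scalarised by a user-specific preference vector. A minimal sketch of that scalarisation with made-up numbers is shown below; the comparison-query elicitation procedure itself is in the cited paper.

```python
import numpy as np

# Vector-valued returns of three candidate policies over two objectives,
# e.g. (task progress, energy saved). Numbers are illustrative.
policy_returns = np.array([
    [10.0, 1.0],
    [ 7.0, 4.0],
    [ 3.0, 6.0],
])

# Unknown in practice and elicited via comparison queries; assumed here.
user_preference = np.array([0.3, 0.7])

scalarised = policy_returns @ user_preference
best = int(np.argmax(scalarised))
print(scalarised, "-> best policy index:", best)
```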
- Planning to Practice: Efficient Online Fine-Tuning by Composing Goals in Latent Space [76.46113138484947]
General-purpose robots require diverse repertoires of behaviors to complete challenging tasks in real-world unstructured environments.
To address this issue, goal-conditioned reinforcement learning aims to acquire policies that can reach goals for a wide range of tasks on command.
We propose Planning to Practice, a method that makes it practical to train goal-conditioned policies for long-horizon tasks.
arXiv Detail & Related papers (2022-05-17T06:58:17Z)
- Fast Model-based Policy Search for Universal Policy Networks [45.44896435487879]
Adapting an agent's behaviour to new environments has been one of the primary focus areas of physics-based reinforcement learning.
We propose a Gaussian Process-based prior learned in simulation, that captures the likely performance of a policy when transferred to a previously unseen environment.
We integrate this prior with a Bayesian optimisation-based policy search process to improve the efficiency of identifying the most appropriate policy from the universal policy network.
arXiv Detail & Related papers (2022-02-11T18:08:02Z)
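The entry above scores candidate policies with a Gaussian Process prior learned in simulation and uses Bayesian optimisation to pick one from a universal policy network. A minimal sketch of the selection step with scikit-learn's GaussianProcessRegressor follows; the one-dimensional policy parameter and the simple upper-confidence-bound acquisition are illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(2)

# Simulated performance of policies indexed by a 1-D parameter (toy prior data).
params_sim = rng.uniform(0, 1, size=(20, 1))
perf_sim = np.sin(6 * params_sim).ravel() + 0.1 * rng.normal(size=20)

# GP prior over policy performance, learned from simulation.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-2)
gp.fit(params_sim, perf_sim)

# Candidate policies from the universal policy network (here: a parameter grid).
candidates = np.linspace(0, 1, 200).reshape(-1, 1)
mean, std = gp.predict(candidates, return_std=True)

# Upper-confidence-bound acquisition: prefer high predicted performance plus
# uncertainty, mimicking one Bayesian-optimisation step in the new environment.
ucb = mean + 1.0 * std
best_param = candidates[np.argmax(ucb)]
print("policy parameter to try first:", float(best_param))
```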
- Goal-Conditioned Reinforcement Learning with Imagined Subgoals [89.67840168694259]
We propose to incorporate imagined subgoals into policy learning to facilitate learning of complex tasks.
Imagined subgoals are predicted by a separate high-level policy, which is trained simultaneously with the policy and its critic.
We evaluate our approach on complex robotic navigation and manipulation tasks and show that it outperforms existing methods by a large margin.
arXiv Detail & Related papers (2021-07-01T15:30:59Z)
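The interface implied by the entry above is a high-level predictor that imagines an intermediate subgoal and a goal-conditioned controller that pursues it. The sketch below uses a trivial midpoint predictor and a move-toward-subgoal controller only to show that two-level structure; both functions are stand-ins, not the learned components of the cited paper.

```python
import numpy as np

def high_level_policy(state, goal):
    """Predict an intermediate (imagined) subgoal between state and goal.

    Here it is simply the midpoint; in the cited work this predictor is
    learned jointly with the policy and its critic.
    """
    return 0.5 * (state + goal)

def low_level_policy(state, subgoal):
    """Goal-conditioned controller: take a unit step toward the subgoal."""
    direction = subgoal - state
    norm = np.linalg.norm(direction)
    return direction / norm if norm > 0 else direction

state = np.array([0.0, 0.0])
goal = np.array([4.0, 2.0])

subgoal = high_level_policy(state, goal)
action = low_level_policy(state, subgoal)
print("imagined subgoal:", subgoal, "action:", action)
```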
- Privacy-Constrained Policies via Mutual Information Regularized Policy Gradients [54.98496284653234]
We consider the task of training a policy that maximizes reward while minimizing disclosure of certain sensitive state variables through the actions.
We solve this problem by introducing a regularizer based on the mutual information between the sensitive state and the actions.
We develop a model-based estimator for optimization of privacy-constrained policies.
arXiv Detail & Related papers (2020-12-30T03:22:35Z)
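The privacy-constrained entry above penalises statistical dependence between a sensitive state variable and the chosen actions. For discrete variables, that dependence can be measured with a plug-in mutual information estimate and subtracted from the reward objective; the sketch below computes such a penalty from sampled (sensitive state, action) pairs. The estimator and the weighting constant are assumptions for illustration, not the paper's model-based estimator.

```python
import numpy as np

def empirical_mutual_information(xs, ys):
    """Plug-in mutual information estimate (in nats) from paired discrete samples."""
    xs, ys = np.asarray(xs), np.asarray(ys)
    mi = 0.0
    for x in np.unique(xs):
        for y in np.unique(ys):
            p_xy = np.mean((xs == x) & (ys == y))
            if p_xy == 0:
                continue
            p_x, p_y = np.mean(xs == x), np.mean(ys == y)
            mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

# Sampled sensitive states and actions from rollouts (toy data).
sensitive = [0, 0, 1, 1, 0, 1, 0, 1]
actions   = [0, 0, 1, 1, 0, 1, 1, 0]

avg_reward = 1.0  # placeholder average episode reward
lam = 0.5         # privacy trade-off coefficient (assumed)
objective = avg_reward - lam * empirical_mutual_information(sensitive, actions)
print(objective)
```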
- First Order Constrained Optimization in Policy Space [19.00289722198614]
We propose a novel approach called First Order Constrained Optimization in Policy Space (FOCOPS).
FOCOPS maximizes an agent's overall reward while ensuring the agent satisfies a set of cost constraints.
We provide empirical evidence that our simple approach achieves better performance on a set of constrained robot locomotion tasks.
arXiv Detail & Related papers (2020-02-16T05:07:17Z)
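The FOCOPS entry above targets the constrained objective of maximising reward subject to cost limits. The toy sketch below only pictures that objective by selecting the best-reward candidate among feasible ones; FOCOPS itself solves the problem with a first-order update in policy space rather than by enumeration, so the selection rule here is an illustrative simplification.

```python
import numpy as np

# Estimated returns and costs of a few candidate policy updates (toy numbers).
rewards = np.array([12.0, 15.0, 18.0, 21.0])
costs   = np.array([ 3.0,  6.0,  9.0, 14.0])
cost_limit = 10.0

# Constrained objective: best reward among updates that satisfy the cost
# constraint. This enumeration stands in for FOCOPS's first-order update.
feasible = costs <= cost_limit
best = int(np.argmax(np.where(feasible, rewards, -np.inf)))
print("chosen update:", best, "reward:", rewards[best], "cost:", costs[best])
```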
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.