Optimizing Latent Goal by Learning from Trajectory Preference
- URL: http://arxiv.org/abs/2412.02125v1
- Date: Tue, 03 Dec 2024 03:27:48 GMT
- Title: Optimizing Latent Goal by Learning from Trajectory Preference
- Authors: Guangyu Zhao, Kewei Lian, Haowei Lin, Haobo Fu, Qiang Fu, Shaofei Cai, Zihao Wang, Yitao Liang,
- Abstract summary: We propose a framework named Preference Goal Tuning (PGT)
PGT allows an instruction following policy to interact with the environment to collect several trajectories.
We use preference learning to fine-tune the initial goal latent representation with the categorized trajectories.
- Score: 18.262362315783268
- License:
- Abstract: A glowing body of work has emerged focusing on instruction-following policies for open-world agents, aiming to better align the agent's behavior with human intentions. However, the performance of these policies is highly susceptible to the initial prompt, which leads to extra efforts in selecting the best instructions. We propose a framework named Preference Goal Tuning (PGT). PGT allows an instruction following policy to interact with the environment to collect several trajectories, which will be categorized into positive and negative samples based on preference. Then we use preference learning to fine-tune the initial goal latent representation with the categorized trajectories while keeping the policy backbone frozen. The experiment result shows that with minimal data and training, PGT achieves an average relative improvement of 72.0% and 81.6% over 17 tasks in 2 different foundation policies respectively, and outperforms the best human-selected instructions. Moreover, PGT surpasses full fine-tuning in the out-of-distribution (OOD) task-execution environments by 13.4%, indicating that our approach retains strong generalization capabilities. Since our approach stores a single latent representation for each task independently, it can be viewed as an efficient method for continual learning, without the risk of catastrophic forgetting or task interference. In short, PGT enhances the performance of agents across nearly all tasks in the Minecraft Skillforge benchmark and demonstrates robustness to the execution environment.
Related papers
- FDPP: Fine-tune Diffusion Policy with Human Preference [57.44575105114056]
Fine-tuning Diffusion Policy with Human Preference learns a reward function through preference-based learning.
This reward is then used to fine-tune the pre-trained policy with reinforcement learning.
Experiments demonstrate that FDPP effectively customizes policy behavior without compromising performance.
arXiv Detail & Related papers (2025-01-14T17:15:27Z) - From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement Learning [62.54484062185869]
We introduce StepAgent, which utilizes step-wise reward to optimize the agent's reinforcement learning process.
We propose implicit-reward and inverse reinforcement learning techniques to facilitate agent reflection and policy adjustment.
arXiv Detail & Related papers (2024-11-06T10:35:11Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, convergence (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Clipped-Objective Policy Gradients for Pessimistic Policy Optimization [3.2996723916635275]
Policy gradient methods seek to produce monotonic improvement through bounded changes in policy outputs.
In this work, we find that the performance of PPO, when applied to continuous action spaces, may be consistently improved through a simple change in objective.
We show that the clipped-objective policy gradient (COPG) objective is on average "pessimistic" compared to both the PPO objective and (2) this pessimism promotes enhanced exploration.
arXiv Detail & Related papers (2023-11-10T03:02:49Z) - Optimistic Multi-Agent Policy Gradient [23.781837938235036]
Relative overgeneralization (RO) occurs when agents converge towards a suboptimal joint policy.
No methods have been proposed for addressing RO in multi-agent policy gradient (MAPG) methods.
We propose a general, yet simple, framework to enable optimistic updates in MAPG methods that alleviate the RO problem.
arXiv Detail & Related papers (2023-11-03T14:47:54Z) - Robust Policy Optimization in Deep Reinforcement Learning [16.999444076456268]
In continuous action domains, parameterized distribution of action distribution allows easy control of exploration.
In particular, we propose an algorithm called Robust Policy Optimization (RPO), which leverages a perturbed distribution.
We evaluated our methods on various continuous control tasks from DeepMind Control, OpenAI Gym, Pybullet, and IsaacGym.
arXiv Detail & Related papers (2022-12-14T22:43:56Z) - APS: Active Pretraining with Successor Features [96.24533716878055]
We show that by reinterpreting and combining successorcitepHansenFast with non entropy, the intractable mutual information can be efficiently optimized.
The proposed method Active Pretraining with Successor Feature (APS) explores the environment via non entropy, and the explored data can be efficiently leveraged to learn behavior.
arXiv Detail & Related papers (2021-08-31T16:30:35Z) - Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z) - On Proximal Policy Optimization's Heavy-tailed Gradients [150.08522793940708]
We study the heavy-tailed nature of the gradients of the Proximal Policy Optimization surrogate reward function.
In this paper, we study the effects of the standard PPO clippings, demonstrating that these tricks primarily serve to offset heavy-tailedness in gradients.
We propose incorporating GMOM, a high-dimensional robust estimator, into PPO as a substitute for three clipping tricks.
arXiv Detail & Related papers (2021-02-20T05:51:28Z) - Off-Policy Deep Reinforcement Learning with Analogous Disentangled
Exploration [33.25932244741268]
Off-policy reinforcement learning (RL) is concerned with learning a rewarding policy by executing another policy that gathers samples of experience.
While the former policy is rewarding but in-expressive (in most cases, deterministic), doing well in the latter task, in contrast, requires an expressive policy that offers guided and effective exploration.
We propose Analogous Disentangled Actor-Critic (ADAC) to mitigate this problem.
arXiv Detail & Related papers (2020-02-25T08:49:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.