Related papers: PRACT: Optimizing Principled Reasoning and Acting of LLM Agent

URL: http://arxiv.org/abs/2410.18528v1
Date: Thu, 24 Oct 2024 08:21:51 GMT
Title: PRACT: Optimizing Principled Reasoning and Acting of LLM Agent
Authors: Zhiwei Liu, Weiran Yao, Jianguo Zhang, Rithesh Murthy, Liangwei Yang, Zuxin Liu, Tian Lan, Ming Zhu, Juntao Tan, Shirley Kokane, Thai Hoang, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong
Abstract summary: We introduce the Principled Reasoning and Acting (PRAct) framework, a novel method for learning and enforcing action principles from trajectory data. We propose a new optimization framework, Reflective Principle Optimization (RPO), to adapt action principles to specific task requirements. Experimental results across four environments demonstrate that the PRAct agent, leveraging the RPO framework, effectively learns and applies action principles to enhance performance.
Score: 96.10771520261596
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce the Principled Reasoning and Acting (PRAct) framework, a novel method for learning and enforcing action principles from trajectory data. Central to our approach is the use of text gradients from a reflection and optimization engine to derive these action principles. To adapt action principles to specific task requirements, we propose a new optimization framework, Reflective Principle Optimization (RPO). After execution, RPO employs a reflector to critique current action principles and an optimizer to update them accordingly. We develop the RPO framework under two scenarios: Reward-RPO, which uses environmental rewards for reflection, and Self-RPO, which conducts self-reflection without external rewards. Additionally, two RPO methods, RPO-Traj and RPO-Batch, are introduced to adapt to different settings. Experimental results across four environments demonstrate that the PRAct agent, leveraging the RPO framework, effectively learns and applies action principles to enhance performance.
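
The reflect-then-update cycle described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every name here (Trajectory, toy_episode, reward_reflector, self_reflector, toy_optimizer, reflective_principle_loop) is hypothetical, the reflector and optimizer are simple rule-based stand-ins for what the paper describes as LLM-driven components, and the per-trajectory update schedule is an assumption rather than a confirmed description of RPO-Traj or RPO-Batch.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    steps: List[str]
    reward: float  # environment reward; not consulted by the self-reflection variant


def toy_episode(principles: str) -> Trajectory:
    """Hypothetical toy environment: the agent verifies only if told to."""
    if "verify" in principles:
        return Trajectory(steps=["search", "verify", "buy"], reward=1.0)
    return Trajectory(steps=["search", "buy"], reward=0.0)


def reward_reflector(principles: str, traj: Trajectory) -> str:
    """Stand-in for reward-based reflection: critique only when reward is low."""
    if traj.reward >= 1.0:
        return ""
    return "Always verify the item before buying."


def self_reflector(principles: str, traj: Trajectory) -> str:
    """Stand-in for self-reflection: inspect the steps, never the reward."""
    if "verify" in traj.steps:
        return ""
    return "Always verify the item before buying."


def toy_optimizer(principles: str, critique: str) -> str:
    """Stand-in optimizer: fold a non-empty critique into the principles."""
    if not critique:
        return principles
    return principles + "\n- " + critique


def reflective_principle_loop(
    principles: str,
    run_episode: Callable[[str], Trajectory],
    reflector: Callable[[str, Trajectory], str],
    optimizer: Callable[[str, str], str],
    n_iters: int,
) -> str:
    """Execute, critique the current principles, then update them (assumed schedule)."""
    for _ in range(n_iters):
        traj = run_episode(principles)           # act under the current principles
        critique = reflector(principles, traj)   # reflector critiques the principles
        principles = optimizer(principles, critique)  # optimizer rewrites them
    return principles


learned = reflective_principle_loop(
    "- search first", toy_episode, reward_reflector, toy_optimizer, n_iters=3
)
print(learned)
```

Swapping reward_reflector for self_reflector changes only where the critique signal comes from, which mirrors the Reward-RPO versus Self-RPO distinction at a very coarse level.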

Related papers

Learning in Context, Guided by Choice: A Reward-Free Paradigm for Reinforcement Learning with Transformers [55.33468902405567]
We propose a new learning paradigm, In-Context Preference-based Reinforcement Learning (ICPRL), in which both pretraining and deployment rely solely on preference feedback. ICPRL enables strong in-context generalization to unseen tasks, achieving performance comparable to ICRL methods trained with full reward supervision.
arXiv Detail & Related papers (2026-02-09T03:42:16Z)
Agentic Policy Optimization via Instruction-Policy Co-Evolution [44.74237684380034]
INSPO is a novel framework for instruction-policy co-evolution. It integrates instruction optimization as a dynamic component of the reinforcement learning loop. In experiments, INSPO achieves substantial performance gains with only a marginal increase in computational overhead.
arXiv Detail & Related papers (2025-12-01T17:56:29Z)
Bootstrapping LLMs via Preference-Based Policy Optimization [11.796630967998544]
Bootstrapping large language models (LLMs) through preference-based policy optimization offers a promising direction for aligning model behavior with human preferences. We propose a novel preference-based policy optimization framework that formulates the learning process as a min-max game between the main policy and a reward model. Our approach consistently outperforms existing state-of-the-art preference optimization techniques.
arXiv Detail & Related papers (2025-11-17T01:41:14Z)
RecLLM-R1: A Two-Stage Training Paradigm with Reinforcement Learning and Chain-of-Thought v1 [20.92548890511589]
This paper introduces RecLLM-R1, a novel recommendation framework leveraging Large Language Models (LLMs). RecLLM-R1 significantly surpasses existing baseline methods across a spectrum of evaluation metrics, including accuracy, diversity, and novelty.
arXiv Detail & Related papers (2025-06-24T01:39:34Z)
Value-Free Policy Optimization via Reward Partitioning [0.08192907805418585]
We introduce Reward Partitioning Optimization (RPO), a new method for single-trajectory reinforcement learning. RPO normalizes observed rewards using an approach estimated directly from data. We validate RPO on scalar-feedback language modeling tasks using Flan-T5 encoder-decoder models.
arXiv Detail & Related papers (2025-06-16T17:06:27Z)
Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment [45.45508377432791]
This paper introduces Reward-Aware Preference Optimization (RPO), a mathematical framework that unifies popular preference optimization techniques. RPO provides a structured approach to disentangle and systematically study the impact of various design choices. We propose a new experimental setup that enables the clean and direct ablation of such design choices.
arXiv Detail & Related papers (2025-01-31T22:39:04Z)
Large Language Model driven Policy Exploration for Recommender Systems [50.70228564385797]
Offline RL policies trained on static user data are vulnerable to distribution shift when deployed in dynamic online environments. Online RL-based RS also face challenges in production deployment due to the risks of exposing users to untrained or unstable policies. Large Language Models (LLMs) offer a promising solution to mimic user objectives and preferences for pre-training policies offline. We propose an Interaction-Augmented Learned Policy (iALP) that utilizes user preferences distilled from an LLM.
arXiv Detail & Related papers (2025-01-23T16:37:44Z)
Reflective Policy Optimization [20.228281670899204]
Reflective Policy Optimization (RPO) amalgamates past and future state-action information for policy optimization. RPO empowers the agent for introspection, allowing modifications to its actions within the current state. Empirical results demonstrate RPO's feasibility and efficacy in two reinforcement learning benchmarks.
arXiv Detail & Related papers (2024-06-06T01:46:49Z)
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL. We find that REBEL provides a unified approach to language modeling and image generation, with performance stronger than or similar to that of PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z)
DPO: Differential reinforcement learning with application to optimal configuration search [3.2857981869020327]
Reinforcement learning with continuous state and action spaces remains one of the most challenging problems within the field. We propose the first differential RL framework that can handle settings with limited training samples and short-length episodes.
arXiv Detail & Related papers (2024-04-24T03:11:12Z)
Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint [56.74058752955209]
This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF). We first identify the primary challenges of existing popular methods like offline PPO and offline DPO as lacking in strategic exploration of the environment. We propose efficient algorithms with finite-sample theoretical guarantees.
arXiv Detail & Related papers (2023-12-18T18:58:42Z)
Counterfactual Explanation Policies in RL [3.674863913115432]
COUNTERPOL is the first framework to analyze Reinforcement Learning policies using counterfactual explanations. We establish a theoretical connection between Counterpol and widely used trust region-based policy optimization methods in RL.
arXiv Detail & Related papers (2023-07-25T01:14:56Z)
Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces. We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories. We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z)
Diverse Policy Optimization for Structured Action Space [59.361076277997704]
We propose Diverse Policy Optimization (DPO) to model the policies in structured action space as energy-based models (EBM). A novel and powerful generative model, GFlowNet, is introduced as the efficient, diverse EBM-based policy sampler. Experiments on ATSC and Battle benchmarks demonstrate that DPO can efficiently discover surprisingly diverse policies.
arXiv Detail & Related papers (2023-02-23T10:48:09Z)
REPTILE: A Proactive Real-Time Deep Reinforcement Learning Self-adaptive Framework [0.6335848702857039]
A general framework is proposed to support the development of software systems that are able to adapt their behaviour according to changes in the operating environment. The proposed approach, named REPTILE, works in a completely proactive manner and relies on Deep Reinforcement Learning-based agents to react to events. In our framework, two types of novelties are taken into account: those related to the context/environment and those related to the physical architecture itself. The framework, predicting those novelties before their occurrence, extracts time-changing models of the environment and uses a suitable Markov Decision Process to deal with the real-time setting.
arXiv Detail & Related papers (2022-03-28T12:38:08Z)
