Constrained Policy Optimization for Controlled Self-Learning in
Conversational AI Systems
- URL: http://arxiv.org/abs/2209.08429v1
- Date: Sat, 17 Sep 2022 23:44:13 GMT
- Title: Constrained Policy Optimization for Controlled Self-Learning in
Conversational AI Systems
- Authors: Mohammad Kachuee, Sungjin Lee
- Abstract summary: We introduce a scalable framework for supporting fine-grained exploration targets for individual domains via user-defined constraints.
We present a novel meta-gradient learning approach that is scalable and practical to address this problem.
We conduct extensive experiments using data from a real-world conversational AI on a set of realistic constraint benchmarks.
- Score: 18.546197100318693
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, self-learning methods based on user satisfaction metrics and
contextual bandits have shown promising results to enable consistent
improvements in conversational AI systems. However, directly targeting such
metrics by off-policy bandit learning objectives often increases the risk of
making abrupt policy changes that break the current user experience. In this
study, we introduce a scalable framework for supporting fine-grained
exploration targets for individual domains via user-defined constraints. For
example, we may want to ensure fewer policy deviations in business-critical
domains such as shopping, while allocating more exploration budget to domains
such as music. Furthermore, we present a novel meta-gradient learning approach
that is scalable and practical to address this problem. The proposed method
adjusts constraint violation penalty terms adaptively through a meta objective
that encourages balanced constraint satisfaction across domains. We conduct
extensive experiments using data from a real-world conversational AI on a set
of realistic constraint benchmarks. Based on the experimental results, we
demonstrate that the proposed approach is capable of achieving the best balance
between the policy value and constraint satisfaction rate.
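To make the setup above concrete, the following Python sketch shows one way per-domain deviation budgets and adaptive penalty weights could interact. The function and variable names are hypothetical, and the balancing update is a simplified stand-in for the paper's meta-gradient objective, not the authors' algorithm.

import numpy as np

def policy_deviation(pi_new, pi_old):
    """Total-variation-style disagreement between new and old action
    distributions; both arrays have shape (n_samples, n_actions)."""
    return np.mean(np.abs(pi_new - pi_old).sum(axis=1) / 2.0)

def constrained_update(pi_new, pi_old, rewards_by_domain, budgets, lam, lr=0.1):
    """One outer step: measure per-domain policy deviation, compare it with the
    user-defined exploration budget, and adapt the penalty weights so that
    violations stay balanced across domains (all names are hypothetical)."""
    violations = {}
    for d in budgets:
        dev = policy_deviation(pi_new[d], pi_old[d])
        violations[d] = max(0.0, dev - budgets[d])      # only budget overshoot counts

    mean_violation = float(np.mean(list(violations.values())))
    for d in lam:
        # Raise lambda where a domain violates more than average, lower it where
        # it violates less: a crude stand-in for the meta-gradient balancing step.
        lam[d] = max(0.0, lam[d] + lr * (violations[d] - mean_violation))

    # Penalized off-policy value: estimated reward minus per-domain penalties.
    value = sum(np.mean(rewards_by_domain[d]) - lam[d] * violations[d]
                for d in budgets)
    return value, lam, violations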
Related papers
- Positivity-free Policy Learning with Observational Data [8.293758599118618]
This study introduces a novel positivity-free (stochastic) policy learning framework.
We propose incremental propensity score policies to adjust propensity score values instead of assigning fixed values to treatments.
This paper provides a thorough exploration of the theoretical guarantees associated with policy learning and validates the proposed framework's finite-sample performance.
arXiv Detail & Related papers (2023-10-10T19:47:27Z)
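As a rough illustration of the incremental propensity score idea in the entry above, the snippet below shifts an estimated propensity by an odds multiplier delta instead of forcing treatment probabilities to 0 or 1, so positivity at the extremes is never required. This functional form is the one commonly used for incremental interventions and may differ in detail from the paper; all names and values are illustrative.

import numpy as np

def incremental_policy(pi_hat, delta):
    """Shift estimated propensities pi_hat (values in (0, 1)) by odds factor delta.

    delta > 1 raises the treatment probability, delta < 1 lowers it,
    and delta = 1 leaves the observational policy unchanged.
    """
    return delta * pi_hat / (delta * pi_hat + 1.0 - pi_hat)

pi_hat = np.array([0.05, 0.5, 0.95])
print(incremental_policy(pi_hat, delta=2.0))   # approx. [0.095, 0.667, 0.974]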
- Probabilistic Reach-Avoid for Bayesian Neural Networks [71.67052234622781]
We show that an optimal synthesis algorithm can provide more than a four-fold increase in the number of certifiable states.
The algorithm is able to provide more than a three-fold increase in the average guaranteed reach-avoid probability.
arXiv Detail & Related papers (2023-10-03T10:52:21Z)
- Optimizing Credit Limit Adjustments Under Adversarial Goals Using Reinforcement Learning [42.303733194571905]
We seek to find and automatize an optimal credit card limit adjustment policy by employing reinforcement learning techniques.
Our research establishes a conceptual structure for applying a reinforcement learning framework to credit limit adjustment.
arXiv Detail & Related papers (2023-06-27T16:10:36Z)
- Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
arXiv Detail & Related papers (2023-04-05T16:45:11Z)
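The conformal step behind this kind of interval can be sketched as follows. This minimal example applies plain split conformal prediction to trajectory returns and deliberately ignores how the paper corrects for the shift between behavior and target policy, so it is an assumption-laden illustration rather than the paper's method; all names are illustrative.

import numpy as np

def conformal_interval(y_cal, y_hat_cal, y_hat_test, alpha=0.1):
    n = len(y_cal)
    scores = np.abs(y_cal - y_hat_cal)              # nonconformity scores
    k = int(np.ceil((n + 1) * (1 - alpha)))         # finite-sample corrected rank
    q = np.sort(scores)[min(k, n) - 1]              # conformal quantile
    return y_hat_test - q, y_hat_test + q

rng = np.random.default_rng(0)
y_cal = rng.normal(1.0, 0.3, size=200)              # observed returns (calibration split)
y_hat_cal = np.full(200, 1.0)                       # model's return predictions
lo, hi = conformal_interval(y_cal, y_hat_cal, y_hat_test=1.0, alpha=0.1)
print(lo, hi)                                       # interval intended to cover ~90% of returns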
- Pragmatic Fairness: Developing Policies with Outcome Disparity Control [15.618754942472822]
We introduce a causal framework for designing optimal policies that satisfy fairness constraints.
We propose two different fairness constraints: a moderation breaking constraint and an equal benefit constraint.
arXiv Detail & Related papers (2023-01-28T19:25:56Z)
- Penalized Proximal Policy Optimization for Safe Reinforcement Learning [68.86485583981866]
We propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem.
P3O utilizes a simple-yet-effective penalty function to eliminate cost constraints and removes the trust-region constraint by the clipped surrogate objective.
We show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotive tasks.
arXiv Detail & Related papers (2022-05-24T06:15:51Z)
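A schematic of the penalized clipped-surrogate idea described in the P3O entry above: the reward surrogate is clipped as in PPO, and a hinge penalty activates whenever the estimated cost exceeds its limit, so the constrained problem becomes a single unconstrained objective. The exact loss in the paper may differ; names and constants below are illustrative.

import numpy as np

def p3o_style_loss(ratio, adv_r, adv_c, cost_estimate, cost_limit,
                   clip_eps=0.2, kappa=10.0):
    # Clipped surrogate for reward (to be maximized, hence negated in the loss).
    surr_r = np.minimum(ratio * adv_r,
                        np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * adv_r)
    # Surrogate for the expected cost of the new policy.
    surr_c = cost_estimate + np.mean(ratio * adv_c)
    # Exact penalty: only active when the cost constraint is violated.
    penalty = kappa * max(0.0, surr_c - cost_limit)
    return -np.mean(surr_r) + penalty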
- A Regularized Implicit Policy for Offline Reinforcement Learning [54.7427227775581]
Offline reinforcement learning enables learning from a fixed dataset, without further interactions with the environment.
We propose a framework that supports learning a flexible yet well-regularized fully-implicit policy.
Experiments and ablation study on the D4RL dataset validate our framework and the effectiveness of our algorithmic designs.
arXiv Detail & Related papers (2022-02-19T20:22:04Z) - Privacy-Constrained Policies via Mutual Information Regularized Policy Gradients [54.98496284653234]
We consider the task of training a policy that maximizes reward while minimizing disclosure of certain sensitive state variables through the actions.
We solve this problem by introducing a regularizer based on the mutual information between the sensitive state and the actions.
We develop a model-based estimator for optimization of privacy-constrained policies.
arXiv Detail & Related papers (2020-12-30T03:22:35Z)
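One simple way to see where such a mutual-information regularizer enters is to fold a plug-in pointwise MI penalty into the reward before a policy-gradient update, as sketched below. The paper instead develops a model-based estimator, so this batch-count version is only an assumed illustration with hypothetical names.

import numpy as np
from collections import Counter

def mi_shaped_rewards(u, a, r, lam=1.0):
    """Penalize rewards by a plug-in pointwise MI term between a discrete
    sensitive variable u and the discrete action a, estimated from batch counts."""
    n = len(u)
    p_a = Counter(a)
    p_u = Counter(u)
    p_ua = Counter(zip(u, a))
    shaped = []
    for ui, ai, ri in zip(u, a, r):
        # Pointwise mutual information log p(a|u) / p(a) from empirical counts.
        pmi = np.log((p_ua[(ui, ai)] / p_u[ui]) / (p_a[ai] / n))
        shaped.append(ri - lam * pmi)       # penalize actions that reveal u
    return np.array(shaped)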
- Off-Policy Optimization of Portfolio Allocation Policies under Constraints [0.8848340429852071]
The dynamic portfolio optimization problem in finance frequently requires learning policies that adhere to various constraints, driven by investor preferences and risk.
We motivate this problem of finding an allocation policy within a sequential decision making framework and study the effects of: (a) using data collected under previously employed policies, which may be sub-optimal and constraint-violating, and (b) imposing desired constraints while computing near-optimal policies with this data.
arXiv Detail & Related papers (2020-12-21T22:22:04Z)
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z)