Scalable and Safe Remediation of Defective Actions in Self-Learning
Conversational Systems
- URL: http://arxiv.org/abs/2305.10528v1
- Date: Wed, 17 May 2023 19:22:24 GMT
- Title: Scalable and Safe Remediation of Defective Actions in Self-Learning
Conversational Systems
- Authors: Sarthak Ahuja, Mohammad Kachuee, Fateme Sheikholeslami, Weiqing Liu,
Jaeyoung Do
- Abstract summary: Off-Policy reinforcement learning has been a driving force for the state-of-the-art conversational AIs.
In large-scale commercial settings, it is often challenging to balance between policy improvements and experience continuity.
We propose a method for curating and leveraging high-precision samples sourced from historical regression incident reports.
- Score: 14.030576576114818
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Off-policy reinforcement learning has been a driving force for
state-of-the-art conversational AIs, leading to more natural human-agent
interactions and improving user satisfaction for goal-oriented agents.
However, in large-scale commercial settings, it is often challenging to
balance policy improvements against experience continuity across the broad
spectrum of applications handled by such systems. In the literature,
off-policy evaluation and guard-railing on aggregate statistics have been
commonly used to address this problem. In this paper, we propose a method for
curating and leveraging high-precision samples sourced from historical
regression incident reports to validate, safeguard, and improve policies
prior to online deployment. We conducted extensive experiments using data
from a real-world conversational system and actual regression incidents. The
proposed method is currently deployed in our production system to protect
customers against broken experiences and enable long-term policy improvements.
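The guard-railing idea in the abstract — checking a candidate policy against curated high-precision samples from past regression incidents before deployment — could be sketched as below. The function and field names (`guardrail_check`, `correct_action`) and the pass-rate threshold are illustrative assumptions, not the paper's actual API.

```python
# Minimal sketch: validate a candidate policy against curated regression
# samples before online deployment. Each sample records a state and the
# known-correct action from a historical incident report.

def guardrail_check(policy, regression_samples, min_pass_rate=1.0):
    """Return (passed, failures). The candidate policy must reproduce the
    known-correct action on (at least) min_pass_rate of the curated samples."""
    failures = []
    for sample in regression_samples:
        predicted = policy(sample["state"])
        if predicted != sample["correct_action"]:
            failures.append((sample["state"], predicted, sample["correct_action"]))
    pass_rate = 1.0 - len(failures) / max(len(regression_samples), 1)
    return pass_rate >= min_pass_rate, failures

# Toy usage: a policy mapping utterances to skills (hypothetical domains).
candidate_policy = lambda state: {"play music": "Music",
                                  "set alarm": "Alarms"}.get(state, "Fallback")
samples = [
    {"state": "play music", "correct_action": "Music"},
    {"state": "set alarm", "correct_action": "Alarms"},
]
ok, failures = guardrail_check(candidate_policy, samples)
```

A policy that breaks any curated sample would be blocked (or flagged) before it reaches production, which is the "experience continuity" side of the trade-off the abstract describes.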
Related papers
- Customize Multi-modal RAI Guardrails with Precedent-based predictions [55.63757336900865]
A multi-modal guardrail must effectively filter image content based on user-defined policies.
Existing fine-tuning methods typically condition predictions on pre-defined policies.
We propose to condition the model's judgment on "precedents", which are the reasoning processes of prior data points similar to the given input.
arXiv Detail & Related papers (2025-07-28T03:45:34Z) - Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model [84.00480999255628]
Reinforcement Learning algorithms for safety alignment of Large Language Models (LLMs) encounter the challenge of distribution shift.
Current approaches typically address this issue through online sampling from the target policy.
We propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals.
arXiv Detail & Related papers (2025-03-13T06:40:34Z) - Refining Input Guardrails: Enhancing LLM-as-a-Judge Efficiency Through Chain-of-Thought Fine-Tuning and Alignment [2.9775785740619254]
Large Language Models (LLMs) have demonstrated powerful capabilities that render them valuable in different applications, including conversational AI products.
It is paramount to ensure the security and reliability of these products by mitigating their vulnerabilities towards malicious user interactions.
We present a study on the efficacy of fine-tuning and aligning Chain-of-Thought (CoT) responses of different LLMs that serve as input moderation guardrails.
arXiv Detail & Related papers (2025-01-22T18:40:57Z) - Latent Safety-Constrained Policy Approach for Safe Offline Reinforcement Learning [7.888219789657414]
In safe offline reinforcement learning (RL), the objective is to develop a policy that maximizes cumulative rewards while strictly adhering to safety constraints.
We address these issues with a novel approach that begins by learning a conservatively safe policy through the use of Conditional Variational Autoencoders.
We frame this as a Constrained Reward-Return Maximization problem, wherein the policy aims to optimize rewards while complying with the inferred latent safety constraints.
arXiv Detail & Related papers (2024-12-11T22:00:07Z) - Iterative Batch Reinforcement Learning via Safe Diversified Model-based Policy Search [2.0072624123275533]
Batch reinforcement learning enables policy learning without direct interaction with the environment during training.
This approach is well-suited for high-risk and cost-intensive applications, such as industrial control.
We present an algorithmic methodology for iterative batch reinforcement learning based on ensemble-based model-based policy search.
arXiv Detail & Related papers (2024-11-14T11:10:36Z) - Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
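The inverse propensity score (IPS) estimator underlying the UIPS entry above can be sketched as follows. The `shrink` term is a simple weight-dampening stand-in for the paper's uncertainty-aware reweighting, included only to illustrate where such a correction would act; it is not the actual UIPS rule.

```python
# Minimal sketch of an IPS estimate of a target policy's value from data
# logged under a different (behavior) policy. Each sample is a tuple
# (context, action, reward, logging_propensity).

def ips_estimate(samples, target_prob, shrink=0.0):
    """Importance-weighted average reward under the target policy.
    shrink > 0 dampens large weights (illustrative stand-in for an
    uncertainty-aware correction)."""
    total = 0.0
    for x, a, r, mu in samples:
        w = target_prob(x, a) / mu          # importance weight pi/mu
        w = w / (1.0 + shrink * w)          # optional weight shrinkage
        total += w * r
    return total / len(samples)

# Toy usage: logging policy chose uniformly between 2 actions (mu = 0.5);
# the target policy deterministically picks action 0.
target_prob = lambda x, a: 1.0 if a == 0 else 0.0
logged = [("x1", 0, 1.0, 0.5), ("x2", 1, 0.0, 0.5)]
value = ips_estimate(logged, target_prob)
```

Plain IPS is unbiased but high-variance when propensities are small; uncertainty-aware variants like UIPS trade a little bias for substantially lower variance on those samples.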
arXiv Detail & Related papers (2023-03-11T11:42:26Z) - Using In-Context Learning to Improve Dialogue Safety [45.303005593685036]
We investigate a retrieval-based method for reducing bias and toxicity in responses from chatbots.
It uses in-context learning to steer a model towards safer generations.
We find our method performs competitively with strong baselines without requiring training.
arXiv Detail & Related papers (2023-02-02T04:46:03Z) - Safety Correction from Baseline: Towards the Risk-aware Policy in
Robotics via Dual-agent Reinforcement Learning [64.11013095004786]
We propose a dual-agent safe reinforcement learning strategy consisting of a baseline and a safe agent.
Such a decoupled framework enables high flexibility, data efficiency and risk-awareness for RL-based control.
The proposed method outperforms the state-of-the-art safe RL algorithms on difficult robot locomotion and manipulation tasks.
arXiv Detail & Related papers (2022-12-14T03:11:25Z) - Constrained Policy Optimization for Controlled Self-Learning in
Conversational AI Systems [18.546197100318693]
We introduce a scalable framework for supporting fine-grained exploration targets for individual domains via user-defined constraints.
We present a novel meta-gradient learning approach that is scalable and practical to address this problem.
We conduct extensive experiments using data from a real-world conversational AI on a set of realistic constraint benchmarks.
arXiv Detail & Related papers (2022-09-17T23:44:13Z) - Mitigating Off-Policy Bias in Actor-Critic Methods with One-Step
Q-learning: A Novel Correction Approach [0.0]
We introduce a novel policy similarity measure to mitigate the effects of the discrepancy between behavior and target policies in continuous control.
Our method offers an adequate single-step off-policy correction that is applicable to deterministic policy networks.
arXiv Detail & Related papers (2022-08-01T11:33:12Z) - Safe Reinforcement Learning via Confidence-Based Filters [78.39359694273575]
We develop a control-theoretic approach for certifying state safety constraints for nominal policies learned via standard reinforcement learning techniques.
We provide formal safety guarantees, and empirically demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2022-07-04T11:43:23Z) - Benchmarks for Deep Off-Policy Evaluation [152.28569758144022]
We present a collection of policies that can be used for benchmarking off-policy evaluation.
The goal of our benchmark is to provide a standardized measure of progress that is motivated from a set of principles.
We provide open-source access to our data and code to foster future research in this area.
arXiv Detail & Related papers (2021-03-30T18:09:33Z) - Causal-aware Safe Policy Improvement for Task-oriented dialogue [45.88777832381149]
We propose a batch RL framework for task-oriented dialogue policy learning: causal safe policy improvement (CASPI).
We demonstrate the effectiveness of this framework on a dialogue-context-to-text generation and end-to-end dialogue task of the MultiWOZ 2.0 dataset.
arXiv Detail & Related papers (2021-03-10T22:34:28Z) - Privacy-Constrained Policies via Mutual Information Regularized Policy Gradients [54.98496284653234]
We consider the task of training a policy that maximizes reward while minimizing disclosure of certain sensitive state variables through the actions.
We solve this problem by introducing a regularizer based on the mutual information between the sensitive state and the actions.
We develop a model-based estimator for optimization of privacy-constrained policies.
arXiv Detail & Related papers (2020-12-30T03:22:35Z)
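The mutual-information regularizer in the privacy-constrained entry above can be illustrated with a discrete empirical estimate: penalize the policy's return by how much its actions reveal about the sensitive state. This is a hedged sketch with hypothetical function names, not the paper's model-based estimator.

```python
# Sketch: estimate I(sensitive state; action) from joint samples and
# subtract it (scaled by lambda) from the average reward.
from collections import Counter
from math import log

def empirical_mi(pairs):
    """Empirical mutual information (in nats) between two discrete
    variables, estimated from a list of (s, a) joint samples."""
    n = len(pairs)
    joint = Counter(pairs)
    ps = Counter(s for s, _ in pairs)
    pa = Counter(a for _, a in pairs)
    mi = 0.0
    for (s, a), c in joint.items():
        p_sa = c / n
        mi += p_sa * log(p_sa / ((ps[s] / n) * (pa[a] / n)))
    return mi

def regularized_return(rewards, sensitive_action_pairs, lam=0.1):
    # Average reward minus lambda times the estimated information the
    # actions leak about the sensitive state (lam trades reward vs privacy).
    return sum(rewards) / len(rewards) - lam * empirical_mi(sensitive_action_pairs)
```

If actions are independent of the sensitive state the penalty is zero; if they deterministically reveal a binary sensitive state, the penalty is log 2 nats, the maximum for that case.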
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.