Customize Multi-modal RAI Guardrails with Precedent-based predictions
- URL: http://arxiv.org/abs/2507.20503v1
- Date: Mon, 28 Jul 2025 03:45:34 GMT
- Title: Customize Multi-modal RAI Guardrails with Precedent-based predictions
- Authors: Cheng-Fu Yang, Thanh Tran, Christos Christodoulopoulos, Weitong Ruan, Rahul Gupta, Kai-Wei Chang,
- Abstract summary: A multi-modal guardrail must effectively filter image content based on user-defined policies. Existing fine-tuning methods typically condition predictions on pre-defined policies. We propose to condition the model's judgment on "precedents", which are the reasoning processes of prior data points similar to the given input.
- Score: 55.63757336900865
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A multi-modal guardrail must effectively filter image content based on user-defined policies, identifying material that may be hateful, reinforce harmful stereotypes, contain explicit material, or spread misinformation. Deploying such guardrails in real-world applications, however, poses significant challenges. Users often require varied and highly customizable policies and typically cannot provide abundant examples for each custom policy. Consequently, an ideal guardrail should be scalable to multiple policies and adaptable to evolving user standards with minimal retraining. Existing fine-tuning methods typically condition predictions on pre-defined policies, restricting their generalizability to new policies or necessitating extensive retraining to adapt. Conversely, training-free methods struggle with limited context lengths, making it difficult to incorporate all the policies comprehensively. To overcome these limitations, we propose to condition the model's judgment on "precedents", which are the reasoning processes of prior data points similar to the given input. By leveraging precedents instead of fixed policies, our approach greatly enhances the flexibility and adaptability of the guardrail. In this paper, we introduce a critique-revise mechanism for collecting high-quality precedents and two strategies that utilize precedents for robust prediction. Experimental results demonstrate that our approach outperforms previous methods across both few-shot and full-dataset scenarios and exhibits superior generalization to novel policies.
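As a rough illustration of the precedent-conditioning idea, the Python sketch below retrieves the most similar stored precedents by embedding similarity and packs their reasoning traces into a prompt for the guardrail model. The retrieval scheme, the `precedents` record layout, and the prompt format are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def retrieve_precedents(query_emb, precedent_embs, precedents, k=3):
    # Rank stored precedents by cosine similarity to the query embedding.
    sims = precedent_embs @ query_emb / (
        np.linalg.norm(precedent_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8
    )
    return [precedents[i] for i in np.argsort(-sims)[:k]]

def build_precedent_prompt(new_input, precedents):
    # Condition the judgment on retrieved reasoning traces instead of the
    # full policy text, sidestepping context-length limits.
    blocks = "\n\n".join(
        f"Precedent {i + 1}:\nInput: {p['input']}\nReasoning: {p['reasoning']}\nVerdict: {p['verdict']}"
        for i, p in enumerate(precedents)
    )
    return (
        f"{blocks}\n\nNew input: {new_input}\n"
        "Following the precedents above, reason step by step, then answer SAFE or UNSAFE."
    )
```

Per the abstract, the precedent store itself would be populated by a critique-revise loop: generate a reasoning trace, critique it, and keep only the revised, high-quality traces.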
Related papers
- EXPO: Stable Reinforcement Learning with Expressive Policies [74.30151915786233]
We propose a sample-efficient online reinforcement learning algorithm to maximize value with two parameterized policies. Our approach yields up to 2-3x improvement in sample efficiency on average over prior methods.
arXiv Detail & Related papers (2025-07-10T17:57:46Z)
- Optimal Single-Policy Sample Complexity and Transient Coverage for Average-Reward Offline RL [6.224756774400233]
We study offline reinforcement learning in average-reward MDPs, which presents increased challenges from the perspectives of distribution shift and non-uniform coverage. We develop sharp guarantees depending only on the target policy, specifically the bias span and a novel policy hitting radius, yielding the first fully single-policy sample complexity bound for average-reward offline RL.
arXiv Detail & Related papers (2025-06-26T00:22:39Z)
- SNPL: Simultaneous Policy Learning and Evaluation for Safe Multi-Objective Policy Improvement [33.60500554561509]
To design effective digital interventions, experimenters face the challenge of learning decision policies that balance multiple objectives using offline data. To provide credible recommendations, experimenters must not only identify policies that satisfy the desired changes in goal and guardrail outcomes, but also offer probabilistic guarantees about the changes these policies induce. In this paper, we provide safe noisy policy learning (SNPL), a novel approach that leverages the concept of algorithmic stability to address these challenges.
arXiv Detail & Related papers (2025-03-17T02:53:53Z)
- Foundation Policies with Hilbert Representations [54.44869979017766]
We propose an unsupervised framework to pre-train generalist policies from unlabeled offline data.
Our key insight is to learn a structured representation that preserves the temporal structure of the underlying environment.
Our experiments show that our unsupervised policies can solve goal-conditioned and general RL tasks in a zero-shot fashion.
arXiv Detail & Related papers (2024-02-23T19:09:10Z)
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
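One plausible reading of this estimator, sketched below: smooth each policy's action probabilities with a kernel over action embeddings, then form the usual inverse-propensity-scoring ratio from the smoothed ("convolved") propensities. The Gaussian kernel and the exact smoothing form are illustrative guesses, not the paper's construction.

```python
import numpy as np

def convolve_policy(probs, action_embs, bandwidth=1.0):
    # Kernel-smooth per-context action probabilities over an
    # action-similarity kernel; rows of K sum to 1, so each smoothed
    # row remains a valid probability distribution.
    d2 = ((action_embs[:, None, :] - action_embs[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * bandwidth**2))
    K /= K.sum(axis=1, keepdims=True)
    return probs @ K  # (n_contexts, n_actions)

def pc_ips_estimate(rewards, actions, pi_log, pi_tgt, action_embs, bandwidth=1.0):
    # IPS with convolved propensities: trades some bias for lower
    # variance when the action space is large.
    sm_log = convolve_policy(pi_log, action_embs, bandwidth)
    sm_tgt = convolve_policy(pi_tgt, action_embs, bandwidth)
    idx = np.arange(len(actions))
    w = sm_tgt[idx, actions] / np.clip(sm_log[idx, actions], 1e-8, None)
    return float(np.mean(w * rewards))
```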
- Local Policy Improvement for Recommender Systems [8.617221361305901]
We show how to train a new policy given data collected from a previously-deployed policy.
We suggest an alternative approach of local policy improvement without off-policy correction.
This local policy improvement paradigm is ideal for recommender systems, as previous policies are typically of decent quality and policies are updated frequently.
arXiv Detail & Related papers (2022-12-22T00:47:40Z)
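A generic instance of this idea, sketched below: reward-weighted behavior cloning on the logged interactions, which improves on the deployed policy locally without any importance-weight correction. This is an illustrative stand-in, not necessarily the paper's objective.

```python
import torch
import torch.nn.functional as F

def local_improvement_loss(policy_logits, logged_actions, logged_rewards):
    # Reward-weighted cross-entropy on logged data: up-weight actions that
    # earned reward under the previous policy, which keeps the new policy
    # close to where the data was actually collected.
    logp = F.log_softmax(policy_logits, dim=-1)
    chosen = logp.gather(1, logged_actions.unsqueeze(1)).squeeze(1)
    w = torch.clamp(logged_rewards, min=0.0)  # e.g., clicks or purchases
    return -(w * chosen).mean()
```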
- Reliable Decision from Multiple Subtasks through Threshold Optimization: Content Moderation in the Wild [7.176020195419459]
Social media platforms struggle to protect users from harmful content through content moderation.
These platforms have recently leveraged machine learning models to cope with the vast amount of user-generated content daily.
Third-party content moderation services provide prediction scores of multiple subtasks, such as predicting the existence of underage personnel, rude gestures, or weapons.
We introduce a simple yet effective threshold optimization method that searches the optimal thresholds of the multiple subtasks to make a reliable moderation decision in a cost-effective way.
arXiv Detail & Related papers (2022-08-16T03:51:43Z)
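A minimal version of such a threshold search is sketched below, assuming per-subtask scores from a third-party service and an any-subtask-exceeds decision rule (both assumptions for illustration).

```python
import numpy as np
from itertools import product
from sklearn.metrics import f1_score

def search_thresholds(scores, labels, grid=np.linspace(0.1, 0.9, 9)):
    # scores: (n_samples, n_subtasks) third-party prediction scores
    # labels: (n_samples,) 1 = item should be moderated
    # Decision rule: flag the item if ANY subtask exceeds its threshold.
    best_f1, best_t = -1.0, None
    for t in product(grid, repeat=scores.shape[1]):
        preds = (scores >= np.asarray(t)).any(axis=1).astype(int)
        f1 = f1_score(labels, preds)
        if f1 > best_f1:
            best_f1, best_t = f1, t
    return best_t, best_f1
```

Exhaustive search grows exponentially with the number of subtasks; optimizing one threshold at a time (coordinate-wise) is a common cheap alternative.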
- Conformal Off-Policy Prediction in Contextual Bandits [54.67508891852636]
Conformal off-policy prediction can output reliable predictive intervals for the outcome under a new target policy.
We provide theoretical finite-sample guarantees without making any additional assumptions beyond the standard contextual bandit setup.
arXiv Detail & Related papers (2022-06-09T10:39:33Z)
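A simplified sketch of the construction, assuming split conformal prediction with importance-weighted calibration residuals; for brevity it omits the test-point weight term used in full weighted conformal prediction.

```python
import numpy as np

def weighted_quantile(values, weights, q):
    # Quantile of the discrete distribution putting `weights` mass on `values`.
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cdf = np.cumsum(w) / np.sum(w)
    return v[min(np.searchsorted(cdf, q), len(v) - 1)]

def conformal_interval(y_cal, mu_cal, w_cal, mu_test, alpha=0.1):
    # Split-conformal interval for the outcome under a target policy.
    # Calibration residuals are reweighted by importance ratios
    # w = pi_target(a|x) / pi_logging(a|x) to account for the policy shift.
    resid = np.abs(y_cal - mu_cal)                     # nonconformity scores
    qhat = weighted_quantile(resid, w_cal, 1 - alpha)  # weighted quantile
    return mu_test - qhat, mu_test + qhat
```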
- Constructing a Good Behavior Basis for Transfer using Generalized Policy Updates [63.58053355357644]
We study the problem of learning a good set of policies, so that when combined together, they can solve a wide variety of unseen reinforcement learning tasks.
We show theoretically that having access to a specific set of diverse policies, which we call a set of independent policies, can allow for instantaneously achieving high-level performance.
arXiv Detail & Related papers (2021-12-30T12:20:46Z)
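This line of work builds on successor features and generalized policy improvement (GPI); a minimal version of the standard GPI action rule is sketched below, with the array shapes as assumptions.

```python
import numpy as np

def gpi_action(psi, w):
    # Generalized policy improvement over a basis of policies at one state.
    # psi: (n_policies, n_actions, d) successor features of each basis policy
    # w:   (d,) reward weights of the new task, so Q_i(a) = psi[i, a] @ w
    # Acting greedily on the max over basis policies is guaranteed to do at
    # least as well as each individual basis policy on the new task.
    q = psi @ w  # (n_policies, n_actions)
    return int(q.max(axis=0).argmax())
```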
- First Order Constrained Optimization in Policy Space [19.00289722198614]
We propose a novel approach called First Order Constrained Optimization in Policy Space (FOCOPS).
FOCOPS maximizes an agent's overall reward while ensuring the agent satisfies a set of cost constraints.
We provide empirical evidence that our simple approach achieves better performance on a set of constrained robotic locomotion tasks.
arXiv Detail & Related papers (2020-02-16T05:07:17Z)
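A simplified first-order sketch in the spirit of FOCOPS (not the paper's exact derivation, which solves the constrained update in nonparametric policy space and then projects back): combine the reward and cost advantages through a dual variable and keep the update near the old policy with a KL-style penalty.

```python
import torch

def focops_style_loss(logp_new, logp_old, adv, cost_adv, nu, lam=1.0):
    # Maximize reward advantage minus nu * cost advantage; the non-negative
    # KL estimator (ratio - 1 - log ratio) keeps the new policy close to the
    # old one, so a plain first-order optimizer suffices.
    log_ratio = logp_new - logp_old
    ratio = log_ratio.exp()
    kl_est = ratio - 1.0 - log_ratio
    return -(ratio * (adv - nu * cost_adv)).mean() + lam * kl_est.mean()

def update_nu(nu, avg_episode_cost, cost_limit, lr=0.01):
    # Dual ascent on the cost constraint: raise nu when the constraint is
    # violated, let it decay toward zero otherwise.
    return max(0.0, nu + lr * (avg_episode_cost - cost_limit))
```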