Scaling Policy Compliance Assessment in Language Models with Policy Reasoning Traces
- URL: http://arxiv.org/abs/2509.23291v1
- Date: Sat, 27 Sep 2025 13:10:21 GMT
- Title: Scaling Policy Compliance Assessment in Language Models with Policy Reasoning Traces
- Authors: Joseph Marvin Imperial, Harish Tayyar Madabushi
- Abstract summary: Policy Reasoning Traces (PRT) are specialized, generated reasoning chains that serve as a reasoning bridge to improve an LLM's policy compliance assessment capabilities. Our empirical evaluations demonstrate that using PRTs in both inference-time and training-time scenarios significantly enhances the performance of open-weight and commercial models.
- Score: 12.671657542087624
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Policy compliance assessment is a fundamental task of evaluating whether an input case strictly complies with a set of human-defined rules, more generally known as policies. In practice, human experts follow a systematic, step-by-step process to identify violations with respect to specific stipulations outlined in the policy. However, such documentation of gold-standard, expert-level reasoning processes is costly to acquire. In this paper, we introduce Policy Reasoning Traces (PRT), a form of specialized, generated reasoning chains that serve as a reasoning bridge to improve an LLM's policy compliance assessment capabilities. Our empirical evaluations demonstrate that using PRTs in both inference-time and training-time scenarios significantly enhances the performance of open-weight and commercial models, setting a new state of the art for HIPAA and GDPR policies. Beyond accuracy gains, we also show that PRTs improve an LLM's ability to accurately cite policy clauses and influence compliance decisions, as evidenced by their high utilization within the models' raw chains of thought.
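Concretely, the inference-time setting can be pictured as a two-stage prompt: first elicit a clause-by-clause reasoning trace, then condition the final verdict on it. The sketch below is a minimal illustration of that pattern, assuming a generic `llm_complete(prompt) -> str` wrapper; the prompts, trace format, and function names are hypothetical and not reproduced from the paper.

```python
def llm_complete(prompt: str) -> str:
    """Placeholder for any chat/completions API call (hypothetical)."""
    raise NotImplementedError

def assess_compliance(policy: str, case: str) -> str:
    # Step 1: elicit a Policy Reasoning Trace: a step-by-step check of
    # the case against each relevant clause of the policy.
    trace_prompt = (
        "You are a compliance expert.\n"
        f"Policy:\n{policy}\n\nCase:\n{case}\n\n"
        "Walk through the policy clause by clause, citing each clause "
        "and stating whether the case satisfies or violates it."
    )
    trace = llm_complete(trace_prompt)

    # Step 2: condition the final verdict on the generated trace.
    verdict_prompt = (
        f"Policy:\n{policy}\n\nCase:\n{case}\n\n"
        f"Reasoning trace:\n{trace}\n\n"
        "Answer with exactly one word: COMPLIANT or NON-COMPLIANT."
    )
    return llm_complete(verdict_prompt).strip()
```

In the training-time scenario, traces like `trace` above would instead be generated offline and used as fine-tuning targets.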
Related papers
- Autonomous Agents and Policy Compliance: A Framework for Reasoning About Penalties [1.2891210250935148]
This paper presents a logic programming-based framework for policy-aware autonomous agents that can reason about potential penalties for non-compliance. Our framework extends Gelfond and Lobo's Authorization and Obligation Policy Language (AOPL) to incorporate penalties. Our method ensures well-formed policies, accounts for policy priorities, and enhances explainability by explicitly identifying rule violations.
arXiv Detail & Related papers (2025-12-03T16:29:09Z)
- Pragmatic Policy Development via Interpretable Behavior Cloning [6.177449809243359]
We propose deriving treatment policies from the most frequently chosen actions in each patient state, as estimated by an interpretable model of the behavior policy. We demonstrate that policies derived under this framework can outperform current practice, offering interpretable alternatives to those obtained via offline RL.
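As a rough illustration, the sketch below derives a policy by taking the most frequently chosen action in each (discretized) patient state from logged clinician decisions; the data and names are illustrative, not the paper's, and the interpretable behavior-policy model is abstracted away here.

```python
from collections import Counter, defaultdict

def derive_policy(logged_decisions):
    """Map each state to the action chosen most often in it."""
    counts = defaultdict(Counter)
    for state, action in logged_decisions:
        counts[state][action] += 1
    return {state: ctr.most_common(1)[0][0] for state, ctr in counts.items()}

log = [("s1", "drug_A"), ("s1", "drug_A"), ("s1", "drug_B"), ("s2", "drug_B")]
print(derive_policy(log))  # {'s1': 'drug_A', 's2': 'drug_B'}
```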
arXiv Detail & Related papers (2025-07-22T22:34:35Z)
- Learning Deterministic Policies with Policy Gradients in Constrained Markov Decision Processes [59.27926064817273]
We introduce an exploration-agnostic algorithm, called C-PG, which enjoys global last-iterate convergence guarantees under domination assumptions. We empirically validate both the action-based (C-PGAE) and parameter-based (C-PGPE) variants of C-PG on constrained control tasks.
arXiv Detail & Related papers (2025-06-06T10:29:05Z)
- SPoRt -- Safe Policy Ratio: Certified Training and Deployment of Task Policies in Model-Free RL [54.022106606140774]
We present theoretical results that place a bound on the probability of violating a safety property for a new task-specific policy in a model-free, episodic setting. This bound can be applied to temporally-extended properties (beyond safety) and to robust control problems. We present experimental results demonstrating this trade-off and comparing the theoretical bound to posterior bounds derived from empirical violation rates.
arXiv Detail & Related papers (2025-04-08T19:09:07Z)
- Rule-Guided Reinforcement Learning Policy Evaluation and Improvement [9.077163856137505]
LEGIBLE is a novel approach to improving deep reinforcement learning policies. It starts by mining rules from a deep RL policy, constituting a partially symbolic representation. In the second step, we generalize the mined rules using domain knowledge expressed as metamorphic relations. The third step evaluates the generalized rules to determine which generalizations improve performance when enforced.
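To make the first (mining) step concrete, the sketch below fits a shallow decision tree to a policy's state-action pairs and prints the extracted rules; the tree is a stand-in rule miner over assumed toy data, not LEGIBLE's actual mining or metamorphic-generalization machinery.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
states = rng.normal(size=(500, 4))                       # observed states
actions = (states[:, 0] + states[:, 2] > 0).astype(int)  # the policy's actions

# Fit a shallow tree to (state, action) pairs and read off symbolic rules.
tree = DecisionTreeClassifier(max_depth=3).fit(states, actions)
print(export_text(tree, feature_names=[f"x{i}" for i in range(4)]))
```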
arXiv Detail & Related papers (2025-03-12T11:13:08Z)
- Few-shot Policy (de)composition in Conversational Question Answering [54.259440408606515]
We propose a neuro-symbolic framework to detect policy compliance using large language models (LLMs) in a few-shot setting. We show that our approach soundly reasons about policy compliance conversations by extracting sub-questions to be answered, assigning truth values from contextual information, and explicitly producing a set of logic statements from the given policies. We apply this approach to the popular policy compliance detection (PCD) and conversational machine reading benchmark ShARC, and show competitive performance with no task-specific finetuning.
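The decompose-then-evaluate pattern can be pictured as follows; in the actual framework the sub-questions and their truth values come from an LLM in a few-shot setting, so the hand-written values and the example policy logic below are stand-ins.

```python
# Sub-questions extracted from a policy, with truth values assigned
# from the conversation context (both LLM-produced in the framework).
sub_questions = {
    "over_18": True,
    "resident": True,
    "prior_claim": False,
}

# Logic statement derived from the policy:
#   eligible := over_18 AND resident AND NOT prior_claim
eligible = (
    sub_questions["over_18"]
    and sub_questions["resident"]
    and not sub_questions["prior_claim"]
)
print("compliant" if eligible else "non-compliant")  # -> compliant
```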
arXiv Detail & Related papers (2025-01-20T08:40:15Z)
- Understanding Reference Policies in Direct Preference Optimization [50.67309013764383]
Direct Preference Optimization (DPO) has become a widely used training method for the instruction fine-tuning of large language models (LLMs).
This work explores an under-investigated aspect of DPO - its dependency on the reference model or policy.
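For context, the standard DPO objective makes this dependency explicit: the reference policy anchors the log-probability margins of both the chosen and rejected responses. A minimal PyTorch sketch, assuming per-sequence log-probabilities are precomputed:

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """All inputs are summed sequence log-probs of shape (batch,)."""
    # The reference policy anchors both margins; this work studies how
    # the choice of that anchor affects training.
    chosen_margin = pi_chosen - ref_chosen
    rejected_margin = pi_rejected - ref_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```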
arXiv Detail & Related papers (2024-07-18T17:08:10Z)
- On the Value of Myopic Behavior in Policy Reuse [67.37788288093299]
Leveraging learned strategies in unfamiliar scenarios is fundamental to human intelligence.
In this work, we present a framework called Selective Myopic bEhavior Control (SMEC).
SMEC adaptively aggregates the sharable short-term behaviors of prior policies and the long-term behaviors of the task policy, leading to coordinated decisions.
arXiv Detail & Related papers (2023-05-28T03:59:37Z)
- Policy Learning Using Weak Supervision [18.540550726629995]
We aim for a unified framework that leverages available, inexpensive weak supervision to perform policy learning efficiently.
Our approach explicitly punishes a policy for overfitting to the weak supervision.
In addition to theoretical guarantees, extensive evaluations on tasks including RL with noisy rewards, behavioral cloning (BC) with weak demonstrations, and standard policy co-training show that our method leads to substantial performance improvements.
arXiv Detail & Related papers (2020-10-05T02:26:08Z)
- Expert-Supervised Reinforcement Learning for Offline Policy Learning and Evaluation [21.703965401500913]
We propose an Expert-Supervised RL (ESRL) framework which uses uncertainty quantification for offline policy learning.
In particular, we make three contributions: 1) the method can learn safe and optimal policies through hypothesis testing, 2) ESRL allows for different levels of risk-averse implementation tailored to the application context, and 3) we propose a way to interpret ESRL's policy at every state through posterior distributions.
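As a toy sketch of the uncertainty-gated idea, assuming a bootstrap ensemble stands in for the posterior over action values; the threshold and the test below are illustrative, not ESRL's exact hypothesis test.

```python
import numpy as np

def select_action(q_ensemble, state, expert_action, risk_level=0.95):
    """Act greedily only when the posterior confidently beats the expert."""
    q_values = np.array([q_fn(state) for q_fn in q_ensemble])  # (models, actions)
    candidate = int(q_values.mean(axis=0).argmax())
    # Posterior probability that the candidate action beats the expert's.
    p_better = float((q_values[:, candidate] >= q_values[:, expert_action]).mean())
    # A lower risk_level makes the agent more willing to deviate.
    return candidate if p_better >= risk_level else expert_action
```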
arXiv Detail & Related papers (2020-06-23T17:43:44Z)
- Efficient Evaluation of Natural Stochastic Policies in Offline Reinforcement Learning [80.42316902296832]
We study the efficient off-policy evaluation of natural policies, which are defined in terms of deviations from the behavior policy.
This is a departure from the literature on off-policy evaluation, where most work considers the evaluation of explicitly specified policies.
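For contrast, the basic trajectory-wise importance-sampling estimator for an explicitly specified policy looks as follows; the paper's contribution is efficient estimation for deviation-defined (natural) policies, which this simple baseline does not cover.

```python
import numpy as np

def is_estimate(trajectories, pi, b):
    """trajectories: list of [(state, action, reward), ...];
    pi(a, s), b(a, s): target and behavior action probabilities."""
    values = []
    for traj in trajectories:
        ratio = np.prod([pi(a, s) / b(a, s) for s, a, _ in traj])
        values.append(ratio * sum(r for _, _, r in traj))
    return float(np.mean(values))
```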
arXiv Detail & Related papers (2020-06-06T15:08:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.