Analyzing and Internalizing Complex Policy Documents for LLM Agents
- URL: http://arxiv.org/abs/2510.11588v1
- Date: Mon, 13 Oct 2025 16:30:07 GMT
- Title: Analyzing and Internalizing Complex Policy Documents for LLM Agents
- Authors: Jiateng Liu, Zhenhailong Wang, Xiaojiang Huang, Yingjie Li, Xing Fan, Xiang Li, Chenlei Guo, Ruhi Sarikaya, Heng Ji,
- Abstract summary: Large Language Model (LLM)-based agentic systems rely on in-context policy documents encoding diverse business rules.<n>This motivates developing internalization methods that embed policy documents into model priors while preserving performance.<n>We introduce CC-Gen, an agentic benchmark generator with Controllable Complexity across four levels.
- Score: 53.14898416858099
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Model (LLM)-based agentic systems rely on in-context policy documents encoding diverse business rules. As requirements grow, these documents expand rapidly, causing high computational overhead. This motivates developing internalization methods that embed policy documents into model priors while preserving performance. Prior prompt compression work targets generic prompts, but agentic policy documents span multiple complexity levels and require deeper reasoning, making internalization harder. We introduce CC-Gen, an agentic benchmark generator with Controllable Complexity across four levels, enabling systematic evaluation of agents' ability to handle complexity and offering a unified framework for assessing policy internalization. Our analysis shows that complex policy specifications governing workflows pose major reasoning challenges. Supporting internalization with gold user agent interaction trajectories containing chain-of-thought (CoT) annotations via supervised fine-tuning (SFT) is data-intensive and degrades sharply as policy complexity increases. To mitigate data and reasoning burdens, we propose Category-Aware Policy Continued Pretraining (CAP-CPT). Our automated pipeline parses policy documents to extract key specifications, grouping them into factual, behavioral, and conditional categories, and isolating complex conditions that drive workflow complexity. This guides targeted data synthesis and enables agents to internalize policy information through an autoregressive pretraining loss. Experiments show CAP-CPT improves SFT baselines in all settings, with up to 41% and 22% gains on Qwen-3-32B, achieving 97.3% prompt length reduction on CC-Gen and further enhancing tau-Bench with minimal SFT data.
Related papers
- Policy Compiler for Secure Agentic Systems [20.346157626726725]
We present PCAS, a Policy Compiler for Agentic Systems that provides deterministic policy enforcement.<n>We evaluate PCAS on three case studies: information flow policies for prompt injection defense, approval in a multi-agent pharmacovigilance system, and organizational policies for customer service.
arXiv Detail & Related papers (2026-02-18T18:57:12Z) - DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing [53.85037373860246]
We introduce Deep Synth-Eval, a benchmark designed to objectively evaluate information consolidation capabilities.<n>We propose a fine-grained evaluation protocol using General Checklists (for factual coverage) and Constraint Checklists (for structural organization)<n>Our results demonstrate that agentic plan-and-write significantly outperform single-turn generation.
arXiv Detail & Related papers (2026-01-07T03:07:52Z) - Interact-RAG: Reason and Interact with the Corpus, Beyond Black-Box Retrieval [49.85856484781787]
We introduce Interact-RAG, a new paradigm that elevates the LLM agent into an active manipulator of the retrieval process.<n>We develop a reasoning-enhanced workflow, which enables both zero-shot execution and the synthesis of interaction trajectories.<n>Experiments across six benchmarks demonstrate that Interact-RAG significantly outperforms other advanced methods.
arXiv Detail & Related papers (2025-10-31T15:48:43Z) - AgentAsk: Multi-Agent Systems Need to Ask [26.13279490836716]
Multi-agent systems built on large language models (LLMs) promise enhanced problem-solving capabilities through collaborative division of labor.<n>We propose AgentAsk, a lightweight and plug-and-play clarification module that treats every inter-agent message as a potential failure point and inserts minimally necessary questions to arrest error propagation.<n>AgentAsk consistently improves accuracy and robustness over public multi-agent implementations while keeping overhead minimal, with latency and extra cost all less than 5%.
arXiv Detail & Related papers (2025-10-08T22:36:05Z) - Transparent, Evaluable, and Accessible Data Agents: A Proof-of-Concept Framework [0.0]
This article presents a modular, component-based architecture for developing and evaluating AI agents.<n>The system addresses core challenges in data accessibility by enabling non-technical users to interact with complex data warehouses.<n>A cornerstone of the design is its commitment to transparent decision-making, achieved through a multi-layered reasoning framework.
arXiv Detail & Related papers (2025-09-28T23:54:41Z) - Query-Centric Diffusion Policy for Generalizable Robotic Assembly [35.15799846535565]
We propose a hierarchical framework that bridges high-level planning and low-level control by utilizing queries comprising objects, contact points, and skill information.<n>We conduct comprehensive experiments on the FurnitureBench in both simulation and real-world settings, demonstrating improved performance in skill precision and long-horizon success rate.
arXiv Detail & Related papers (2025-09-23T06:10:46Z) - Enterprise AI Must Enforce Participant-Aware Access Control [9.68210477539956]
Large language models (LLMs) are increasingly deployed in enterprise settings where they interact with multiple users and are trained or fine-tuned on sensitive internal data.<n>We show that adversaries can exploit current fine-tuning and RAG architectures to leak sensitive information by leveraging the lack of access control enforcement.<n>We introduce a framework centered on the principle that any content used in training, retrieval, or generation by an LLM is explicitly authorized for emphall users involved in the interaction.
arXiv Detail & Related papers (2025-09-18T04:30:49Z) - Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol [83.83217247686402]
Large Language Models (LLMs) have evolved from simple text generators into complex software systems that integrate retrieval augmentation, tool invocation, and multi-turn interactions.<n>Their inherent non-determinism, dynamism, and context dependence pose fundamental challenges for quality assurance.<n>This paper decomposes LLM applications into a three-layer architecture: textbftextitSystem Shell Layer, textbftextitPrompt Orchestration Layer, and textbftextitLLM Inference Core.
arXiv Detail & Related papers (2025-08-28T13:00:28Z) - On Automating Security Policies with Contemporary LLMs [3.47402794691087]
In this paper, we present a framework for automating attack mitigation policy compliance through an innovative combination of in-context learning and retrieval-augmented generation (RAG)<n>Our empirical evaluation, conducted using publicly available CTI policies in STIXv2 format and Windows API documentation, demonstrates significant improvements in precision, recall, and F1-score when employing RAG compared to a non-RAG baseline.
arXiv Detail & Related papers (2025-06-05T09:58:00Z) - AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios [51.46347732659174]
Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications.<n>AgentIF is the first benchmark for systematically evaluating LLM instruction following ability in agentic scenarios.
arXiv Detail & Related papers (2025-05-22T17:31:10Z) - Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity [59.57065228857247]
Retrieval-augmented Large Language Models (LLMs) have emerged as a promising approach to enhancing response accuracy in several tasks, such as Question-Answering (QA)
We propose a novel adaptive QA framework, that can dynamically select the most suitable strategy for (retrieval-augmented) LLMs based on the query complexity.
We validate our model on a set of open-domain QA datasets, covering multiple query complexities, and show that ours enhances the overall efficiency and accuracy of QA systems.
arXiv Detail & Related papers (2024-03-21T13:52:30Z) - Distributed-Training-and-Execution Multi-Agent Reinforcement Learning
for Power Control in HetNet [48.96004919910818]
We propose a multi-agent deep reinforcement learning (MADRL) based power control scheme for the HetNet.
To promote cooperation among agents, we develop a penalty-based Q learning (PQL) algorithm for MADRL systems.
In this way, an agent's policy can be learned by other agents more easily, resulting in a more efficient collaboration process.
arXiv Detail & Related papers (2022-12-15T17:01:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.