Multimodal Policy Internalization for Conversational Agents
- URL: http://arxiv.org/abs/2510.09474v1
- Date: Fri, 10 Oct 2025 15:28:30 GMT
- Title: Multimodal Policy Internalization for Conversational Agents
- Authors: Zhenhailong Wang, Jiateng Liu, Amin Fazel, Ritesh Sarkhel, Xing Fan, Xiang Li, Chenlei Guo, Heng Ji, Ruhi Sarikaya
- Abstract summary: Multimodal Policy Internalization (MPI) is a new task that internalizes reasoning-intensive multimodal policies into model parameters. We build two datasets spanning synthetic and real-world decision-making and tool-using tasks. TriMPI achieves notable gains in end-to-end accuracy, generalization, and robustness.
- Score: 48.11601444262434
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern conversational agents like ChatGPT and Alexa+ rely on predefined policies specifying metadata, response styles, and tool-usage rules. As these LLM-based systems expand to support diverse business and user queries, such policies, often implemented as in-context prompts, are becoming increasingly complex and lengthy, making faithful adherence difficult and imposing large fixed computational costs. With the rise of multimodal agents, policies that govern visual and multimodal behaviors are critical but remain understudied. Prior prompt-compression work mainly shortens task templates and demonstrations, while existing policy-alignment studies focus only on text-based safety rules. We introduce Multimodal Policy Internalization (MPI), a new task that internalizes reasoning-intensive multimodal policies into model parameters, enabling stronger policy-following without including the policy during inference. MPI poses unique data and algorithmic challenges. We build two datasets spanning synthetic and real-world decision-making and tool-using tasks and propose TriMPI, a three-stage training framework. TriMPI first injects policy knowledge via continual pretraining, then performs supervised finetuning, and finally applies PolicyRollout, a GRPO-style reinforcement learning extension that augments rollouts with policy-aware responses for grounded exploration. TriMPI achieves notable gains in end-to-end accuracy, generalization, and robustness to forgetting. As the first work on multimodal policy internalization, we provide datasets, training recipes, and comprehensive evaluations to foster future research. Project page: https://mikewangwzhl.github.io/TriMPI.
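The abstract describes PolicyRollout as a GRPO-style RL extension that augments the rollout group with policy-aware responses. A minimal sketch of that idea is below; the function names, samplers, and group structure are illustrative assumptions, since the abstract does not specify the actual implementation:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: normalize each rollout's
    reward against the mean and standard deviation of its group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

def policy_rollout_group(sample, sample_with_policy, n_plain, n_aug):
    """Build one rollout group mixing ordinary samples from the current
    model with 'policy-aware' samples generated while the policy text is
    still in context (the PolicyRollout idea, as described abstractly)."""
    return ([sample() for _ in range(n_plain)]
            + [sample_with_policy() for _ in range(n_aug)])

# Toy usage with stand-in samplers returning (response, reward) pairs.
plain = lambda: ("resp", 0.2)    # model prompted without the policy
aware = lambda: ("resp*", 0.9)   # model prompted with the policy in context
group = policy_rollout_group(plain, aware, n_plain=3, n_aug=1)
advantages = grpo_advantages([reward for _, reward in group])
```

The policy-aware rollouts earn higher reward in this toy setup, so they receive positive group-relative advantage, which is one way grounded exploration could steer updates toward policy-following behavior.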
Related papers
- VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning [30.278740496355507]
We propose a novel multi-agent system for video understanding, namely VideoChat-M1. Instead of using a single or fixed policy, VideoChat-M1 adopts a distinct Collaborative Policy Planning (CPP) paradigm with multiple policy agents. We show that VideoChat-M1 achieves SOTA performance across eight benchmarks spanning four tasks.
arXiv Detail & Related papers (2025-11-24T07:04:51Z) - Pluralistic Behavior Suite: Stress-Testing Multi-Turn Adherence to Custom Behavioral Policies [18.428149174461264]
We present PBSUITE, a dynamic evaluation suite designed to assess large language models' capacity to adhere to pluralistic alignment specifications. We find that leading open- and closed-source LLMs maintain robust adherence to behavioral policies in single-turn settings, but their compliance weakens substantially in multi-turn adversarial interactions.
arXiv Detail & Related papers (2025-11-07T06:43:01Z) - Discovering Interpretable Programmatic Policies via Multimodal LLM-assisted Evolutionary Search [21.02398143073197]
Interpretability and high performance are essential goals in designing control policies, particularly for safety-critical tasks. This work introduces a novel approach for programmatic policy discovery, called Multimodal Large Language Model-assisted Search (MLES). MLES utilizes multimodal large language models as policy generators, combining them with evolutionary mechanisms for automatic policy optimization.
arXiv Detail & Related papers (2025-08-07T14:24:03Z) - Learning Long-Context Diffusion Policies via Past-Token Prediction [48.86967836229684]
We propose an alternative approach that explicitly regularizes the retention of past information. We introduce Past-Token Prediction, an auxiliary task in which the policy learns to predict past action tokens alongside future ones. Experiments across four real-world and six simulated tasks demonstrate that our proposed method improves the performance of long-context diffusion policies by 3x and accelerates policy training by more than 10x.
arXiv Detail & Related papers (2025-05-14T17:00:47Z) - Simulation-Free Hierarchical Latent Policy Planning for Proactive Dialogues [31.92843134331582]
We introduce a novel dialogue policy planning framework, LDPP. It fully automates the process from mining policies in dialogue records to learning policy planning. Our experiments demonstrate that LDPP outperforms existing methods on two proactive scenarios.
arXiv Detail & Related papers (2024-12-19T07:06:01Z) - Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization [53.510942601223626]
Large Language Models (LLMs) exhibit robust problem-solving capabilities for diverse tasks.
These task solvers necessitate manually crafted prompts to inform task rules and regulate behaviors.
We propose Agent-Pro: an LLM-based Agent with Policy-level Reflection and Optimization.
arXiv Detail & Related papers (2024-02-27T15:09:20Z) - Residual Q-Learning: Offline and Online Policy Customization without Value [53.47311900133564]
Imitation Learning (IL) is a widely used framework for learning imitative behavior from demonstrations.
We formulate a new problem setting called policy customization.
We propose a novel framework, Residual Q-learning, which can solve the formulated MDP by leveraging the prior policy.
arXiv Detail & Related papers (2023-06-15T22:01:19Z) - Constructing a Good Behavior Basis for Transfer using Generalized Policy Updates [63.58053355357644]
We study the problem of learning a good set of policies, so that when combined together, they can solve a wide variety of unseen reinforcement learning tasks.
We show theoretically that having access to a specific set of diverse policies, which we call a set of independent policies, can allow for instantaneously achieving high-level performance.
arXiv Detail & Related papers (2021-12-30T12:20:46Z) - Goal-Conditioned Reinforcement Learning with Imagined Subgoals [89.67840168694259]
We propose to incorporate imagined subgoals into policy learning to facilitate learning of complex tasks.
Imagined subgoals are predicted by a separate high-level policy, which is trained simultaneously with the policy and its critic.
We evaluate our approach on complex robotic navigation and manipulation tasks and show that it outperforms existing methods by a large margin.
arXiv Detail & Related papers (2021-07-01T15:30:59Z) - Continuous Action Reinforcement Learning from a Mixture of Interpretable Experts [35.80418547105711]
We propose a policy scheme that retains a complex function approximator for its internal value predictions but constrains the policy to have a concise, hierarchical, and human-readable structure.
The main technical contribution of the paper is to address the challenges introduced by this non-differentiable state selection procedure.
arXiv Detail & Related papers (2020-06-10T16:02:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.