C-IDS: Solving Contextual POMDP via Information-Directed Objective
- URL: http://arxiv.org/abs/2602.03939v1
- Date: Tue, 03 Feb 2026 19:00:34 GMT
- Title: C-IDS: Solving Contextual POMDP via Information-Directed Objective
- Authors: Chongyang Shi, Michael Dorothy, Jie Fu
- Abstract summary: We study the policy synthesis problem in contextual partially observable Markov decision processes. Our goal is to design a policy that simultaneously maximizes cumulative return and actively reduces uncertainty about the underlying context. We develop the C-IDS algorithm to synthesize policies that maximize the information-directed objective.
- Score: 10.82202704907442
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the policy synthesis problem in contextual partially observable Markov decision processes (CPOMDPs), where the environment is governed by an unknown latent context that induces distinct POMDP dynamics. Our goal is to design a policy that simultaneously maximizes cumulative return and actively reduces uncertainty about the underlying context. We introduce an information-directed objective that augments reward maximization with mutual information between the latent context and the agent's observations. We develop the C-IDS algorithm to synthesize policies that maximize the information-directed objective. We show that the objective can be interpreted as a Lagrangian relaxation of the linear information ratio and prove that the temperature parameter is an upper bound on the information ratio. Based on this characterization, we establish a sublinear Bayesian regret bound over K episodes. We evaluate our approach on a continuous Light-Dark environment and show that it consistently outperforms standard POMDP solvers that treat the unknown context as a latent state variable, achieving faster context identification and higher returns.
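Reading between the lines of the abstract, the objective appears to take the following mutual-information-augmented form; the notation below is our reconstruction, not the paper's:

```latex
J_{\lambda}(\pi) \;=\; \mathbb{E}_{\pi}\!\Big[\sum_{t=0}^{T-1} r(s_t, a_t)\Big] \;+\; \lambda\, I\big(C;\, O_{0:T}\big),
```

where C is the latent context, O_{0:T} the agent's observation history, and λ the temperature. On this reading, maximizing J_λ is a Lagrangian relaxation of a constraint coupling return to context information (the linear information ratio), and the abstract's claim that λ upper-bounds the information ratio is the familiar Lagrangian-duality argument.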
Related papers
- Learning to Decide with Just Enough: Information-Theoretic Context Summarization for CMDPs [23.111877248835736]
Contextual Markov Decision Processes (CMDPs) offer a framework for sequential decision-making under external signals. We propose an information-theoretic summarization approach that uses large language models (LLMs) to compress contextual inputs into low-dimensional, semantically rich summaries.
arXiv Detail & Related papers (2025-10-02T02:52:24Z) - Influence Guided Context Selection for Effective Retrieval-Augmented Generation [23.188397777606095]
Retrieval-Augmented Generation (RAG) addresses large language model (LLM) hallucinations by grounding responses in external knowledge. Existing approaches attempt to improve performance through context selection based on predefined context quality assessment metrics. We reconceptualize context quality assessment as an inference-time data valuation problem and introduce the Contextual Influence Value (CI value). This novel metric quantifies context quality by measuring the performance degradation when removing each context from the list.
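The leave-one-out reading of the CI value admits a short sketch. Everything here, including the `utility` callback, is a hypothetical interface for illustration, not the paper's code:

```python
def contextual_influence_values(contexts, utility):
    """CI value of each context: the utility drop when that context is removed.

    `utility` maps a context list to a scalar answer-quality score
    (hypothetical interface). A large positive value marks a context
    the generator actually relies on.
    """
    full = utility(contexts)  # score with the complete context list
    return [full - utility(contexts[:i] + contexts[i + 1:])
            for i in range(len(contexts))]
```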
arXiv Detail & Related papers (2025-09-21T07:19:09Z) - Differential Information Distribution: A Bayesian Perspective on Direct Preference Optimization [35.335072390336855]
We frame the goal of preference optimization as learning the differential information required to update a reference policy into a target policy. First, we find that DPO's log-ratio reward is uniquely justified when preferences encode the Differential Information needed to update a reference policy into the target policy. Second, we discuss how commonly observed training dynamics in DPO, including changes in log-likelihood and policy exploration, stem from a power-law DID relationship.
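For orientation, the log-ratio reward the summary refers to is DPO's implicit reward; a minimal sketch with our own names:

```python
def dpo_log_ratio_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """DPO's implicit reward r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x)).

    Inputs are the sequence log-probabilities of response y under the trained
    and reference policies; beta is the KL temperature.
    """
    return beta * (logp_policy - logp_ref)
```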
arXiv Detail & Related papers (2025-05-29T17:59:50Z) - Salience-Invariant Consistent Policy Learning for Generalization in Visual Reinforcement Learning [12.9372563969007]
Generalizing policies to unseen scenarios remains a critical challenge in visual reinforcement learning. In unseen environments, distracting pixels may lead agents to extract representations containing task-irrelevant information. We propose the Salience-Invariant Consistent Policy Learning algorithm, an efficient framework for zero-shot generalization.
arXiv Detail & Related papers (2025-02-12T12:00:16Z) - Certifiably Robust Policies for Uncertain Parametric Environments [57.2416302384766]
We propose a framework based on parametric Markov decision processes (MDPs) with unknown distributions over parameters. We learn and analyse interval MDPs (IMDPs) for a set of unknown sample environments induced by parameters. We show that our approach produces tight bounds on a policy's performance with high confidence.
arXiv Detail & Related papers (2024-08-06T10:48:15Z) - Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level.
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
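As a rough illustration of an entropy-augmented token-level update (our simplification; ETPO's exact estimator may differ):

```python
import torch

def entropy_regularized_token_loss(logits, actions, advantages, alpha=0.01):
    """Token-level policy-gradient loss with an entropy bonus.

    logits:     (T, V) per-token logits from the LLM policy
    actions:    (T,)   sampled token ids (long tensor)
    advantages: (T,)   per-token advantage estimates
    alpha:      entropy temperature encouraging exploration
    """
    dist = torch.distributions.Categorical(logits=logits)
    pg_loss = -(dist.log_prob(actions) * advantages).mean()  # REINFORCE-style term
    entropy = dist.entropy().mean()                          # average token entropy
    return pg_loss - alpha * entropy
```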
arXiv Detail & Related papers (2024-02-09T07:45:26Z) - Age of Semantics in Cooperative Communications: To Expedite Simulation Towards Real via Offline Reinforcement Learning [53.18060442931179]
We propose the age of semantics (AoS) for measuring the semantic freshness of status updates in a cooperative relay communication system.
We derive an online deep actor-critic (DAC) learning scheme under the on-policy temporal difference learning framework.
We then put forward a novel offline DAC scheme, which estimates the optimal control policy from a previously collected dataset.
arXiv Detail & Related papers (2022-09-19T11:55:28Z) - Occupancy Information Ratio: Infinite-Horizon, Information-Directed, Parameterized Policy Search [21.850348833971722]
We propose an information-directed objective for infinite-horizon reinforcement learning (RL) called the occupancy information ratio (OIR).
The OIR enjoys rich underlying structure and presents an objective to which scalable, model-free policy search methods naturally apply.
By leveraging connections between quasiconcave optimization and the linear-programming theory of Markov decision processes, we show that the OIR problem can be transformed and solved via concave programming methods when the underlying model is known.
arXiv Detail & Related papers (2022-01-21T18:40:03Z) - Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
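The reweighting idea reduces to weighted least squares in the linear case; the sketch below is our simplification of one Fitted Q-Iteration step, not VA-OPE verbatim:

```python
import numpy as np

def variance_weighted_fqi_step(phi, rewards, phi_next, w, sigma2, gamma=0.99, reg=1e-3):
    """One Fitted Q-Iteration step with variance-aware weights (linear Q).

    phi:      (n, d) features of logged (s, a) pairs
    rewards:  (n,)   observed rewards
    phi_next: (n, d) features of (s', pi(s')) under the target policy
    w:        (d,)   current linear Q-function weights
    sigma2:   (n,)   estimated variance of the value at each transition
    """
    targets = rewards + gamma * phi_next @ w       # Bellman targets
    weights = 1.0 / np.maximum(sigma2, 1e-8)       # downweight high-variance targets
    A = phi.T @ (weights[:, None] * phi) + reg * np.eye(phi.shape[1])
    b = phi.T @ (weights * targets)
    return np.linalg.solve(A, b)                   # weighted ridge regression
```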
arXiv Detail & Related papers (2021-06-22T17:58:46Z) - Policy Information Capacity: Information-Theoretic Measure for Task Complexity in Deep Reinforcement Learning [83.66080019570461]
We propose two environment-agnostic, algorithm-agnostic quantitative metrics for task difficulty.
We show that these metrics have higher correlations with normalized task solvability scores than a variety of alternatives.
These metrics can also be used for fast and compute-efficient optimizations of key design parameters.
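To our understanding, one of these metrics, policy information capacity, is the mutual information between randomly sampled policy parameters and episodic returns; a crude histogram estimator under that assumption (names ours):

```python
import numpy as np

def policy_information_capacity(returns_per_policy, bins=10):
    """Estimate I(Theta; R): mutual information between sampled policy
    parameters and episodic returns, via shared-bin histograms.

    returns_per_policy: (num_policies, num_episodes) array of returns,
    one row per sampled parameter vector Theta.
    """
    R = np.asarray(returns_per_policy, dtype=float)
    edges = np.histogram_bin_edges(R.ravel(), bins=bins)

    def entropy(x):
        p, _ = np.histogram(x, bins=edges)
        p = p[p > 0] / p.sum()
        return -(p * np.log(p)).sum()

    h_r = entropy(R.ravel())                            # H(R)
    h_r_given_theta = np.mean([entropy(r) for r in R])  # H(R | Theta)
    return h_r - h_r_given_theta
```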
arXiv Detail & Related papers (2021-03-23T17:49:50Z) - Privacy-Constrained Policies via Mutual Information Regularized Policy Gradients [54.98496284653234]
We consider the task of training a policy that maximizes reward while minimizing disclosure of certain sensitive state variables through the actions.
We solve this problem by introducing a regularizer based on the mutual information between the sensitive state and the actions.
We develop a model-based estimator for optimization of privacy-constrained policies.
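For discrete sensitive states and actions, the regularizer is the standard mutual information, computable in closed form from the policy's conditionals; a small sketch in our notation (the paper's model-based estimator is more involved):

```python
import numpy as np

def mutual_information_penalty(p_a_given_s, p_s):
    """I(S_sens; A) for a discrete sensitive state S and action A.

    p_a_given_s: (|S|, |A|) action distribution of the policy given S
    p_s:         (|S|,)     marginal distribution of the sensitive state
    Subtracting lambda * I(S; A) from the return discourages actions
    that leak the sensitive variable.
    """
    p_a = np.maximum(p_s @ p_a_given_s, 1e-12)               # marginal over actions
    ratio = np.where(p_a_given_s > 0, p_a_given_s / p_a, 1.0)
    return float(np.sum(p_s[:, None] * p_a_given_s * np.log(ratio)))
```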
arXiv Detail & Related papers (2020-12-30T03:22:35Z) - Adversarial Mutual Information for Text Generation [62.974883143784616]
We propose Adversarial Mutual Information (AMI), a text generation framework.
AMI is formulated as a novel saddle-point (min-max) optimization that aims to identify joint interactions between the source and target.
We show that AMI has the potential to yield a tighter lower bound on the maximum mutual information.
arXiv Detail & Related papers (2020-06-30T19:11:51Z)