A Unified Theory of Compositionality, Modularity, and Interpretability in Markov Decision Processes
- URL: http://arxiv.org/abs/2506.09499v1
- Date: Wed, 11 Jun 2025 08:21:22 GMT
- Title: A Unified Theory of Compositionality, Modularity, and Interpretability in Markov Decision Processes
- Authors: Thomas J. Ringstrom, Paul R. Schrater
- Abstract summary: We introduce Option Kernel Bellman Equations (OKBEs) for a new reward-free Markov Decision Process. OKBEs directly construct and optimize a predictive map called a state-time option kernel (STOK) to maximize the probability of completing a goal. We argue that reward-maximization is in conflict with the properties of compositionality, modularity, and interpretability.
- Score: 1.3044677039636754
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Option Kernel Bellman Equations (OKBEs) for a new reward-free Markov Decision Process. Rather than a value function, OKBEs directly construct and optimize a predictive map called a state-time option kernel (STOK) to maximize the probability of completing a goal while avoiding constraint violations. STOKs are compositional, modular, and interpretable initiation-to-termination transition kernels for policies in the Options Framework of Reinforcement Learning. This means: 1) STOKs can be composed using Chapman-Kolmogorov equations to make spatiotemporal predictions for multiple policies over long horizons, 2) high-dimensional STOKs can be represented and computed efficiently in a factorized and reconfigurable form, and 3) STOKs record the probabilities of semantically interpretable goal-success and constraint-violation events, needed for formal verification. Given a high-dimensional state-transition model for an intractable planning problem, we can decompose it with local STOKs and goal-conditioned policies that are aggregated into a factorized goal kernel, making it possible to forward-plan at the level of goals in high-dimensions to solve the problem. These properties lead to highly flexible agents that can rapidly synthesize meta-policies, reuse planning representations across many tasks, and justify goals using empowerment, an intrinsic motivation function. We argue that reward-maximization is in conflict with the properties of compositionality, modularity, and interpretability. Alternatively, OKBEs facilitate these properties to support verifiable long-horizon planning and intrinsic motivation that scales to dynamic high-dimensional world-models.
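As an illustration of property (1), option kernels compose by a Chapman-Kolmogorov sum over the intermediate state and arrival time. A minimal sketch in generic notation (the paper's own notation may differ):

```latex
% K_g((s', t') | (s, t)): probability that option g, initiated in state s at time t,
% terminates in state s' at time t'. Composing g_1 then g_2 marginalizes the
% intermediate state-time pair:
K_{g_2 \circ g_1}\big((s'', t'') \mid (s, t)\big)
  = \sum_{s'} \sum_{t'} K_{g_2}\big((s'', t'') \mid (s', t')\big)\,
                        K_{g_1}\big((s', t') \mid (s, t)\big)
```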
Related papers
- Recursive Reward Aggregation [51.552609126905885]
We propose an alternative approach for flexible behavior alignment that eliminates the need to modify the reward function. By introducing an algebraic perspective on Markov decision processes (MDPs), we show that the Bellman equations naturally emerge from the generation and aggregation of rewards. Our approach applies to both deterministic and stochastic settings and seamlessly integrates with value-based and actor-critic algorithms.
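A minimal sketch of the underlying idea in generic notation (a hypothetical aggregation operator stands in for the paper's algebraic construction): the Bellman backup can be read as generating a per-step reward and then folding it into the continuation value with a pluggable operator rather than a hard-wired discounted sum.

```python
# Sketch only: a Bellman-style backup with a pluggable reward-aggregation operator.
# Shapes: V[s], R[s, a], P[s, a, s']. The operator `agg` is a stand-in assumption.
from typing import Callable
import numpy as np

def backup(V: np.ndarray, R: np.ndarray, P: np.ndarray,
           agg: Callable[[np.ndarray, np.ndarray], np.ndarray]) -> np.ndarray:
    EV = P @ V            # EV[s, a] = sum_s' P[s, a, s'] * V[s']
    Q = agg(R, EV)        # fold the generated reward into the continuation value
    return Q.max(axis=1)  # greedy improvement over actions

discounted = lambda r, ev: r + 0.95 * ev      # the usual discounted-sum aggregation
worst_case = lambda r, ev: np.minimum(r, ev)  # e.g. track the worst value along the way
```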
arXiv Detail & Related papers (2025-07-11T12:37:20Z) - Decomposability-Guaranteed Cooperative Coevolution for Large-Scale Itinerary Planning [6.565536870180592]
Large-scale itinerary planning is a variant of the traveling salesman problem. This paper analyzes the decomposability of large-scale itinerary planning. We propose a novel multi-objective cooperative coevolutionary algorithm for large-scale itinerary planning.
arXiv Detail & Related papers (2025-06-06T14:31:57Z) - Reinforced Reasoning for Embodied Planning [18.40186665383579]
Embodied planning requires agents to make coherent multi-step decisions based on dynamic visual observations and natural language goals. We introduce a reinforcement fine-tuning framework that brings R1-style reasoning enhancement into embodied planning.
arXiv Detail & Related papers (2025-05-28T07:21:37Z) - Unlocking Reasoning Potential in Large Language Models by Scaling Code-form Planning [94.76546523689113]
We introduce CodePlan, a framework that generates and follows code-form plans -- pseudocode that outlines high-level, structured reasoning processes.
CodePlan effectively captures the rich semantics and control flows inherent to sophisticated reasoning tasks.
It achieves a 25.1% relative improvement compared with directly generating responses.
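As a rough illustration of what a code-form plan can look like (an invented toy example, not drawn from the paper), the model first emits pseudocode that structures the reasoning and then follows it to produce the answer:

```python
# Invented example of a "code-form plan" for a simple multi-step word problem.
# All names are illustrative.
def plan(initial_apples: int, apples_given_away: int, price_per_apple: float) -> float:
    remaining = initial_apples - apples_given_away  # step 1: apples left
    revenue = remaining * price_per_apple           # step 2: value of what remains
    return revenue                                  # step 3: final quantity to report

print(plan(initial_apples=10, apples_given_away=3, price_per_apple=2.0))  # 14.0
```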
arXiv Detail & Related papers (2024-09-19T04:13:58Z) - Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level.
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
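A minimal sketch of a generic entropy-augmented, token-level policy-gradient loss (assuming PyTorch; this shows the general idea rather than ETPO's exact objective):

```python
import torch
import torch.nn.functional as F

def token_level_loss(logits: torch.Tensor,      # [T, vocab] per-token logits
                     actions: torch.Tensor,     # [T] sampled token ids
                     advantages: torch.Tensor,  # [T] per-token advantage estimates
                     alpha: float = 0.01) -> torch.Tensor:
    log_probs = F.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # log pi(a_t | s_t)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)              # per-token entropy
    # maximize advantage-weighted log-probability plus an entropy bonus at every token
    return -(chosen * advantages + alpha * entropy).mean()
```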
arXiv Detail & Related papers (2024-02-09T07:45:26Z) - Probabilistic Reach-Avoid for Bayesian Neural Networks [71.67052234622781]
We show that an optimal synthesis algorithm can provide more than a four-fold increase in the number of certifiable states, and more than a three-fold increase in the average guaranteed reach-avoid probability.
arXiv Detail & Related papers (2023-10-03T10:52:21Z) - When Demonstrations Meet Generative World Models: A Maximum Likelihood
Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive domains such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z) - Opportunistic Qualitative Planning in Stochastic Systems with Incomplete
Preferences over Reachability Objectives [24.11353445650682]
Preferences play a key role in determining what goals/constraints to satisfy when not all constraints can be satisfied simultaneously.
We present an algorithm to synthesize the SPI and SASI strategies that induce multiple sequential improvements.
arXiv Detail & Related papers (2022-10-04T19:53:08Z) - Modular Deep Reinforcement Learning for Continuous Motion Planning with
Temporal Logic [59.94347858883343]
This paper investigates the motion planning of autonomous dynamical systems modeled by Markov decision processes (MDPs).
The novelty is to design an embedded product MDP (EP-MDP) between a limit-deterministic generalized Büchi automaton (LDGBA) and the MDP.
The proposed LDGBA-based reward shaping and discounting schemes for the model-free reinforcement learning (RL) only depend on the EP-MDP states.
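For context, a minimal sketch of the generic MDP-automaton product construction that EP-MDP-style approaches build on (the data layout and names here are illustrative assumptions, not the paper's):

```python
# Generic product of an MDP with a deterministic automaton over state labels.
# P: {(s, a): {s_next: prob}}  -- MDP transition kernel
# Q: iterable of automaton states
# delta(q, s_next) -> q_next   -- automaton step on the label of the successor state
def product_transitions(P, Q, delta):
    prod = {}
    for (s, a), successors in P.items():
        for q in Q:
            prod[((s, q), a)] = {
                (s_next, delta(q, s_next)): prob
                for s_next, prob in successors.items()
            }
    return prod
```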
arXiv Detail & Related papers (2021-02-24T01:11:25Z) - Goal Kernel Planning: Linearly-Solvable Non-Markovian Policies for Logical Tasks with Goal-Conditioned Options [54.40780660868349]
We introduce a compositional framework called Linearly-Solvable Goal Kernel Dynamic Programming (LS-GKDP). LS-GKDP combines the Linearly-Solvable Markov Decision Process (LMDP) formalism with the Options Framework of Reinforcement Learning. We show how an LMDP with a goal kernel enables the efficient optimization of meta-policies in a lower-dimensional subspace defined by the task grounding.
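For context, the standard linearly-solvable MDP recursion that LMDP-based methods build on (Todorov's first-exit formulation in generic notation; the exact objective used by LS-GKDP may differ):

```latex
% Desirability z(s) = exp(-v(s)) for state cost q(s) and passive dynamics p(s' | s).
% Exponentiating the Bellman equation makes it linear in z:
z(s) = e^{-q(s)} \sum_{s'} p(s' \mid s)\, z(s')
```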
arXiv Detail & Related papers (2020-07-06T05:13:20Z) - Formal Controller Synthesis for Continuous-Space MDPs via Model-Free
Reinforcement Learning [1.0928470926399565]
A novel reinforcement learning scheme to synthesize policies for continuous-space Markov decision processes (MDPs) is proposed.
A key contribution of the paper is to leverage the classical convergence results for reinforcement learning on finite MDPs.
We present a novel potential-based reward shaping technique that produces dense rewards to speed up learning.
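In its classical form (Ng, Harada, and Russell, 1999), potential-based shaping adds the discounted change of a state potential to the reward, which preserves the optimal policy; the specific potential used in this paper may differ:

```latex
% Shaped reward with potential function \Phi over states and discount \gamma:
r'(s, a, s') = r(s, a, s') + \gamma\,\Phi(s') - \Phi(s)
```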
arXiv Detail & Related papers (2020-03-02T08:29:36Z)