Pushdown Reward Machines for Reinforcement Learning
- URL: http://arxiv.org/abs/2508.06894v1
- Date: Sat, 09 Aug 2025 08:59:09 GMT
- Title: Pushdown Reward Machines for Reinforcement Learning
- Authors: Giovanni Varricchione, Toryn Q. Klassen, Natasha Alechina, Mehdi Dastani, Brian Logan, Sheila A. McIlraith
- Abstract summary: We present pushdown reward machines (pdRMs), an extension of reward machines based on deterministic pushdown automata. pdRMs can recognize and reward temporally extended behaviours representable in deterministic context-free languages. We show how agents can be trained to perform tasks representable in deterministic context-free languages using pdRMs.
- Score: 17.63980224819404
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reward machines (RMs) are automata structures that encode (non-Markovian) reward functions for reinforcement learning (RL). RMs can reward any behaviour representable in regular languages and, when paired with RL algorithms that exploit RM structure, have been shown to significantly improve sample efficiency in many domains. In this work, we present pushdown reward machines (pdRMs), an extension of reward machines based on deterministic pushdown automata. pdRMs can recognize and reward temporally extended behaviours representable in deterministic context-free languages, making them more expressive than reward machines. We introduce two variants of pdRM-based policies, one which has access to the entire stack of the pdRM, and one which can only access the top $k$ symbols (for a given constant $k$) of the stack. We propose a procedure to check when the two kinds of policies (for a given environment, pdRM, and constant $k$) achieve the same optimal expected reward. We then provide theoretical results establishing the expressive power of pdRMs, and space complexity results about the proposed learning problems. Finally, we provide experimental results showing how agents can be trained to perform tasks representable in deterministic context-free languages using pdRMs.
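To make the pdRM idea concrete, below is a minimal, hypothetical sketch (not code from the paper): a deterministic pushdown automaton whose transitions emit rewards, plus the top-$k$ stack observation used by the restricted policy variant. All class and function names here are assumptions for illustration.

```python
# A minimal sketch of a pushdown reward machine (pdRM), assuming a simple
# interface: transitions are keyed by (state, event, stack top) and yield a
# next state, a stack operation, and a scalar reward. The names
# PushdownRewardMachine, step, and top_k are illustrative, not the paper's API.

class PushdownRewardMachine:
    def __init__(self, initial_state, transitions, bottom="Z"):
        # transitions: (state, event, stack_top) ->
        #   (next_state, stack_op, reward), where stack_op is
        #   ("push", sym), ("pop",), or ("noop",)
        self.state = initial_state
        self.transitions = transitions
        self.stack = [bottom]

    def step(self, event):
        """Consume one abstract event; return the reward it emits."""
        key = (self.state, event, self.stack[-1])
        next_state, stack_op, reward = self.transitions[key]
        if stack_op[0] == "push":
            self.stack.append(stack_op[1])
        elif stack_op[0] == "pop":
            self.stack.pop()
        self.state = next_state
        return reward

    def top_k(self, k):
        """Top-k stack observation for the restricted policy variant."""
        return tuple(self.stack[-k:])


# Example: reward 1 when every 'a' event has been matched by a later 'b'
# event, a deterministic context-free (non-regular) property. A single
# machine state 'q' suffices; the stack counts unmatched 'a's.
T = {
    ("q", "a", "Z"): ("q", ("push", "A"), 0.0),
    ("q", "a", "A"): ("q", ("push", "A"), 0.0),
    ("q", "b", "A"): ("q", ("pop",), 0.0),
    ("q", "end", "Z"): ("q", ("noop",), 1.0),  # balanced: reward
}
pdrm = PushdownRewardMachine("q", T)
for e in ["a", "a", "b", "b", "end"]:
    r = pdrm.step(e)
print(r, pdrm.top_k(2))  # 1.0 (balanced), top-2 view of the stack
```

A policy conditioned on `pdrm.state` together with `pdrm.top_k(k)` corresponds to the restricted variant in the abstract; conditioning on the entire `pdrm.stack` corresponds to the full-stack variant.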
Related papers
- RLAR: An Agentic Reward System for Multi-task Reinforcement Learning on Large Language Models [86.61108562387993]
RLAR (Reinforcement Learning from Agent Rewards) is an agent-driven framework that dynamically assigns tailored reward functions to individual queries. We show that RLAR yields consistent performance gains ranging from 10 to 60 across mathematics, coding, translation, and dialogue tasks.
arXiv Detail & Related papers (2026-02-28T16:14:43Z) - About Time: Model-free Reinforcement Learning with Timed Reward Machines [13.525747021139084]
Timed reward machines (TRMs) are an extension of reward machines that incorporate timing constraints into the reward structure. We study model-free RL frameworks for learning optimal policies with TRMs under digital and real-time semantics. Our algorithms integrate the TRM into learning via abstractions of timed automata, and employ counterfactual imaginings that exploit the structure of the TRM to improve the search.
arXiv Detail & Related papers (2025-12-19T14:39:03Z) - Physics-Informed Reward Machines [4.7962647777554634]
Reward machines (RMs) provide a structured way to specify non-Markovian rewards in reinforcement learning (RL). We introduce physics-informed reward machines (pRMs), a symbolic machine designed to express complex learning objectives and reward structures for RL agents. We present RL algorithms capable of exploiting pRMs via counterfactual experiences and reward shaping.
arXiv Detail & Related papers (2025-08-14T18:46:54Z) - Discriminative Policy Optimization for Token-Level Reward Models [55.98642069903191]
Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs). Q-RM explicitly learns token-level Q-functions from preference data without relying on fine-grained annotations. Reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12 times faster than ORM on GSM8K and 11 times faster than step-level PRM on MATH.
arXiv Detail & Related papers (2025-05-29T11:40:34Z) - Learning to Reason without External Rewards [100.27210579418562]
Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal.
arXiv Detail & Related papers (2025-05-26T07:01:06Z) - Reward Reasoning Model [104.39256985858428]
Reward Reasoning Models (RRMs) are designed to execute a deliberate reasoning process before generating final rewards. We implement a reinforcement learning framework that fosters self-evolved reward reasoning capabilities. Notably, RRMs can adaptively exploit test-time compute to further improve reward accuracy.
arXiv Detail & Related papers (2025-05-20T17:58:03Z) - FORM: Learning Expressive and Transferable First-Order Logic Reward Machines [48.36822060760614]
Reward machines (RMs) are an effective approach for addressing non-Markovian rewards in reinforcement learning. We propose First-Order Reward Machines ($\texttt{FORM}$s), which use first-order logic to label edges. We show that $\texttt{FORM}$s can be effectively learnt for tasks where traditional RM learning approaches fail.
arXiv Detail & Related papers (2024-12-31T09:31:15Z) - Hierarchies of Reward Machines [75.55324974788475]
Reward machines (RMs) are a recent formalism for representing the reward function of a reinforcement learning task through a finite-state machine.
We propose a formalism for further abstracting the subtask structure by endowing an RM with the ability to call other RMs.
arXiv Detail & Related papers (2022-05-31T12:39:24Z) - Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning [22.242379207077217]
We show how to expose the reward function's code to the RL agent so that it can exploit the function's internal structure to learn optimal policies.
First, we propose reward machines, a type of finite state machine that supports the specification of reward functions.
We then describe different methodologies to exploit this structure to support learning, including automated reward shaping, task decomposition, and counterfactual reasoning with off-policy learning (see the sketch after this list).
arXiv Detail & Related papers (2020-10-06T00:10:16Z)
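For contrast with the pushdown case above, here is a hedged sketch of the counterfactual-reasoning idea described in the last paper: because the RM's transition and reward depend only on the labelled event, a single environment step can be replayed from every RM state, yielding off-policy experiences over the product state space. The helper names (rm_delta, rm_reward, labeller) are assumptions for illustration, not the paper's API.

```python
# A hedged sketch of counterfactual experience generation with a reward
# machine: one environment transition (s, a, s') produces one synthetic
# experience per RM state u, since the RM step depends only on the event
# L(s, a, s'), not on how u was reached.

def counterfactual_experiences(s, a, s_next, done, rm_states,
                               rm_delta, rm_reward, labeller):
    """Replay one transition from every RM state and return a batch of
    experiences over the product state space (env state, RM state)."""
    event = labeller(s, a, s_next)      # abstract event for this step
    batch = []
    for u in rm_states:
        u_next = rm_delta(u, event)     # RM transition on the event
        r = rm_reward(u, event)         # reward attached to that edge
        batch.append(((s, u), a, r, (s_next, u_next), done))
    return batch
```

These synthetic experiences can then feed any off-policy learner, e.g. Q-learning over pairs (s, u), which is what lets RM-aware algorithms improve sample efficiency relative to treating the reward as a black box.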