About Time: Model-free Reinforcement Learning with Timed Reward Machines
- URL: http://arxiv.org/abs/2512.17637v1
- Date: Fri, 19 Dec 2025 14:39:03 GMT
- Title: About Time: Model-free Reinforcement Learning with Timed Reward Machines
- Authors: Anirban Majumdar, Ritam Raha, Rajarshi Roy, David Parker, Marta Kwiatkowska
- Abstract summary: Timed reward machines (TRMs) are an extension of reward machines that incorporate timing constraints into the reward structure. We study model-free RL frameworks for learning optimal policies with TRMs under digital and real-time semantics. Our algorithms integrate the TRM into learning via abstractions of timed automata, and employ counterfactual-imagining heuristics that exploit the structure of the TRM to improve the search.
- Score: 13.525747021139084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reward specification plays a central role in reinforcement learning (RL), guiding the agent's behavior. To express non-Markovian rewards, formalisms such as reward machines have been introduced to capture dependencies on histories. However, traditional reward machines lack the ability to model precise timing constraints, limiting their use in time-sensitive applications. In this paper, we propose timed reward machines (TRMs), which are an extension of reward machines that incorporate timing constraints into the reward structure. TRMs enable more expressive specifications with tunable reward logic, for example, imposing costs for delays and granting rewards for timely actions. We study model-free RL frameworks (i.e., tabular Q-learning) for learning optimal policies with TRMs under digital and real-time semantics. Our algorithms integrate the TRM into learning via abstractions of timed automata, and employ counterfactual-imagining heuristics that exploit the structure of the TRM to improve the search. Experimentally, we demonstrate that our algorithm learns policies that achieve high rewards while satisfying the timing constraints specified by the TRM on popular RL benchmarks. Moreover, we conduct comparative studies of performance under different TRM semantics, along with ablations that highlight the benefits of counterfactual-imagining.
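The idea of a reward that depends on when an event happens can be illustrated with a minimal sketch. It assumes a single integer clock that advances by one per environment step (one simple reading of digital semantics) and merges all clock values above a cap, loosely in the spirit of timed-automata abstractions. The names `Edge`, `TimedRewardMachine`, `clock_cap`, and `run` are hypothetical and are not taken from the paper; this is not the authors' definition or algorithm.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Config = Tuple[str, int]  # (machine state, clock value)


@dataclass(frozen=True)
class Edge:
    label: str     # event observed in the environment
    lo: int        # guard: edge is enabled when lo <= clock <= hi
    hi: int
    reset: bool    # whether taking the edge resets the clock to 0
    reward: float  # reward emitted when the edge is taken
    target: str    # next machine state


class TimedRewardMachine:
    """Illustrative single-clock machine; an assumption, not the paper's formalism."""

    def __init__(self, initial: str, edges: Dict[str, List[Edge]], clock_cap: int):
        self.initial = initial
        self.edges = edges
        # Clock values above the cap are treated as indistinguishable,
        # which keeps the set of (state, clock) configurations finite.
        self.clock_cap = clock_cap

    def start(self) -> Config:
        return (self.initial, 0)

    def step(self, config: Config, label: str) -> Tuple[Config, float]:
        state, clock = config
        clock = min(clock + 1, self.clock_cap)  # one environment step = one time unit
        for edge in self.edges.get(state, []):
            if edge.label == label and edge.lo <= clock <= edge.hi:
                return (edge.target, 0 if edge.reset else clock), edge.reward
        return (state, clock), 0.0  # no enabled edge: stay put, no reward


def run(trm: TimedRewardMachine, labels: List[str]) -> Tuple[Config, float]:
    """Feed a sequence of event labels and accumulate the reward."""
    config, total = trm.start(), 0.0
    for label in labels:
        config, reward = trm.step(config, label)
        total += reward
    return config, total


# Reaching "goal" within 3 time units is rewarded; reaching it later is penalised.
trm = TimedRewardMachine(
    initial="u0",
    edges={"u0": [Edge("goal", 0, 3, True, 1.0, "done"),
                  Edge("goal", 4, 4, True, -1.0, "done")]},
    clock_cap=4,
)
on_time = run(trm, ["", "goal"])
too_late = run(trm, ["", "", "", "", "goal"])
```

Because the configuration `(state, clock)` is finite under the cap, a tabular learner could in principle key its Q-table on the environment state together with this configuration; how the paper actually builds its abstraction and counterfactual updates is not shown here.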
Related papers
- OpenReward: Learning to Reward Long-form Agentic Tasks via Reinforcement Learning [41.49024599460379]
Reward models (RMs) have become essential for aligning large language models (LLMs). We introduce OpenRM, a tool-augmented long-form reward model that judges open-ended responses by invoking external tools to gather relevant evidence. Experiments on three newly-collected datasets and two widely-used benchmarks demonstrate that OpenRM substantially outperforms existing reward modeling approaches.
arXiv Detail & Related papers (2025-10-28T17:02:46Z)
- Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS [62.22644307952087]
We introduce AIRL-S, the first natural unification of RL-based and search-based TTS. We leverage adversarial inverse reinforcement learning (AIRL) combined with group relative policy optimization (GRPO) to learn a dense, dynamic PRM directly from correct reasoning traces. Our results show that our unified approach improves performance by 9% on average over the base model, matching GPT-4o.
arXiv Detail & Related papers (2025-08-19T23:41:15Z)
- Physics-Informed Reward Machines [4.7962647777554634]
Reward machines (RMs) provide a structured way to specify non-Markovian rewards in reinforcement learning (RL). We introduce physics-informed reward machines (pRMs), a symbolic machine designed to express complex learning objectives and reward structures for RL agents. We present RL algorithms capable of exploiting pRMs via counterfactual experiences and reward shaping.
arXiv Detail & Related papers (2025-08-14T18:46:54Z)
- Pushdown Reward Machines for Reinforcement Learning [17.63980224819404]
We present pushdown reward machines (pdRMs), an extension of reward machines based on deterministic pushdown automata. pdRMs can recognize and reward temporally extended behaviours representable in deterministic context-free languages. We show how agents can be trained to perform tasks representable in deterministic context-free languages using pdRMs.
arXiv Detail & Related papers (2025-08-09T08:59:09Z)
- Good Learners Think Their Thinking: Generative PRM Makes Large Reasoning Model More Efficient Math Learner [31.033131727230277]
Large reasoning models (LRMs) have recently shown promise in solving complex math problems when optimized with Reinforcement Learning (RL). We propose a novel intrinsic signal-driven generative process evaluation mechanism operating at the thought level to address major bottlenecks in RL-based training. Experiments on 1.5B and 7B parameter LRMs demonstrate that our method achieves higher problem-solving accuracy with significantly fewer training samples than outcome-only reward baselines.
arXiv Detail & Related papers (2025-07-31T07:54:58Z)
- Discriminative Policy Optimization for Token-Level Reward Models [55.98642069903191]
Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs). Q-RM explicitly learns token-level Q-functions from preference data without relying on fine-grained annotations. Reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12 times faster than ORM on GSM8K and 11 times faster than step-level PRM on MATH.
arXiv Detail & Related papers (2025-05-29T11:40:34Z)
- Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models [50.4652276723694]
Think-RM generates flexible, self-guided reasoning traces that support advanced capabilities. Think-RM achieves state-of-the-art results on RM-Bench, outperforming both BT RM and vertically scaled GenRM by 8%.
arXiv Detail & Related papers (2025-05-22T05:56:11Z)
- Reward Reasoning Model [104.39256985858428]
Reward Reasoning Models (RRMs) are designed to execute a deliberate reasoning process before generating final rewards. We implement a reinforcement learning framework that fosters self-evolved reward reasoning capabilities. Notably, RRMs can adaptively exploit test-time compute to further improve reward accuracy.
arXiv Detail & Related papers (2025-05-20T17:58:03Z)
- RM-R1: Reward Modeling as Reasoning [81.50471199906738]
Reasoning Reward Models (ReasRMs) formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. Our models achieve state-of-the-art performance across three reward model benchmarks on average.
arXiv Detail & Related papers (2025-05-05T06:11:12Z)
- Hierarchies of Reward Machines [75.55324974788475]
Reward machines (RMs) are a recent formalism for representing the reward function of a reinforcement learning task through a finite-state machine.
We propose a formalism for further abstracting the subtask structure by endowing an RM with the ability to call other RMs.
arXiv Detail & Related papers (2022-05-31T12:39:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.