Expressive Reward Synthesis with the Runtime Monitoring Language
- URL: http://arxiv.org/abs/2510.16185v2
- Date: Tue, 21 Oct 2025 10:04:30 GMT
- Title: Expressive Reward Synthesis with the Runtime Monitoring Language
- Authors: Daniel Donnelly, Angelo Ferrando, Francesco Belardinelli
- Abstract summary: A key challenge in reinforcement learning (RL) is reward (mis)specification, whereby imprecisely defined reward functions can result in unintended, possibly harmful, behaviours. Reward Machines address this issue by representing reward functions as finite state automata, enabling the specification of structured, non-Markovian reward functions. We build on the Runtime Monitoring Language (RML) to develop a novel class of language-based Reward Machines.
- Score: 9.817136453608365
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A key challenge in reinforcement learning (RL) is reward (mis)specification, whereby imprecisely defined reward functions can result in unintended, possibly harmful, behaviours. Indeed, reward functions in RL are typically treated as black-box mappings from state-action pairs to scalar values. While effective in many settings, this approach provides no information about why rewards are given, which can hinder learning and interpretability. Reward Machines address this issue by representing reward functions as finite state automata, enabling the specification of structured, non-Markovian reward functions. However, their expressivity is typically bounded by regular languages, leaving them unable to capture more complex behaviours such as counting or parametrised conditions. In this work, we build on the Runtime Monitoring Language (RML) to develop a novel class of language-based Reward Machines. By leveraging the built-in memory of RML, our approach can specify reward functions for non-regular, non-Markovian tasks. We demonstrate the expressiveness of our approach through experiments, highlighting additional advantages in flexible event-handling and task specification over existing Reward Machine-based methods.
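To make the expressivity gap concrete: classical Reward Machines recognise only regular event languages, while RML-style built-in memory admits counting. The sketch below is illustrative only (the class name and event alphabet are invented, not the authors' API); it rewards the non-regular trace language openⁿ closeⁿ, which no finite-state reward machine can capture.

```python
# Illustrative sketch of a reward machine with memory, in the spirit of
# RML-based specifications. A single counter suffices to reward the
# non-regular pattern open^n close^n, which finite automata cannot express.

class CountingRewardMachine:
    """Rewards traces of the non-regular form open^n close^n (n >= 1)."""

    def __init__(self):
        self.counter = 0      # RML-style memory: pending 'open' events
        self.failed = False   # sink state for traces that violate the pattern

    def step(self, event: str) -> float:
        """Consume one event and return the (non-Markovian) reward."""
        if self.failed:
            return 0.0
        if event == "open":
            self.counter += 1
            return 0.0
        if event == "close":
            self.counter -= 1
            if self.counter < 0:       # more closes than opens: violation
                self.failed = True
                return 0.0
            if self.counter == 0:      # every open matched: reward the agent
                return 1.0
            return 0.0
        self.failed = True             # unexpected event: violation
        return 0.0


machine = CountingRewardMachine()
trace = ["open", "open", "close", "close"]   # open^2 close^2
print([machine.step(e) for e in trace])      # [0.0, 0.0, 0.0, 1.0]
```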
Related papers
- RLAR: An Agentic Reward System for Multi-task Reinforcement Learning on Large Language Models [86.61108562387993]
RLAR (Reinforcement Learning from Agent Rewards) is an agent-driven framework that dynamically assigns tailored reward functions to individual queries.
We show that RLAR yields consistent performance gains ranging from 10 to 60 across mathematics, coding, translation, and dialogue tasks.
arXiv Detail & Related papers (2026-02-28T16:14:43Z)
- LinguaFluid: Language Guided Fluid Control via Semantic Rewards in Reinforcement Learning [0.7864304771129751]
We introduce a semantically aligned reinforcement learning method where rewards are computed by aligning the current state with a target semantic instruction.
We show that semantic reward can guide learning to achieve competitive control behavior, even in the absence of hand-crafted reward functions.
This framework opens new horizons for aligning agent behavior with natural language goals and lays the groundwork for a more seamless integration of larger language models.
arXiv Detail & Related papers (2025-08-08T03:23:56Z)
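The LinguaFluid entry above computes rewards by aligning the current state with a semantic instruction. A minimal sketch of that idea follows; the `embed` function is a deliberately crude bag-of-words stand-in (the paper presumably uses a learned text encoder), and all names are illustrative.

```python
# Hedged sketch of a semantic reward: score a state description against a
# natural-language instruction via cosine similarity of their embeddings.

import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (stand-in for a learned encoder)."""
    return Counter(text.lower().split())

def semantic_reward(state_description: str, instruction: str) -> float:
    """Cosine similarity between state and instruction embeddings."""
    s, g = embed(state_description), embed(instruction)
    dot = sum(s[w] * g[w] for w in s)
    norm = math.sqrt(sum(v * v for v in s.values())) * \
           math.sqrt(sum(v * v for v in g.values()))
    return dot / norm if norm else 0.0

# An RL loop would use this as a dense reward for the current state.
print(semantic_reward("fluid flows into the left container",
                      "move the fluid into the left container"))  # ~0.82
```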
- Recursive Reward Aggregation [60.51668865089082]
We propose an alternative approach for flexible behavior alignment that eliminates the need to modify the reward function.
By introducing an algebraic perspective on Markov decision processes (MDPs), we show that the Bellman equations naturally emerge from the generation and aggregation of rewards.
Our approach applies to both deterministic and stochastic settings and seamlessly integrates with value-based and actor-critic algorithms.
arXiv Detail & Related papers (2025-07-11T12:37:20Z)
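The Recursive Reward Aggregation entry argues that the Bellman equations arise from generating and aggregating rewards, so combiners other than the discounted sum fit the same recursion. A toy sketch under that reading (not the paper's actual formulation; the MDP and names are invented):

```python
# Value iteration with a pluggable reward combiner: the usual discounted sum
# is one choice of how to fold the immediate reward into the successor value;
# swapping it changes the objective (e.g. maximize the best single reward).

GAMMA = 0.9

# state -> action -> (reward, next_state); 'end' is absorbing with value 0
MDP = {
    "s0": {"a": (1.0, "s1"), "b": (0.0, "s2")},
    "s1": {"a": (0.0, "end")},
    "s2": {"a": (5.0, "end")},
}

def discounted_sum(r, v_next):
    return r + GAMMA * v_next

def best_single_reward(r, v_next):   # "max" aggregation instead of a sum
    return max(r, v_next)

def value_iteration(combine, sweeps=50):
    v = dict.fromkeys(list(MDP) + ["end"], 0.0)
    for _ in range(sweeps):
        for s, actions in MDP.items():
            v[s] = max(combine(r, v[ns]) for r, ns in actions.values())
    return v

print(value_iteration(discounted_sum))       # s0: 4.5 via action b
print(value_iteration(best_single_reward))   # s0: 5.0 via action b
```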
- Learning to Reason without External Rewards [100.27210579418562]
Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision.
We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data.
We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal.
arXiv Detail & Related papers (2025-05-26T07:01:06Z)
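Intuitor's reward is the model's own confidence. A minimal sketch follows; the exact definition of self-certainty used here (mean KL divergence from the uniform distribution to each next-token distribution) is my assumption from the summary, and the inputs are toy distributions rather than real model outputs.

```python
# Hedged sketch of "self-certainty" as an intrinsic reward: peaked
# (confident) next-token distributions score high, flat ones score zero.
# The KL-from-uniform formalisation is an assumption, not verified.

import math

def kl_uniform_to(p):
    """KL(U || p) for a next-token distribution p over the vocabulary."""
    v = len(p)
    return sum((1.0 / v) * math.log((1.0 / v) / q) for q in p)

def self_certainty(token_distributions):
    """Average KL(U || p_i) over the generated tokens' distributions."""
    return sum(kl_uniform_to(p) for p in token_distributions) / len(token_distributions)

confident = [[0.97, 0.01, 0.01, 0.01]] * 3   # peaked: high self-certainty
uncertain = [[0.25, 0.25, 0.25, 0.25]] * 3   # flat: zero self-certainty
print(self_certainty(confident))   # ~2.08
print(self_certainty(uncertain))   # 0.0
```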
- Average Reward Reinforcement Learning for Omega-Regular and Mean-Payoff Objectives [9.657038158333139]
We present the first model-free reinforcement learning framework that translates absolute liveness specifications to average-reward objectives.
We also introduce a reward structure for lexicographic multi-objective optimization.
Empirical results show our average-reward approach in the continuing setting outperforms discount-based methods across benchmarks.
arXiv Detail & Related papers (2025-05-21T16:06:51Z)
- Countering Reward Over-optimization in LLM with Demonstration-Guided Reinforcement Learning [49.87923965553233]
Reinforcement Learning can lead to reward over-optimization (ROO) in large language models.
We introduce Reward Calibration from Demonstration (RCfD) to recalibrate the reward objective.
We show that RCfD achieves comparable performance to carefully tuned baselines while mitigating ROO.
arXiv Detail & Related papers (2024-04-30T09:57:21Z)
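One reading of the RCfD entry above is that, instead of maximizing the learned reward (which invites over-optimization), the policy is penalized for straying from the reward level of a human demonstration. The sketch below follows that reading; it is an assumption on my part, not the paper's verified objective, and the toy reward model is invented.

```python
# Hedged sketch of demonstration-guided recalibration: reward the policy for
# matching the demonstration's score rather than for maximizing the score.

def recalibrated_objective(reward_model, prompt, sample, demonstration):
    """Higher is better; peaks when the sample scores like the demonstration."""
    gap = reward_model(prompt, sample) - reward_model(prompt, demonstration)
    return -abs(gap)

def toy_rm(prompt, completion):
    """Toy reward model: counts words (stand-in for a learned scalar reward)."""
    return float(len(completion.split()))

print(recalibrated_objective(toy_rm, "q", "a short answer", "a short reply"))            # 0.0: calibrated
print(recalibrated_objective(toy_rm, "q", "a very very very long answer", "a short reply"))  # -3.0: over-optimized
```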
- Automated Feature Selection for Inverse Reinforcement Learning [7.278033100480175]
Inverse reinforcement learning (IRL) is an imitation learning approach to learning reward functions from expert demonstrations.
We propose a method that employs basis functions to form a candidate set of features.
We demonstrate the approach's effectiveness by recovering reward functions that capture expert policies.
arXiv Detail & Related papers (2024-03-22T10:05:21Z)
- Language Reward Modulation for Pretraining Reinforcement Learning [61.76572261146311]
We propose leveraging the capabilities of learned reward functions (LRFs) as a pretraining signal for reinforcement learning.
Our VLM pretraining approach, which is a departure from previous attempts to use LRFs, can warmstart sample-efficient learning on robot manipulation tasks.
arXiv Detail & Related papers (2023-08-23T17:37:51Z)
- Contrastive Example-Based Control [163.6482792040079]
We propose a method for offline, example-based control that learns an implicit model of multi-step transitions, rather than a reward function.
Across a range of state-based and image-based offline control tasks, our method outperforms baselines that use learned reward functions.
arXiv Detail & Related papers (2023-07-24T19:43:22Z)
- Model-Free Reinforcement Learning for Symbolic Automata-encoded Objectives [0.0]
Reinforcement learning (RL) is a popular approach for robotic path planning in uncertain environments.
Poorly designed rewards can lead to policies that achieve maximal reward yet fail to satisfy desired task objectives or are unsafe.
We propose using formal specifications in the form of symbolic automata.
arXiv Detail & Related papers (2022-02-04T21:54:36Z)
- Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning [22.242379207077217]
We show how exposing the reward function's code to the RL agent enables it to exploit the function's internal structure to learn optimal policies.
First, we propose reward machines, a type of finite state machine that supports the specification of reward functions.
We then describe different methodologies to exploit this structure to support learning, including automated reward shaping, task decomposition, and counterfactual reasoning with off-policy learning.
arXiv Detail & Related papers (2020-10-06T00:10:16Z)
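The "counterfactual reasoning with off-policy learning" mentioned in the entry above exploits the fact that the machine's transition function is known: one real environment step can be replayed from every machine state, yielding extra off-policy experiences for free. A minimal sketch, with an invented two-step "coffee delivery" machine; names are illustrative, not the authors' code.

```python
# Hedged sketch of counterfactual experience generation for reward machines.

# Toy reward machine: state -> event -> (next_state, reward)
RM = {
    "u0": {"coffee": ("u1", 0.0)},   # first, pick up the coffee
    "u1": {"office": ("u2", 1.0)},   # then deliver it to the office
    "u2": {},                        # terminal: task done
}

def rm_step(u, event):
    """Advance the machine on one labelled event (self-loop if undefined)."""
    return RM[u].get(event, (u, 0.0))

def counterfactual_experiences(s, a, s_next, event):
    """One env transition -> one (state, action, reward, next_state) per RM state."""
    exps = []
    for u in RM:
        u_next, r = rm_step(u, event)
        exps.append(((s, u), a, r, (s_next, u_next)))
    return exps

# One real step in which the agent reaches the office:
for exp in counterfactual_experiences("cell_3", "move_north", "office", "office"):
    print(exp)
# From u0 the event self-loops; from u1 it completes the task with reward 1.
```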