Lazy-MDPs: Towards Interpretable Reinforcement Learning by Learning When
to Act
- URL: http://arxiv.org/abs/2203.08542v1
- Date: Wed, 16 Mar 2022 11:06:25 GMT
- Title: Lazy-MDPs: Towards Interpretable Reinforcement Learning by Learning When
to Act
- Authors: Alexis Jacq, Johan Ferret, Olivier Pietquin, Matthieu Geist
- Abstract summary: We propose to augment the standard Markov Decision Process and make a new mode of action available: being lazy.
We study the theoretical properties of lazy-MDPs, expressing value functions and characterizing optimal solutions.
We deem those states and corresponding actions important since they explain the difference in performance between the default and the new, lazy policy.
- Score: 42.909535340099296
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Traditionally, Reinforcement Learning (RL) aims at deciding how to act
optimally for an artificial agent. We argue that deciding when to act is
equally important. As humans, we drift from default, instinctive or memorized
behaviors to focused, thought-out behaviors when required by the situation. To
enhance RL agents with this aptitude, we propose to augment the standard Markov
Decision Process and make a new mode of action available: being lazy, which
defers decision-making to a default policy. In addition, we penalize non-lazy
actions in order to encourage minimal effort and have agents focus on critical
decisions only. We name the resulting formalism lazy-MDPs. We study the
theoretical properties of lazy-MDPs, expressing value functions and
characterizing optimal solutions. Then we empirically demonstrate that policies
learned in lazy-MDPs generally come with a form of interpretability: by
construction, they show us the states where the agent takes control over the
default policy. We deem those states and corresponding actions important since
they explain the difference in performance between the default and the new,
lazy policy. With suboptimal policies as default (pretrained or random), we
observe that agents are able to get competitive performance in Atari games
while only taking control in a limited subset of states.
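As an illustration of the construction described above, here is a minimal sketch of a lazy-MDP written as a Gymnasium-style wrapper: the action space gains one extra "lazy" action that defers the decision to a default policy, and every non-lazy action incurs a small penalty. The wrapper class, the `default_policy` callable, and the `penalty` value are illustrative assumptions based on the abstract, not the authors' implementation.
```python
import gymnasium as gym


class LazyMDPWrapper(gym.Wrapper):
    """Sketch of a lazy-MDP: adds one extra 'lazy' action that defers to a
    default policy, and charges a penalty for every non-lazy action."""

    def __init__(self, env, default_policy, penalty=0.1):
        super().__init__(env)
        assert isinstance(env.action_space, gym.spaces.Discrete)
        self.default_policy = default_policy      # maps observation -> action (assumed callable)
        self.penalty = penalty                    # cost of taking control (illustrative value)
        self.lazy_action = env.action_space.n     # last action index means 'be lazy'
        self.action_space = gym.spaces.Discrete(env.action_space.n + 1)
        self._last_obs = None

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._last_obs = obs
        return obs, info

    def step(self, action):
        if action == self.lazy_action:
            # Defer to the default policy: no penalty is applied.
            env_action = self.default_policy(self._last_obs)
            extra_cost = 0.0
        else:
            # The agent takes control and pays for the effort.
            env_action = action
            extra_cost = self.penalty
        obs, reward, terminated, truncated, info = self.env.step(env_action)
        self._last_obs = obs
        info["took_control"] = action != self.lazy_action
        return obs, reward - extra_cost, terminated, truncated, info
```
With a pretrained or even random default policy plugged in (e.g. `LazyMDPWrapper(gym.make("CartPole-v1"), default_policy=lambda obs: 0)`), the states in which a trained agent picks a non-lazy action are exactly the states the abstract flags as important for interpretability.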
Related papers
- When Can Model-Free Reinforcement Learning be Enough for Thinking? [3.5253513747455303]
This paper builds a domain-independent understanding of when model-free RL will lead to "thinking" as a strategy for reward.
We show formally that thought actions are equivalent to the agent choosing to perform a step of policy improvement before continuing to act.
We then show that open-source LLMs satisfy the conditions that our theory predicts are necessary for model-free RL to produce thinking-like behavior.
arXiv Detail & Related papers (2025-06-20T16:23:46Z) - Deontically Constrained Policy Improvement in Reinforcement Learning Agents [0.0]
Markov Decision Processes (MDPs) are the most common model for decision making under uncertainty in the Machine Learning community.
An MDP captures non-determinism, probabilistic uncertainty, and an explicit model of action.
A Reinforcement Learning (RL) agent learns to act in an MDP by maximizing a utility function.
arXiv Detail & Related papers (2025-06-08T01:01:06Z) - FDPP: Fine-tune Diffusion Policy with Human Preference [57.44575105114056]
Fine-tuning Diffusion Policy with Human Preference learns a reward function through preference-based learning.
This reward is then used to fine-tune the pre-trained policy with reinforcement learning.
Experiments demonstrate that FDPP effectively customizes policy behavior without compromising performance.
arXiv Detail & Related papers (2025-01-14T17:15:27Z) - Tackling Decision Processes with Non-Cumulative Objectives using Reinforcement Learning [0.0]
We introduce a general mapping of non-cumulative Markov decision processes to standard MDPs.
This allows all techniques developed to find optimal policies for MDPs to be directly applied to the larger class of NCMDPs.
We show applications in a diverse set of tasks, including classical control, portfolio optimization in finance, and discrete optimization problems.
arXiv Detail & Related papers (2024-05-22T13:01:37Z) - PARTNR: Pick and place Ambiguity Resolving by Trustworthy iNteractive
leaRning [5.046831208137847]
We present the PARTNR algorithm that can detect ambiguities in the trained policy by analyzing multiple modalities in the pick and place poses.
PARTNR employs an adaptive, sensitivity-based gating function that decides whether additional user demonstrations are required.
We demonstrate the performance of PARTNR in a table-top pick and place task.
arXiv Detail & Related papers (2022-11-15T17:07:40Z) - Formalizing the Problem of Side Effect Regularization [81.97441214404247]
We propose a formal criterion for side effect regularization via the assistance game framework.
In these games, the agent solves a partially observable Markov decision process.
We show that this POMDP is solved by trading off the proxy reward with the agent's ability to achieve a range of future tasks.
arXiv Detail & Related papers (2022-06-23T16:36:13Z) - Mildly Conservative Q-Learning for Offline Reinforcement Learning [63.2183622958666]
Offline reinforcement learning (RL) is the task of learning from a static logged dataset without continually interacting with the environment.
Existing approaches, which penalize unseen actions or regularize with the behavior policy, are too pessimistic.
We propose Mildly Conservative Q-learning (MCQ), where OOD actions are actively trained by assigning them proper pseudo Q values.
arXiv Detail & Related papers (2022-06-09T19:44:35Z) - A State-Distribution Matching Approach to Non-Episodic Reinforcement
Learning [61.406020873047794]
A major hurdle to real-world application arises from the development of algorithms in an episodic setting.
We propose a new method, MEDAL, that trains the backward policy to match the state distribution in the provided demonstrations.
Our experiments show that MEDAL matches or outperforms prior methods on three sparse-reward continuous control tasks.
arXiv Detail & Related papers (2022-05-11T00:06:29Z) - Dealing with the Unknown: Pessimistic Offline Reinforcement Learning [25.30634466168587]
We propose a Pessimistic Offline Reinforcement Learning (PessORL) algorithm to actively lead the agent back to regions it is familiar with.
We focus on problems caused by out-of-distribution (OOD) states, and deliberately penalize high values at states that are absent from the training dataset.
arXiv Detail & Related papers (2021-11-09T22:38:58Z) - Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset (a minimal sketch of this idea appears after this list).
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z) - Off-Belief Learning [21.98027225621791]
We present off-belief learning (OBL) to learn optimal policies that are fully grounded.
OBL converges to a unique policy, making it more suitable for zero-shot coordination.
OBL shows strong performance in both a simple toy-setting and the benchmark human-AI/zero-shot coordination problem Hanabi.
arXiv Detail & Related papers (2021-03-06T01:09:55Z) - BRPO: Batch Residual Policy Optimization [79.53696635382592]
In batch reinforcement learning, one often constrains a learned policy to be close to the behavior (data-generating) policy.
We propose residual policies, where the allowable deviation of the learned policy is state-action-dependent.
We derive a new RL method, BRPO, which learns both the policy and the allowable deviation that jointly maximize a lower bound on policy performance.
arXiv Detail & Related papers (2020-02-08T01:59:33Z)
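The Implicit Q-Learning entry above notes that the method never needs to evaluate actions outside of the dataset. A minimal sketch of that idea, assuming PyTorch and batches of Q- and V-estimates computed on dataset transitions; the expectile `tau=0.7` is an illustrative value, not taken from the entry.
```python
import torch


def expectile_value_loss(q_dataset_actions, v_states, tau=0.7):
    """Asymmetric L2 (expectile) loss: V(s) is regressed toward Q(s, a)
    using only actions a that appear in the dataset, so no out-of-dataset
    action is ever evaluated. tau > 0.5 biases V toward high Q-values,
    approximating a max over in-distribution actions."""
    diff = q_dataset_actions - v_states
    weight = torch.abs(tau - (diff < 0).float())  # |tau - 1(diff < 0)|
    return (weight * diff.pow(2)).mean()
```
Because the regression target only uses actions that actually occur in the dataset, the value estimate avoids the optimistic bias that querying unseen actions would introduce.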
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.