Lazy-MDPs: Towards Interpretable Reinforcement Learning by Learning When
to Act
- URL: http://arxiv.org/abs/2203.08542v1
- Date: Wed, 16 Mar 2022 11:06:25 GMT
- Title: Lazy-MDPs: Towards Interpretable Reinforcement Learning by Learning When
to Act
- Authors: Alexis Jacq, Johan Ferret, Olivier Pietquin, Matthieu Geist
- Abstract summary: We propose to augment the standard Markov Decision Process and make a new mode of action available: being lazy.
We study the theoretical properties of lazy-MDPs, expressing value functions and characterizing optimal solutions.
We deem those states and corresponding actions important since they explain the difference in performance between the default and the new, lazy policy.
- Score: 42.909535340099296
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Traditionally, Reinforcement Learning (RL) aims at deciding how to act
optimally for an artificial agent. We argue that deciding when to act is
equally important. As humans, we drift from default, instinctive or memorized
behaviors to focused, thought-out behaviors when required by the situation. To
enhance RL agents with this aptitude, we propose to augment the standard Markov
Decision Process and make a new mode of action available: being lazy, which
defers decision-making to a default policy. In addition, we penalize non-lazy
actions in order to encourage minimal effort and have agents focus on critical
decisions only. We name the resulting formalism lazy-MDPs. We study the
theoretical properties of lazy-MDPs, expressing value functions and
characterizing optimal solutions. Then we empirically demonstrate that policies
learned in lazy-MDPs generally come with a form of interpretability: by
construction, they show us the states where the agent takes control over the
default policy. We deem those states and corresponding actions important since
they explain the difference in performance between the default and the new,
lazy policy. With suboptimal policies as default (pretrained or random), we
observe that agents are able to get competitive performance in Atari games
while only taking control in a limited subset of states.
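As an illustration of the construction described above, here is a minimal sketch of a lazy-MDP written as a Gymnasium-style wrapper: the action space gains one extra "lazy" action that defers the decision to a default policy, and every non-lazy action incurs a small penalty. The wrapper class, the `default_policy` callable, and the `penalty` value are illustrative assumptions based on the abstract, not the authors' implementation.
```python
import gymnasium as gym


class LazyMDPWrapper(gym.Wrapper):
    """Sketch of a lazy-MDP: adds one extra 'lazy' action that defers to a
    default policy, and charges a penalty for every non-lazy action."""

    def __init__(self, env, default_policy, penalty=0.1):
        super().__init__(env)
        assert isinstance(env.action_space, gym.spaces.Discrete)
        self.default_policy = default_policy      # maps observation -> action (assumed callable)
        self.penalty = penalty                    # cost of taking control (illustrative value)
        self.lazy_action = env.action_space.n     # last action index means 'be lazy'
        self.action_space = gym.spaces.Discrete(env.action_space.n + 1)
        self._last_obs = None

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._last_obs = obs
        return obs, info

    def step(self, action):
        if action == self.lazy_action:
            # Defer to the default policy: no penalty is applied.
            env_action = self.default_policy(self._last_obs)
            extra_cost = 0.0
        else:
            # The agent takes control and pays for the effort.
            env_action = action
            extra_cost = self.penalty
        obs, reward, terminated, truncated, info = self.env.step(env_action)
        self._last_obs = obs
        info["took_control"] = action != self.lazy_action
        return obs, reward - extra_cost, terminated, truncated, info
```
With a pretrained or even random default policy plugged in (e.g. `LazyMDPWrapper(gym.make("CartPole-v1"), default_policy=lambda obs: 0)`), the states in which a trained agent picks a non-lazy action are exactly the states the abstract flags as important for interpretability.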
Related papers
- When Can Model-Free Reinforcement Learning be Enough for Thinking? [3.5253513747455303]
This paper builds a domain-independent understanding of when model-free RL will lead to "thinking" as a strategy for reward.
We show formally that thought actions are equivalent to the agent choosing to perform a step of policy improvement before continuing to act.
We then show that open-source LLMs satisfy the conditions that our theory predicts are necessary for model-free RL to produce thinking-like behavior.
arXiv Detail & Related papers (2025-06-20T16:23:46Z) - Deontically Constrained Policy Improvement in Reinforcement Learning Agents [0.0]
Markov Decision Processes (MDPs) are the most common model for decision making under uncertainty in the Machine Learning community.
An MDP captures non-determinism, probabilistic uncertainty, and an explicit model of action.
A Reinforcement Learning (RL) agent learns to act in an MDP by maximizing a utility function.
arXiv Detail & Related papers (2025-06-08T01:01:06Z) - FDPP: Fine-tune Diffusion Policy with Human Preference [57.44575105114056]
Fine-tuning Diffusion Policy with Human Preference learns a reward function through preference-based learning.
This reward is then used to fine-tune the pre-trained policy with reinforcement learning.
Experiments demonstrate that FDPP effectively customizes policy behavior without compromising performance.
arXiv Detail & Related papers (2025-01-14T17:15:27Z) - Tackling Decision Processes with Non-Cumulative Objectives using Reinforcement Learning [0.0]
We introduce a general mapping of non-cumulative Markov decision processes to standard MDPs.
This allows all techniques developed to find optimal policies for MDPs to be directly applied to the larger class of NCMDPs.
We show applications in a diverse set of tasks, including classical control, portfolio optimization in finance, and discrete optimization problems.
arXiv Detail & Related papers (2024-05-22T13:01:37Z) - PARTNR: Pick and place Ambiguity Resolving by Trustworthy iNteractive
leaRning [5.046831208137847]
We present the PARTNR algorithm that can detect ambiguities in the trained policy by analyzing multiple modalities in the pick and place poses.
PARTNR employs an adaptive, sensitivity-based gating function that decides whether additional user demonstrations are required.
We demonstrate the performance of PARTNR in a table-top pick and place task.
arXiv Detail & Related papers (2022-11-15T17:07:40Z) - Formalizing the Problem of Side Effect Regularization [81.97441214404247]
We propose a formal criterion for side effect regularization via the assistance game framework.
In these games, the agent solves a partially observable Markov decision process.
We show that this POMDP is solved by trading off the proxy reward with the agent's ability to achieve a range of future tasks.
arXiv Detail & Related papers (2022-06-23T16:36:13Z) - Mildly Conservative Q-Learning for Offline Reinforcement Learning [63.2183622958666]
Offline reinforcement learning (RL) is the task of learning from a static logged dataset without continually interacting with the environment.
Existing approaches, which penalize unseen actions or regularize with the behavior policy, are too pessimistic.
We propose Mildly Conservative Q-learning (MCQ), where OOD actions are actively trained by assigning them proper pseudo Q values.
arXiv Detail & Related papers (2022-06-09T19:44:35Z) - A State-Distribution Matching Approach to Non-Episodic Reinforcement
Learning [61.406020873047794]
A major hurdle to real-world application arises from the development of algorithms in an episodic setting.
We propose a new method, MEDAL, that trains the backward policy to match the state distribution in the provided demonstrations.
Our experiments show that MEDAL matches or outperforms prior methods on three sparse-reward continuous control tasks.
arXiv Detail & Related papers (2022-05-11T00:06:29Z) - Dealing with the Unknown: Pessimistic Offline Reinforcement Learning [25.30634466168587]
We propose a Pessimistic Offline Reinforcement Learning (PessORL) algorithm to actively lead the agent back to regions it is familiar with.
We focus on problems caused by out-of-distribution (OOD) states, and deliberately penalize high values at states that are absent from the training dataset.
arXiv Detail & Related papers (2021-11-09T22:38:58Z) - Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset (a minimal sketch of this idea appears after this list).
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z) - Off-Belief Learning [21.98027225621791]
We present off-belief learning (OBL) to learn optimal policies that are fully grounded.
OBL converges to a unique policy, making it more suitable for zero-shot coordination.
OBL shows strong performance in both a simple toy-setting and the benchmark human-AI/zero-shot coordination problem Hanabi.
arXiv Detail & Related papers (2021-03-06T01:09:55Z) - BRPO: Batch Residual Policy Optimization [79.53696635382592]
In batch reinforcement learning, one often constrains a learned policy to be close to the behavior (data-generating) policy.
We propose residual policies, where the allowable deviation of the learned policy is state-action-dependent.
We derive a new RL method, BRPO, which learns both the policy and the allowable deviation that jointly maximize a lower bound on policy performance.
arXiv Detail & Related papers (2020-02-08T01:59:33Z)
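The Implicit Q-Learning entry above notes that the method never needs to evaluate actions outside of the dataset. A minimal sketch of that idea, assuming PyTorch and batches of Q- and V-estimates computed on dataset transitions; the expectile `tau=0.7` is an illustrative value, not taken from the entry.
```python
import torch


def expectile_value_loss(q_dataset_actions, v_states, tau=0.7):
    """Asymmetric L2 (expectile) loss: V(s) is regressed toward Q(s, a)
    using only actions a that appear in the dataset, so no out-of-dataset
    action is ever evaluated. tau > 0.5 biases V toward high Q-values,
    approximating a max over in-distribution actions."""
    diff = q_dataset_actions - v_states
    weight = torch.abs(tau - (diff < 0).float())  # |tau - 1(diff < 0)|
    return (weight * diff.pow(2)).mean()
```
Because the regression target only uses actions that actually occur in the dataset, the value estimate avoids the optimistic bias that querying unseen actions would introduce.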
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.