State Action Separable Reinforcement Learning
- URL: http://arxiv.org/abs/2006.03713v1
- Date: Fri, 5 Jun 2020 22:02:57 GMT
- Title: State Action Separable Reinforcement Learning
- Authors: Ziyao Zhang and Liang Ma and Kin K. Leung and Konstantinos Poularakis
and Mudhakar Srivatsa
- Abstract summary: We propose a new learning paradigm, State Action Separable Reinforcement Learning (sasRL), wherein the action space is decoupled from the value function learning process for higher efficiency.
Experiments on several gaming scenarios show that sasRL outperforms state-of-the-art MDP-based RL algorithms by up to $75\%$.
- Score: 11.04892417160547
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning (RL) based methods have seen their paramount successes
in solving serial decision-making and control problems in recent years. For
conventional RL formulations, Markov Decision Process (MDP) and
state-action-value function are the basis for the problem modeling and policy
evaluation. However, several challenging issues remain. Among the most
frequently cited is the enormity of the state/action space, which makes it
difficult to approximate the state-action-value function accurately. We observe
that although actions directly define the agent's behavior, for many problems
the next state reached by a transition matters more than the action taken in
determining the return of that transition. In this regard, we
propose a new learning paradigm, State Action Separable Reinforcement Learning
(sasRL), wherein the action space is decoupled from the value function learning
process for higher efficiency. Then, a light-weight transition model is learned
to assist the agent in determining the action that triggers the associated state
transition. In addition, our convergence analysis reveals that under certain
conditions, the convergence time of sasRL is $O(T^{1/k})$, where $T$ is the
convergence time for updating the value function in the MDP-based formulation
and $k$ is a weighting factor. Experiments on several gaming scenarios show
that sasRL outperforms state-of-the-art MDP-based RL algorithms by up to
$75\%$.
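As a rough illustration of the paradigm described above, the toy sketch below (a minimal tabular example with hypothetical names such as `V`, `inverse_model`, and `reachable_next_states`) learns a state-value function from transitions and rewards alone, and uses a separate light-weight model to recover the action that triggers a desired transition. It is a sketch of the idea only, not the paper's algorithm.

```python
# Toy tabular sketch of the decoupling described in the abstract: the value
# function is learned from (state, reward, next state) alone, while a separate
# light-weight model records which action triggered each observed transition.
# All names, update rules, and the tabular setting are illustrative assumptions.
import numpy as np

n_states, n_actions = 16, 4
gamma, alpha, epsilon = 0.99, 0.1, 0.1

V = np.zeros(n_states)   # state-value estimates, learned without reference to actions
inverse_model = {}       # (s, s_next) -> action observed to trigger that transition

def update(s, a, r, s_next):
    # value learning uses only the transition and its reward, not the action
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    # the light-weight transition model remembers the action separately
    inverse_model[(s, s_next)] = a

def act(s, reachable_next_states):
    # pick the most valuable reachable next state, then look up an action that
    # was previously observed to produce that transition
    if np.random.rand() < epsilon or not reachable_next_states:
        return np.random.randint(n_actions)
    best_next = max(reachable_next_states, key=lambda sn: V[sn])
    return inverse_model.get((s, best_next), np.random.randint(n_actions))
```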
Related papers
- Towards Cost Sensitive Decision Making [14.279123976398926]
In this work, we consider RL models that may actively acquire features from the environment to improve the decision quality and certainty.
We propose the Active-Acquisition POMDP and identify two types of acquisition processes for different application domains.
To assist the agent in this actively-acquired, partially observed environment and to alleviate the exploration-exploitation dilemma, we develop a model-based approach.
arXiv Detail & Related papers (2024-10-04T19:48:23Z)
- Sublinear Regret for a Class of Continuous-Time Linear-Quadratic Reinforcement Learning Problems [10.404992912881601]
We study reinforcement learning for a class of continuous-time linear-quadratic (LQ) control problems for diffusions.
We apply a model-free approach that relies neither on knowledge of the model parameters nor on their estimates, and devise an actor-critic algorithm to learn the optimal policy parameter directly.
arXiv Detail & Related papers (2024-07-24T12:26:21Z)
- Efficient Reinforcement Learning with Impaired Observability: Learning to Act with Delayed and Missing State Observations [92.25604137490168]
This paper introduces a theoretical investigation into efficient reinforcement learning in control systems.
We present algorithms and establish near-optimal regret upper and lower bounds, of the form $\tilde{\mathcal{O}}(\sqrt{\mathrm{poly}(H)\,SAK})$, for RL in the delayed and missing observation settings.
arXiv Detail & Related papers (2023-06-02T02:46:39Z)
- Model-Based Reinforcement Learning with Multinomial Logistic Function Approximation [10.159501412046508]
We study model-based reinforcement learning (RL) for episodic Markov decision processes (MDPs).
We establish a provably efficient RL algorithm for the MDP whose state transition is given by a multinomial logistic model.
To the best of our knowledge, this is the first model-based RL algorithm with multinomial logistic function approximation that has provable guarantees.
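For context only, the snippet below sketches what a multinomial logistic (softmax) transition model generally looks like: the probability of each next state is proportional to $\exp(\phi(s,a,s')^\top \theta)$. The feature tensor, sizes, and parameter names are placeholder assumptions, not the construction used in the paper above.

```python
# Hedged sketch of a multinomial logistic (softmax) transition model:
# P(s' | s, a) is proportional to exp(phi(s, a, s')^T theta).
# The feature tensor, sizes, and parameters are toy placeholders.
import numpy as np

d, n_states, n_actions = 8, 5, 3
rng = np.random.default_rng(0)
theta = rng.standard_normal(d)                                       # parameter the learner would estimate
features = rng.standard_normal((n_states, n_actions, n_states, d))   # phi(s, a, s')

def transition_probs(s, a):
    logits = features[s, a] @ theta   # one logit per candidate next state
    logits -= logits.max()            # numerical stability
    p = np.exp(logits)
    return p / p.sum()                # a valid distribution over next states

print(transition_probs(0, 1))
```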
arXiv Detail & Related papers (2022-12-27T16:25:09Z)
- Value-Consistent Representation Learning for Data-Efficient Reinforcement Learning [105.70602423944148]
We propose a novel method, called value-consistent representation learning (VCR), to learn representations that are directly related to decision-making.
Instead of aligning an imagined future state with the real state returned by the environment, VCR applies a $Q$-value head to both states and obtains two distributions of action values.
The method is demonstrated to achieve new state-of-the-art performance among search-free RL algorithms.
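To make the mechanism in the summary above concrete, here is a minimal sketch that applies one shared $Q$-value head to an imagined state representation and to the real one, and pulls the two action-value distributions together; the KL-based loss, shapes, and names are assumptions rather than the paper's exact objective.

```python
# Minimal sketch of the value-consistency idea summarised above: a shared
# Q-value head is applied to an imagined next-state representation and to the
# real one, and the two action-value distributions are aligned. The KL loss,
# shapes, and names are assumptions, not the paper's exact objective.
import torch
import torch.nn.functional as F

repr_dim, n_actions, batch = 32, 6, 4
q_head = torch.nn.Linear(repr_dim, n_actions)    # shared Q-value head

def value_consistency_loss(imagined_repr, real_repr):
    q_imagined = q_head(imagined_repr)           # action values from the imagined state
    q_real = q_head(real_repr).detach()          # real-state values serve as the target
    log_p_imagined = F.log_softmax(q_imagined, dim=-1)
    p_real = F.softmax(q_real, dim=-1)
    return F.kl_div(log_p_imagined, p_real, reduction="batchmean")

loss = value_consistency_loss(torch.randn(batch, repr_dim), torch.randn(batch, repr_dim))
loss.backward()                                  # gradients flow into the shared head
```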
arXiv Detail & Related papers (2022-06-25T03:02:25Z)
- An Experimental Design Perspective on Model-Based Reinforcement Learning [73.37942845983417]
In practical applications of RL, it is expensive to observe state transitions from the environment.
We propose an acquisition function that quantifies how much information a state-action pair would provide about the optimal solution to a Markov decision process.
arXiv Detail & Related papers (2021-12-09T23:13:57Z)
- Exploratory State Representation Learning [63.942632088208505]
We propose a new approach called XSRL (eXploratory State Representation Learning) to solve the problems of exploration and state representation learning (SRL) in parallel.
On one hand, it jointly learns compact state representations and a state transition estimator which is used to remove unexploitable information from the representations.
On the other hand, it continuously trains an inverse model, and adds to the prediction error of this model a $k$-step learning progress bonus to form the objective of a discovery policy.
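Read very loosely, the exploration objective described above could look like the assumption-heavy snippet below, where the discovery policy is rewarded by the inverse model's prediction error plus a $k$-step learning-progress bonus (approximated here as the drop in that error over the last $k$ updates); XSRL's actual objective is more involved.

```python
# Very rough sketch of the exploration reward described above: inverse-model
# prediction error plus a k-step learning-progress bonus, approximated here as
# the decrease in that error over the last k updates. Every detail below is an
# assumption, not XSRL's actual objective.
from collections import deque

k = 5
recent_errors = deque(maxlen=k)

def discovery_reward(inverse_model_error: float) -> float:
    # learning progress: how much the error has dropped since k updates ago
    progress = max(0.0, recent_errors[0] - inverse_model_error) if len(recent_errors) == k else 0.0
    recent_errors.append(inverse_model_error)
    return inverse_model_error + progress
```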
arXiv Detail & Related papers (2021-09-28T10:11:07Z)
- Exploiting Submodular Value Functions For Scaling Up Active Perception [60.81276437097671]
In active perception tasks, the agent aims to select sensory actions that reduce uncertainty about one or more hidden variables.
Partially observable Markov decision processes (POMDPs) provide a natural model for such problems.
As the number of sensors available to the agent grows, the computational cost of POMDP planning grows exponentially.
arXiv Detail & Related papers (2020-09-21T09:11:36Z)
- Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.