Fitted Q-Learning for Relational Domains
- URL: http://arxiv.org/abs/2006.05595v1
- Date: Wed, 10 Jun 2020 01:18:47 GMT
- Title: Fitted Q-Learning for Relational Domains
- Authors: Srijita Das, Sriraam Natarajan, Kaushik Roy, Ronald Parr and Kristian Kersting
- Abstract summary: We develop the first relational fitted Q-learning algorithms by representing the value function and Bellman residuals.
We show how the two steps of the Bellman operator, application and projection, can be performed using a gradient-boosting technique.
- Score: 29.90646258513537
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of Approximate Dynamic Programming in relational domains. Inspired by the success of fitted Q-learning methods in propositional settings, we develop the first relational fitted Q-learning algorithms by representing the value function and Bellman residuals. When we fit the Q-functions, we show how the two steps of the Bellman operator, application and projection, can be performed using a gradient-boosting technique. Our proposed framework performs reasonably well on standard domains without using domain models and with fewer training trajectories.
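For intuition, here is a minimal sketch of the fitted-Q loop the abstract refers to: the Bellman application step builds regression targets r + gamma * max_a' Q(s', a'), and the projection step fits a gradient-boosted regressor to those targets. The paper works with relational (lifted) representations; this sketch substitutes plain feature vectors and scikit-learn's GradientBoostingRegressor, so it is an assumption-laden stand-in rather than the paper's algorithm.

```python
# Minimal fitted Q-iteration sketch (propositional stand-in, assumed setup):
# "application" builds targets r + gamma * max_a' Q(s', a'); "projection"
# fits a gradient-boosted regressor to those targets.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fitted_q_iteration(transitions, n_actions, gamma=0.9, n_iters=20):
    """transitions: list of (state_features, action, reward, next_state_features)."""
    X = np.array([np.append(s, a) for s, a, _, _ in transitions])
    q_model = None
    for _ in range(n_iters):
        targets = []
        for s, a, r, s_next in transitions:
            if q_model is None:
                targets.append(r)  # first pass: bootstrap from immediate rewards
            else:
                q_next = [q_model.predict(np.append(s_next, a2).reshape(1, -1))[0]
                          for a2 in range(n_actions)]
                targets.append(r + gamma * max(q_next))  # Bellman application step
        q_model = GradientBoostingRegressor(n_estimators=50, max_depth=3)
        q_model.fit(X, np.array(targets))  # projection step via gradient boosting
    return q_model
```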
Related papers
- Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization [50.485788083202124]
Reinforcement Learning (RL) plays a crucial role in aligning large language models with human preferences and improving their ability to perform complex tasks.
We introduce Direct Q-function Optimization (DQO), which formulates the response generation process as a Markov Decision Process (MDP) and utilizes the soft actor-critic (SAC) framework to optimize a Q-function directly parameterized by the language model.
Experimental results on two math problem-solving datasets, GSM8K and MATH, demonstrate that DQO outperforms previous methods, establishing it as a promising offline reinforcement learning approach for aligning language models.
arXiv Detail & Related papers (2024-10-11T23:29:20Z)
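As a rough illustration of the SAC-style backup behind Q-function optimization for generation MDPs, the sketch below computes a soft Bellman target over token-level Q-values. Reading Q(s, .) off the language model's logits, and the single-step form, are assumptions made for brevity, not DQO's exact parameterization or loss.

```python
# Schematic soft Bellman target for token-level Q-values (SAC-style backup).
# Treating the LM's next-step logits as Q(s', .) is an illustrative assumption.
import torch

def soft_bellman_target(reward, next_q_logits, alpha=1.0, gamma=1.0, done=False):
    """reward: scalar reward at this token step; next_q_logits: [vocab] tensor."""
    if done:
        return torch.tensor(float(reward))
    # Soft value of the next state: V(s') = alpha * logsumexp(Q(s', .) / alpha)
    v_next = alpha * torch.logsumexp(next_q_logits / alpha, dim=-1)
    return reward + gamma * v_next
```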
- Trainability maximization using estimation of distribution algorithms assisted by surrogate modelling for quantum architecture search [8.226785409557598]
Quantum architecture search (QAS) involves optimizing both the parametric quantum circuit configuration and its parameters for a variational quantum algorithm.
In this paper, we aim to achieve two improvements in QAS: (1) to reduce the number of measurements by an online surrogate model of the evaluation process that aggressively discards architectures of poor performance; (2) to avoid training the circuits when barren plateaus (BPs) are present.
We experimentally validate our proposal for the variational quantum eigensolver and showcase that our algorithm is able not only to find solutions previously proposed in the literature for the Hamiltonians, but also to outperform the state of the art.
arXiv Detail & Related papers (2024-07-29T15:22:39Z)
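A hedged sketch of the surrogate-based filtering idea: a cheap regressor trained on already-evaluated circuits predicts the score of new candidate architectures, and only the most promising fraction is sent to real (expensive) evaluation. The model choice, encoding, and keep_fraction threshold are illustrative assumptions, not the paper's setup.

```python
# Hypothetical surrogate-assisted filter: predict evaluation scores of new
# candidate architectures from past measurements and keep only the best few.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def filter_candidates(history_X, history_y, candidates_X, keep_fraction=0.2):
    """history_X/history_y: encodings and measured scores of evaluated circuits
    (lower score = better); candidates_X: encodings of new candidates."""
    surrogate = RandomForestRegressor(n_estimators=100).fit(history_X, history_y)
    predicted = surrogate.predict(candidates_X)
    n_keep = max(1, int(keep_fraction * len(candidates_X)))
    keep_idx = np.argsort(predicted)[:n_keep]  # candidates predicted to do best
    return keep_idx  # only these are sent to the real, expensive evaluation
```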
- Regularized Q-Learning with Linear Function Approximation [2.765106384328772]
We consider a bi-level optimization formulation of regularized Q-learning with linear function approximation.
We show that, under certain assumptions, the proposed algorithm converges to a stationary point in the presence of Markovian noise.
arXiv Detail & Related papers (2024-01-26T20:45:40Z)
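For reference, a plain regularized Q-learning update with linear function approximation might look like the following; the L2 weight penalty used here is just one simple regularizer and does not reproduce the paper's bi-level formulation or its convergence analysis.

```python
# Illustrative regularized Q-learning step with linear function approximation
# Q(s, a) = w . phi(s, a); the L2 term is a simple stand-in regularizer.
import numpy as np

def q_update(w, phi_sa, reward, phi_next_best, gamma=0.99, lr=0.05, reg=1e-3):
    """phi_sa: features of (s, a); phi_next_best: features of (s', argmax_a' Q)."""
    td_error = reward + gamma * np.dot(w, phi_next_best) - np.dot(w, phi_sa)
    grad = -td_error * phi_sa + reg * w  # semi-gradient of TD loss plus L2 penalty
    return w - lr * grad
```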
- Pointer Networks with Q-Learning for Combinatorial Optimization [55.2480439325792]
We introduce the Pointer Q-Network (PQN), a hybrid neural architecture that integrates model-free Q-value policy approximation with Pointer Networks (Ptr-Nets).
Our empirical results demonstrate the efficacy of this approach, including tests of the model in unstable environments.
arXiv Detail & Related papers (2023-11-05T12:03:58Z)
- Let's reward step by step: Step-Level reward model as the Navigators for Reasoning [64.27898739929734]
A Process-Supervised Reward Model (PRM) furnishes LLMs with step-by-step feedback during the training phase.
We propose a greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs.
To explore the versatility of our approach, we develop a novel method to automatically generate a step-level reward dataset for coding tasks, and we observe similar performance improvements on code generation tasks.
arXiv Detail & Related papers (2023-10-16T05:21:50Z)
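A minimal sketch of the greedy step-level search described above: sample several candidate next steps, score each extended partial solution with the PRM, and keep the best. generate_step and prm_score are hypothetical callables standing in for the LLM and the reward model; the stopping rule is likewise illustrative.

```python
# Greedy step-level search guided by a process reward model (PRM).
# `generate_step` and `prm_score` are hypothetical callables.
def greedy_prm_search(question, generate_step, prm_score, n_candidates=5, max_steps=8):
    steps = []
    for _ in range(max_steps):
        candidates = [generate_step(question, steps) for _ in range(n_candidates)]
        scored = [(prm_score(question, steps + [c]), c) for c in candidates]
        best_score, best_step = max(scored, key=lambda t: t[0])
        steps.append(best_step)  # keep the candidate the PRM rates highest
        if best_step.strip().startswith("Final answer"):  # illustrative stop rule
            break
    return steps
```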
- Convex Q Learning in a Stochastic Environment: Extended Version [1.680268810119084]
The paper introduces the first formulation of convex Q-learning for Markov decision processes with function approximation.
The proposed algorithms are convergent and new techniques are introduced to obtain the rate of convergence in a mean-square sense.
The theory is illustrated with an application to a classical inventory control problem.
arXiv Detail & Related papers (2023-09-10T18:24:43Z)
- Optimising Rolling Stock Planning including Maintenance with Constraint Programming and Quantum Annealing [0.0]
We propose and compare Constraint Programming (CP) and Quantum Annealing (QA) approaches for rolling stock assignment optimisation.
In the CP approach, we model the problem with an Alldifferent constraint, extensions of the Element constraint, and logical implications.
For the QA approach, we develop a quadratic unconstrained binary optimisation (QUBO) model.
arXiv Detail & Related papers (2021-09-15T11:00:53Z)
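To make the QUBO side concrete, the toy builder below encodes an assignment-style constraint (each service covered by exactly one unit) as quadratic penalties, which is the standard QUBO recipe; it omits the paper's maintenance requirements and cost terms, so it is only a sketch of the encoding style.

```python
# Toy QUBO for an assignment-style constraint: each service must be covered by
# exactly one unit, enforced by the penalty P * (1 - sum_u x[s, u])^2.
from collections import defaultdict

def assignment_qubo(services, units, penalty=10.0):
    Q = defaultdict(float)  # Q[(var_i, var_j)] -> coefficient
    for s in services:
        vars_s = [(s, u) for u in units]
        for v in vars_s:
            Q[(v, v)] += -penalty           # -2x + x^2 = -x for binary x
        for i, v in enumerate(vars_s):
            for w in vars_s[i + 1:]:
                Q[(v, w)] += 2.0 * penalty  # cross terms 2 * x_u * x_u'
    return dict(Q)  # constant offset of +penalty per service is dropped
```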
- A Value-Function-based Interior-point Method for Non-convex Bi-level Optimization [38.75417864443519]
Bi-level optimization models are able to capture a wide range of complex learning tasks of practical interest.
We propose a new Bi-level Value-Function-based Interior-point scheme: we penalize the regularized value function of the lower-level problem into the upper-level objective.
arXiv Detail & Related papers (2021-06-15T09:10:40Z)
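A hedged sketch of the value-function penalty idea stated above: the regularized lower-level value v_mu(x) is computed by an inner solve, and the optimality gap of the current y is pushed into the upper-level objective as a penalty. The interior-point machinery and the exact penalty form of the paper are simplified away here.

```python
# Value-function penalty sketch for bi-level optimization: solve the
# regularized lower-level problem to get v_mu(x), then penalize the gap
# between the current y and that value in the upper-level objective.
import numpy as np
from scipy.optimize import minimize

def penalized_upper_objective(x, y, F, f, mu=0.1, rho=10.0):
    """F(x, y): upper-level objective; f(x, y): lower-level objective; y: array."""
    inner = minimize(lambda z: f(x, z) + mu * np.sum(z ** 2), y)  # v_mu(x)
    gap = f(x, y) + mu * np.sum(y ** 2) - inner.fun  # nonnegative; zero at lower-level optima
    return F(x, y) + rho * gap  # gap penalized into the upper-level objective
```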
- Learning MDPs from Features: Predict-Then-Optimize for Sequential Decision Problems by Reinforcement Learning [52.74071439183113]
We study the predict-then-optimize framework in the context of sequential decision problems (formulated as MDPs) solved via reinforcement learning.
Two significant computational challenges arise in applying decision-focused learning to MDPs.
arXiv Detail & Related papers (2021-06-06T23:53:31Z)
- Outcome-Driven Reinforcement Learning via Variational Inference [95.82770132618862]
We discuss a new perspective on reinforcement learning, recasting it as the problem of inferring actions that achieve desired outcomes, rather than a problem of maximizing rewards.
To solve the resulting outcome-directed inference problem, we establish a novel variational inference formulation that allows us to derive a well-shaped reward function.
We empirically demonstrate that this method eliminates the need to design reward functions and leads to effective goal-directed behaviors.
arXiv Detail & Related papers (2021-04-20T18:16:21Z)
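As a generic stand-in for a reward "derived by inference over outcomes", the snippet below scores an action by the log-likelihood of the desired outcome under a learned Gaussian next-state model. This only conveys the flavor of the approach; the paper's variational objective is different and more general.

```python
# Generic "outcome likelihood" reward: score an action by the log-probability
# of the desired outcome under a learned Gaussian next-state model. A simple
# stand-in for the paper's variational derivation, not its exact objective.
import numpy as np

def outcome_log_likelihood_reward(goal, pred_mean, pred_std):
    """goal, pred_mean, pred_std: arrays for the desired outcome and the model's
    predicted next-state distribution p(s' | s, a)."""
    var = pred_std ** 2
    return -0.5 * np.sum((goal - pred_mean) ** 2 / var + np.log(2 * np.pi * var))
```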
- Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z)