Understanding Curriculum Learning in Policy Optimization for Online
Combinatorial Optimization
- URL: http://arxiv.org/abs/2202.05423v3
- Date: Fri, 3 Nov 2023 21:12:09 GMT
- Title: Understanding Curriculum Learning in Policy Optimization for Online
Combinatorial Optimization
- Authors: Runlong Zhou, Zelin He, Yuandong Tian, Yi Wu, Simon S. Du
- Abstract summary: This paper presents the first systematic study on policy optimization methods for online CO problems.
We show that online CO problems can be naturally formulated as latent Markov Decision Processes (LMDPs), and prove convergence bounds on natural policy gradient (NPG) for solving LMDPs.
Furthermore, our theory explains the benefit of curriculum learning: it can find a strong sampling policy and reduce the distribution shift.
- Score: 66.35750142827898
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Over recent years, reinforcement learning (RL) has started to show
promising results in tackling combinatorial optimization (CO) problems, in
particular when coupled with curriculum learning to facilitate training. Despite
emerging empirical evidence, the theoretical study of why RL helps is still in
its early stages. This paper presents the first systematic study on policy optimization
methods for online CO problems. We show that online CO problems can be
naturally formulated as latent Markov Decision Processes (LMDPs), and prove
convergence bounds on natural policy gradient (NPG) for solving LMDPs.
Furthermore, our theory explains the benefit of curriculum learning: it can
find a strong sampling policy and reduce the distribution shift, a critical
quantity that governs the convergence rate in our theorem. For a canonical
online CO problem, the Best Choice Problem (BCP), we formally prove that
distribution shift is reduced exponentially with curriculum learning even if
the curriculum is a randomly generated BCP on a smaller scale. Our theory also
shows we can simplify the curriculum learning scheme used in prior work from
multi-step to single-step. Lastly, we provide extensive experiments on the Best
Choice Problem, Online Knapsack, and AdWords to verify our findings.
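As a rough illustration of the LMDP view (a sketch in assumed notation, not the paper's verbatim definitions): a latent context, such as the random problem instance, is drawn once per episode and never observed by the agent, so a single policy must perform well on average over instances.

```latex
% Minimal LMDP objective in assumed notation: a latent context m ~ \nu is drawn
% once per episode and induces dynamics P_m and rewards r_m; the agent never
% observes m directly.
\[
  V^{\pi} \;=\; \mathbb{E}_{m \sim \nu}\,
  \mathbb{E}^{\pi}_{P_m}\!\left[\sum_{h=1}^{H} r_m(s_h, a_h)\right]
\]
```

The single-step curriculum result also lends itself to a compact sketch. The following is hypothetical code, not the authors' implementation: the environment encoding, the per-step logit policy, and all hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch (not the paper's code): single-step curriculum learning
# on the Best Choice Problem (BCP), a.k.a. the secretary problem. Stage 1
# trains a time-indexed accept/reject policy on a smaller random instance;
# stage 2 uses it to warm-start training at the target scale.
import numpy as np

rng = np.random.default_rng(0)

def run_episode(theta, n):
    """Play one BCP episode of length n; return (reward, grad of log-prob)."""
    ranks = rng.permutation(n)     # ranks[t] = quality of arrival t; n-1 is best
    best_so_far = -1
    grad = np.zeros_like(theta)
    for t in range(n):
        if ranks[t] < best_so_far:
            continue               # not a new maximum: accepting cannot win
        best_so_far = ranks[t]
        p = 1.0 / (1.0 + np.exp(-theta[t]))   # accept probability at step t
        if rng.random() < p:
            grad[t] = 1.0 - p      # d/d theta[t] of log p (accept taken)
            return float(ranks[t] == n - 1), grad
        grad[t] = -p               # d/d theta[t] of log (1 - p) (reject taken)
    return 0.0, grad               # never accepted anyone

def train(theta, n, iters=2000, lr=0.5, batch=64):
    """Plain REINFORCE on the per-step accept logits."""
    for _ in range(iters):
        g = np.zeros_like(theta)
        for _ in range(batch):
            r, gi = run_episode(theta, n)
            g += r * gi            # score-function gradient estimate
        theta += lr * g / batch
    return theta

n_small, n_target = 5, 20
# Stage 1: train on a randomly generated smaller-scale BCP (the curriculum).
theta_small = train(np.zeros(n_small), n_small)
# Stage 2 (single-step transfer): stretch the small policy across the target
# horizon as a warm start, then fine-tune at the target scale.
idx = np.minimum(np.arange(n_target) * n_small // n_target, n_small - 1)
theta_target = train(theta_small[idx].copy(), n_target, iters=500)
wins = [run_episode(theta_target, n_target)[0] for _ in range(2000)]
print(f"success rate ~ {np.mean(wins):.3f} (classic optimum is 1/e ~ 0.368)")
```

The warm start from the smaller instance stands in for the "strong sampling policy" the theory points to; on the classic BCP the success rate should approach the 1/e optimum of roughly 0.368.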
Related papers
- Efficient Imitation Learning with Conservative World Models [54.52140201148341]
We tackle the problem of policy learning from expert demonstrations without a reward function.
We re-frame imitation learning as a fine-tuning problem, rather than a pure reinforcement learning one.
arXiv Detail & Related papers (2024-05-21T20:53:18Z)
- Offline Imitation Learning from Multiple Baselines with Applications to Compiler Optimization [17.729842629392742]
We study a Reinforcement Learning problem in which we are given a set of trajectories collected with K baseline policies.
The goal is to learn a policy which performs as well as the best combination of baselines on the entire state space.
arXiv Detail & Related papers (2024-03-28T14:34:02Z)
- How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
- Graph Q-Learning for Combinatorial Optimization [44.8086492019594]
Graph Neural Networks (GNNs) have been shown to be effective at solving prediction and inference problems on graph data.
We propose and demonstrate that GNNs can be applied to solve Combinatorial Optimization problems.
arXiv Detail & Related papers (2024-01-11T01:15:28Z)
- Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also outperforms competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z)
- Optimizing Solution-Samplers for Combinatorial Problems: The Landscape of Policy-Gradient Methods [52.0617030129699]
We introduce a novel theoretical framework for analyzing the effectiveness of DeepMatching Networks and Reinforcement Learning methods.
Our main contribution holds for a broad class of problems including Max- and Min-Cut, Max-$k$-Bipartite-Matching, Maximum-Weight-Bipartite-Matching, and the Traveling Salesman Problem.
As a byproduct of our analysis we introduce a novel regularization process over vanilla descent and provide theoretical and experimental evidence that it helps address vanishing-gradient issues and escape bad stationary points.
arXiv Detail & Related papers (2023-10-08T23:39:38Z)
- Realistic Unsupervised CLIP Fine-tuning with Universal Entropy Optimization [101.08992036691673]
This paper explores a realistic unsupervised fine-tuning scenario, considering the presence of out-of-distribution samples from unknown classes.
In particular, we focus on simultaneously enhancing out-of-distribution detection and the recognition of instances associated with known classes.
We present a simple, efficient, and effective approach called Universal Entropy Optimization (UEO).
arXiv Detail & Related papers (2023-08-24T16:47:17Z)
- Provably Efficient Offline Goal-Conditioned Reinforcement Learning with General Function Approximation and Single-Policy Concentrability [11.786486763236104]
Goal-conditioned reinforcement learning (GCRL) refers to learning general-purpose skills that aim to reach diverse goals.
Offline GCRL requires only purely pre-collected datasets to perform training tasks.
We show that a modified offline GCRL algorithm is provably efficient under both general function approximation and single-policy concentrability.
arXiv Detail & Related papers (2023-02-07T22:04:55Z)
- Unsupervised Learning for Combinatorial Optimization Needs Meta-Learning [14.86600327306136]
A general framework of unsupervised learning for combinatorial optimization (CO) is to train a neural network (NN) whose output gives a problem solution by directly optimizing the CO objective.
We propose a new objective of unsupervised learning for CO where the goal of learning is to search for good initialization for future problem instances rather than give direct solutions.
We observe that even just the initial solution given by our model before fine-tuning can significantly outperform the baselines under various evaluation settings.
arXiv Detail & Related papers (2023-01-08T22:14:59Z)
- SOLO: Search Online, Learn Offline for Combinatorial Optimization Problems [4.777801093677586]
We study problems with real-world applications such as machine scheduling, routing, and assignment.
We propose a method that combines Reinforcement Learning (RL) and planning.
This method can be applied equally to the offline and online variants of the problem; in the online variant, the problem components are not known in advance but arrive during the decision-making process.
arXiv Detail & Related papers (2021-04-04T17:12:24Z)
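The search-online/learn-offline combination described in the SOLO entry above lends itself to a brief sketch. This is hypothetical code, not the SOLO implementation; `env.legal_actions()`, `env.clone()`, `env.step()`, and `policy()` are assumed interfaces.

```python
# Hypothetical sketch of "search online, learn offline": a policy learned
# offline guides simulated rollouts inside a simple online one-step lookahead.
# All interfaces below are assumptions for illustration, not the SOLO API.
def plan_one_step(env, policy, n_rollouts=16, horizon=10):
    """Choose the action whose policy-guided rollouts score best on average."""
    best_action, best_value = None, float("-inf")
    for action in env.legal_actions():
        total = 0.0
        for _ in range(n_rollouts):
            sim = env.clone()          # simulate without touching the real env
            obs, ret, done = sim.step(action)
            for _ in range(horizon):
                if done:
                    break
                obs, r, done = sim.step(policy(obs))  # offline policy as rollout prior
                ret += r
            total += ret
        if total / n_rollouts > best_value:
            best_action, best_value = action, total / n_rollouts
    return best_action
```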
This list is automatically generated from the titles and abstracts of the papers on this site.