Combinatorial Pure Exploration with Full-bandit Feedback and Beyond:
Solving Combinatorial Optimization under Uncertainty with Limited Observation
- URL: http://arxiv.org/abs/2012.15584v2
- Date: Tue, 29 Aug 2023 13:35:42 GMT
- Title: Combinatorial Pure Exploration with Full-bandit Feedback and Beyond:
Solving Combinatorial Optimization under Uncertainty with Limited Observation
- Authors: Yuko Kuroki, Junya Honda, Masashi Sugiyama
- Abstract summary: When developing an algorithm for optimization, it is commonly assumed that parameters such as edge weights are exactly known as inputs.
In this article, we review recently proposed techniques for pure exploration problems with limited feedback.
- Score: 70.41056265629815
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Combinatorial optimization is one of the fundamental research fields that has
been extensively studied in theoretical computer science and operations
research. When developing an algorithm for combinatorial optimization, it is
commonly assumed that parameters such as edge weights are exactly known as
inputs. However, this assumption may not be fulfilled since input parameters
are often uncertain or initially unknown in many applications such as
recommender systems, crowdsourcing, communication networks, and online
advertisement. To resolve such uncertainty, the problem of combinatorial pure
exploration of multi-armed bandits (CPE) and its variants have received
increasing attention. Earlier work on CPE has studied the semi-bandit feedback
setting or assumed that the outcome from each individual edge is always accessible at
all rounds. However, due to practical constraints such as a budget ceiling or
privacy concerns, such strong feedback is not always available in recent
applications. In this article, we review recently proposed techniques for
combinatorial pure exploration problems with limited feedback.
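
To make the feedback model concrete, the following is a minimal sketch of combinatorial pure exploration with full-bandit feedback: the learner pulls a super-arm (a subset of base arms such as edges) and observes only the sum of their rewards, then recovers the individual means by least squares. The problem sizes, noise level, and uniform random exploration are illustrative assumptions, not an algorithm from this article.

```python
# Minimal sketch of full-bandit feedback in combinatorial pure exploration (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_arms, k, budget = 8, 3, 2000          # base arms (e.g. edges), super-arm size, sampling budget
mu = rng.uniform(0, 1, size=n_arms)     # unknown means of the base arms

X, y = [], []
for _ in range(budget):
    s = rng.choice(n_arms, size=k, replace=False)      # pull a random size-k super-arm
    indicator = np.zeros(n_arms)
    indicator[s] = 1.0
    reward_sum = mu[s].sum() + rng.normal(scale=0.5)   # full-bandit feedback: only the sum is observed
    X.append(indicator)
    y.append(reward_sum)

# Recover the individual means from sum-only observations by least squares.
mu_hat, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
best_est = np.argsort(mu_hat)[-k:]       # estimated best size-k super-arm (top-k instance)
best_true = np.argsort(mu)[-k:]
print(sorted(best_est), sorted(best_true))
```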
Related papers
- Time-Varying Gaussian Process Bandits with Unknown Prior [18.93478528448966]
PE-GP-UCB is capable of solving time-varying Bayesian optimisation problems.
It relies on the fact that the observed function values are either consistent with some of the candidate priors or inconsistent with all of them.
arXiv Detail & Related papers (2024-02-02T18:52:16Z)
- Combinatorial Optimization with Policy Adaptation using Latent Space Search [44.12073954093942]
We present a novel approach for designing performant algorithms to solve complex, typically NP-hard, problems.
We show that our search strategy outperforms state-of-the-art approaches on 11 standard benchmarking tasks.
arXiv Detail & Related papers (2023-11-13T12:24:54Z)
- Optimizing Solution-Samplers for Combinatorial Problems: The Landscape of Policy-Gradient Methods [52.0617030129699]
We introduce a novel theoretical framework for analyzing the effectiveness of DeepMatching Networks and Reinforcement Learning methods.
Our main contribution holds for a broad class of problems including Max- and Min-Cut, Max-$k$-Bipartite-Bi, Maximum-Weight-Bipartite-Bi, and the Traveling Salesman Problem.
As a byproduct of our analysis we introduce a novel regularization process over vanilla descent and provide theoretical and experimental evidence that it helps address vanishing-gradient issues and escape bad stationary points.
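
As an illustration of a policy-gradient solution sampler for such problems, here is a minimal sketch of REINFORCE on a product-of-Bernoullis sampler for Max-Cut with a simple entropy bonus; the entropy term stands in for a generic regularizer and is not necessarily the regularization process proposed in the paper.

```python
# Minimal REINFORCE sketch for Max-Cut with an entropy bonus (illustrative regularizer).
import numpy as np

rng = np.random.default_rng(3)
n = 10
W = rng.random((n, n))
W = np.triu(W, 1)
W = W + W.T                                  # random weighted graph on n nodes

def cut_value(x):                            # x in {0,1}^n: weight of the induced cut
    return np.sum(W * (x[:, None] != x[None, :])) / 2.0

theta = np.zeros(n)                          # logits of independent Bernoulli marginals
lr, beta, batch = 0.1, 0.01, 64
for _ in range(300):
    p = np.clip(1.0 / (1.0 + np.exp(-theta)), 1e-6, 1.0 - 1e-6)
    xs = (rng.random((batch, n)) < p).astype(float)
    rewards = np.array([cut_value(x) for x in xs])
    baseline = rewards.mean()
    # REINFORCE: gradient of the Bernoulli log-likelihood w.r.t. its logit is (x - p)
    grad = ((rewards - baseline)[:, None] * (xs - p)).mean(axis=0)
    entropy_grad = p * (1 - p) * (np.log(1 - p) - np.log(p))   # d/d theta of the Bernoulli entropy
    theta += lr * (grad + beta * entropy_grad)

best = (1.0 / (1.0 + np.exp(-theta)) > 0.5).astype(float)
print(cut_value(best))
```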
arXiv Detail & Related papers (2023-10-08T23:39:38Z)
- Branch & Learn with Post-hoc Correction for Predict+Optimize with Unknown Parameters in Constraints [5.762370982168012]
Post-hoc Regret is a loss function that takes into account the cost of correcting an unsatisfiable prediction.
We show how to compute Post-hoc Regret exactly for any optimization problem solvable by a recursion algorithm satisfying simple conditions.
arXiv Detail & Related papers (2023-03-12T16:23:58Z)
- Generative Slate Recommendation with Reinforcement Learning [49.75985313698214]
Reinforcement learning (RL) algorithms can be used to optimize user engagement in recommender systems.
However, RL approaches are intractable in the slate recommendation scenario.
In that setting, an action corresponds to a slate that may contain any combination of items.
In this work we propose to encode slates in a continuous, low-dimensional latent space learned by a variational auto-encoder.
We are able to (i) relax assumptions required by previous work, and (ii) improve the quality of the action selection by modeling full slates.
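
A minimal sketch (PyTorch) of the general idea of encoding slates in a continuous, low-dimensional latent space with a variational auto-encoder; the architecture, sizes, and loss below are illustrative assumptions rather than the paper's model. An agent could then select actions in the latent space and decode them to slates via a per-position argmax.

```python
# Minimal slate-VAE sketch (illustrative assumptions: toy catalogue size, slate size, dimensions).
import torch
import torch.nn as nn
import torch.nn.functional as F

N_ITEMS, SLATE_K, LATENT_D = 100, 5, 8   # hypothetical catalogue size, slate length, latent dim

class SlateVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(N_ITEMS, 16)
        self.enc = nn.Linear(SLATE_K * 16, 2 * LATENT_D)   # outputs mean and log-variance
        self.dec = nn.Linear(LATENT_D, SLATE_K * N_ITEMS)  # item logits per slate position

    def forward(self, slate):                               # slate: (batch, SLATE_K) item ids
        h = self.embed(slate).flatten(1)
        mu, logvar = self.enc(h).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterisation trick
        logits = self.dec(z).view(-1, SLATE_K, N_ITEMS)
        return logits, mu, logvar

def vae_loss(logits, slate, mu, logvar):
    recon = F.cross_entropy(logits.flatten(0, 1), slate.flatten())
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

slates = torch.randint(0, N_ITEMS, (32, SLATE_K))   # a toy batch of logged slates
model = SlateVAE()
logits, mu, logvar = model(slates)
print(vae_loss(logits, slates, mu, logvar))
```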
arXiv Detail & Related papers (2023-01-20T15:28:09Z)
- Pessimistic Off-Policy Optimization for Learning to Rank [13.733459243449634]
Off-policy learning is a framework for optimizing policies without deploying them.
In recommender systems, this is especially challenging due to the imbalance in logged data.
We study pessimistic off-policy optimization for learning to rank.
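
A minimal sketch of the pessimistic principle in off-policy evaluation: score a candidate policy by a lower confidence bound on a clipped inverse-propensity estimate of its value. The data layout, clipping, and Hoeffding-style penalty are illustrative choices, not the paper's estimator.

```python
# Minimal pessimistic IPS sketch (illustrative data layout, clipping, and confidence penalty).
import numpy as np

def pessimistic_value(clicks, logging_prop, target_prop, delta=0.05, clip=10.0):
    """Lower confidence bound on the target policy's expected click value."""
    w = np.minimum(target_prop / logging_prop, clip)   # clipped importance weights
    samples = w * clicks                                # per-impression IPS samples in [0, clip]
    mean = samples.mean()
    penalty = clip * np.sqrt(np.log(1.0 / delta) / (2.0 * len(samples)))  # Hoeffding-style bound
    return mean - penalty

# Prefer the candidate policy whose pessimistic estimate is highest.
rng = np.random.default_rng(0)
clicks = rng.integers(0, 2, size=1000).astype(float)      # logged clicks (toy data)
p_log = rng.uniform(0.1, 0.9, size=1000)                  # logging propensities
p_new = rng.uniform(0.1, 0.9, size=1000)                  # target policy propensities
print(pessimistic_value(clicks, p_log, p_new))
```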
arXiv Detail & Related papers (2022-06-06T12:58:28Z)
- Incentivizing Combinatorial Bandit Exploration [87.08827496301839]
Consider a bandit algorithm that recommends actions to self-interested users in a recommendation system.
Users are free to choose other actions and need to be incentivized to follow the algorithm's recommendations.
While the users prefer to exploit, the algorithm can incentivize them to explore by leveraging the information collected from the previous users.
arXiv Detail & Related papers (2022-06-01T13:46:25Z)
- Parallelizing Contextual Linear Bandits [82.65675585004448]
We present a family of (parallel) contextual linear bandit algorithms, whose regret is nearly identical to their perfectly sequential counterparts.
We also present an empirical evaluation of these parallel algorithms in several domains, including materials discovery and biological sequence design problems.
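
A minimal sketch of the parallel setting with a batched LinUCB-style rule: several workers choose actions using the same stale statistics, and the ridge-regression updates are applied only at batch boundaries. The dimensions and exploration bonus are illustrative; this is not the paper's algorithm.

```python
# Minimal batched LinUCB sketch (illustrative sizes and exploration bonus).
import numpy as np

d, K, P, T, alpha = 5, 10, 4, 50, 1.0           # feature dim, arms, parallel workers, batches, bonus
rng = np.random.default_rng(1)
theta_true = rng.normal(size=d)                  # hypothetical unknown parameter
A, b = np.eye(d), np.zeros(d)                    # ridge-regression statistics

for _ in range(T):
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b
    batch = []
    for _ in range(P):                           # every worker acts on the same stale statistics
        x = rng.normal(size=(K, d))              # contexts for this worker's K arms
        ucb = x @ theta_hat + alpha * np.sqrt(np.einsum('ki,ij,kj->k', x, A_inv, x))
        a = int(np.argmax(ucb))
        r = x[a] @ theta_true + 0.1 * rng.normal()
        batch.append((x[a], r))
    for x_a, r in batch:                         # updates are applied only at the batch boundary
        A += np.outer(x_a, x_a)
        b += r * x_a

print(np.linalg.solve(A, b), theta_true)         # final estimate vs. the true parameter
```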
arXiv Detail & Related papers (2021-05-21T22:22:02Z)
- Constraint-Handling Techniques for Particle Swarm Optimization Algorithms [0.0]
Population-based methods can cope with a variety of different problems, including problems of remarkably higher complexity than those traditional methods can handle.
The aim here is to develop and compare different constraint-handling techniques (CHTs) suitable for particle swarm optimization (PSO) algorithms, which are incorporated into an algorithm with general-purpose settings.
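
As one classical CHT, the sketch below adds a static penalty for constraint violation to a basic PSO with general-purpose settings; the toy objective, constraint, and penalty weight are illustrative and not one of the specific techniques compared in the paper.

```python
# Minimal PSO sketch with a static-penalty constraint-handling technique (illustrative only).
import numpy as np

def objective(x):                 # toy problem: minimize the sum of squares
    return np.sum(x**2, axis=1)

def constraint_violation(x):      # toy constraint: x0 + x1 >= 1
    return np.maximum(0.0, 1.0 - (x[:, 0] + x[:, 1]))

def penalized(x, rho=100.0):      # static penalty added to the objective
    return objective(x) + rho * constraint_violation(x)

rng = np.random.default_rng(2)
n, d, iters = 30, 2, 200
w, c1, c2 = 0.7, 1.5, 1.5                        # general-purpose PSO settings
pos = rng.uniform(-5, 5, size=(n, d))
vel = np.zeros((n, d))
pbest, pbest_val = pos.copy(), penalized(pos)
gbest = pbest[np.argmin(pbest_val)]

for _ in range(iters):
    r1, r2 = rng.random((n, d)), rng.random((n, d))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    val = penalized(pos)
    better = val < pbest_val
    pbest[better], pbest_val[better] = pos[better], val[better]
    gbest = pbest[np.argmin(pbest_val)]

print(gbest, objective(gbest[None])[0])
```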
arXiv Detail & Related papers (2021-01-25T01:49:10Z)
- Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration [143.43658264904863]
We show how, under a more standard notion of low inherent Bellman error, typically employed in least-squares value-iteration-style algorithms, one can provide strong PAC guarantees on learning a near-optimal value function.
We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near-optimal policy for any (linear) reward function.
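
A minimal sketch of least-squares value iteration with linear (here one-hot) features on a toy finite-horizon MDP, where data are collected reward-agnostically by a uniform policy and the reward is plugged in only at planning time; the MDP, features, and exploration scheme are illustrative assumptions, not the paper's algorithm or guarantees.

```python
# Minimal reward-free LSVI sketch (illustrative toy MDP, one-hot features, uniform exploration).
import numpy as np

rng = np.random.default_rng(4)
S, A, H, N, lam = 5, 2, 4, 3000, 1.0        # states, actions, horizon, samples, ridge parameter
P = rng.dirichlet(np.ones(S), size=(S, A))  # unknown transition kernel (hypothetical MDP)
phi = np.eye(S * A)                          # one-hot features over (state, action) pairs

# 1) Reward-agnostic data collection with a uniform exploratory policy.
data = []
for _ in range(N):
    s, a = rng.integers(S), rng.integers(A)
    s_next = rng.choice(S, p=P[s, a])
    data.append((s, a, s_next))
X = np.stack([phi[s * A + a] for s, a, _ in data])
gram = X.T @ X + lam * np.eye(S * A)

# 2) Least-squares value iteration once a (linear) reward is revealed.
r = rng.uniform(0, 1, size=(S, A))
Q = np.zeros((H + 1, S, A))
for h in reversed(range(H)):
    y = np.array([r[s, a] + Q[h + 1, s_next].max() for s, a, s_next in data])
    w = np.linalg.solve(gram, X.T @ y)
    Q[h] = w.reshape(S, A)

print(Q[0].argmax(axis=1))   # greedy first-step policy for the supplied reward
```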
arXiv Detail & Related papers (2020-08-18T04:34:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.