A Fast Convergence Theory for Offline Decision Making
- URL: http://arxiv.org/abs/2406.01378v2
- Date: Tue, 03 Dec 2024 18:32:15 GMT
- Title: A Fast Convergence Theory for Offline Decision Making
- Authors: Chenjie Mao, Qiaosheng Zhang
- Abstract summary: This paper proposes the first generic fast convergence result in general function approximation for offline decision making problems. To unify different settings, we introduce a framework called Decision Making with Offline Feedback (DMOF), which captures a wide range of offline decision making problems. Within this framework, we propose a simple yet powerful algorithm called Empirical Decision with Divergence (EDD), whose upper bound is given by a coefficient we call the Empirical Offline Estimation Coefficient (EOEC).
- Score: 0.1227734309612871
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper proposes the first generic fast convergence result in general function approximation for offline decision making problems, which include offline reinforcement learning (RL) and off-policy evaluation (OPE) as special cases. To unify different settings, we introduce a framework called Decision Making with Offline Feedback (DMOF), which captures a wide range of offline decision making problems. Within this framework, we propose a simple yet powerful algorithm called Empirical Decision with Divergence (EDD), whose upper bound is given by a coefficient we call the Empirical Offline Estimation Coefficient (EOEC). We show that the EOEC is instance-dependent and in effect measures the correlation in the problem. Under a partial-coverage assumption on the dataset, the EOEC decreases at a rate of $1/N$, where $N$ is the size of the dataset, endowing EDD with a fast convergence guarantee. Finally, we complement the above results with a lower bound in the DMOF framework, which further demonstrates the soundness of our theory.
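A schematic reading of this claim (an illustrative sketch, not the paper's exact statement; the coverage-dependent constant $C$ below is a placeholder): because the upper bound for EDD is the EOEC itself rather than its square root, partial coverage gives
$$\mathrm{SubOpt}(\mathrm{EDD}) \;\lesssim\; \mathrm{EOEC}(N), \qquad \mathrm{EOEC}(N) = O\!\left(\frac{C}{N}\right) \;\Longrightarrow\; \mathrm{SubOpt}(\mathrm{EDD}) = O\!\left(\frac{1}{N}\right),$$
in contrast to the slow rate $O(1/\sqrt{N})$ that typically arises when a bound scales as the square root of a coverage-type coefficient.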
Related papers
- Supervised Optimism Correction: Be Confident When LLMs Are Sure [91.7459076316849]
We establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning.
We show that the widely used beam search method suffers from unacceptable over-optimism.
We propose Supervised Optimism Correction, which introduces a simple yet effective auxiliary loss for token-level $Q$-value estimations.
arXiv Detail & Related papers (2025-04-10T07:50:03Z) - Offline Learning for Combinatorial Multi-armed Bandits [56.96242764723241]
Off-CMAB is the first offline learning framework for CMAB.
Off-CMAB combines pessimistic reward estimations with combinatorial solvers.
Experiments on synthetic and real-world datasets highlight the superior performance of CLCB.
arXiv Detail & Related papers (2025-01-31T16:56:18Z) - Beyond Non-Degeneracy: Revisiting Certainty Equivalent Heuristic for Online Linear Programming [18.371947752008744]
We show that Certainty Equivalent achieves uniformly near-optimal regret under mild assumptions on the underlying distribution.
Our result implies that, contrary to prior belief, CE effectively beats the curse of degeneracy for a wide range of problem instances.
These techniques may find potential applications in broader online decision-making contexts.
arXiv Detail & Related papers (2025-01-03T09:21:27Z) - Performative Reinforcement Learning with Linear Markov Decision Process [14.75815792682734]
We study the setting of performative reinforcement learning, where the deployed policy affects both the reward and the transition of the underlying Markov decision process.
We generalize the results to linear Markov decision processes, which are the primary theoretical model of large-scale MDPs.
arXiv Detail & Related papers (2024-11-07T23:04:48Z) - Offline Behavior Distillation [57.6900189406964]
Massive reinforcement learning (RL) data are typically collected to train policies offline without the need for interactions.
We formulate offline behavior distillation (OBD), which synthesizes limited expert behavioral data from sub-optimal RL data.
We propose two naive OBD objectives, DBC and PBC, which measure distillation performance via the decision difference between policies trained on distilled data and either offline data or a near-expert policy.
arXiv Detail & Related papers (2024-10-30T06:28:09Z) - Is Offline Decision Making Possible with Only Few Samples? Reliable Decisions in Data-Starved Bandits via Trust Region Enhancement [25.68354404229254]
We show that even in a data-starved setting it may still be possible to find a policy competitive with the optimal one.
This paves the way to reliable decision-making in settings where critical decisions must be made by relying only on a handful of samples.
arXiv Detail & Related papers (2024-02-24T03:41:09Z) - Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL.
We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving property of Q-network at training.
For the first time, our theory can reliably determine at an early stage whether training will diverge.
arXiv Detail & Related papers (2023-10-06T17:57:44Z) - Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage [100.8180383245813]
We propose value-based algorithms for offline reinforcement learning (RL).
We show an analogous result for vanilla Q-functions under a soft margin condition.
Our algorithms' loss functions arise from casting the estimation problems as nonlinear convex optimization problems and Lagrangifying.
arXiv Detail & Related papers (2023-02-05T14:22:41Z) - Offline Reinforcement Learning with Instrumental Variables in Confounded Markov Decision Processes [93.61202366677526]
We study the offline reinforcement learning (RL) in the face of unmeasured confounders.
We propose various policy learning methods with the finite-sample suboptimality guarantee of finding the optimal in-class policy.
arXiv Detail & Related papers (2022-09-18T22:03:55Z) - Distributionally Robust Model-Based Offline Reinforcement Learning with Near-Optimal Sample Complexity [39.886149789339335]
Offline reinforcement learning aims to learn to perform decision making from history data without active exploration.
Due to uncertainties and variabilities of the environment, it is critical to learn a robust policy that performs well even when the deployed environment deviates from the nominal one used to collect the history dataset.
We consider a distributionally robust formulation of offline RL, focusing on robust Markov decision processes with an uncertainty set specified by the Kullback-Leibler divergence in both finite-horizon and infinite-horizon settings.
arXiv Detail & Related papers (2022-08-11T11:55:31Z) - Offline Stochastic Shortest Path: Learning, Evaluation and Towards Optimality [57.91411772725183]
In this paper, we consider the offline stochastic shortest path problem when the state space and the action space are finite.
We design simple value-based algorithms for tackling both offline policy evaluation (OPE) and offline policy learning tasks.
Our analysis of these simple algorithms yields strong instance-dependent bounds which can imply worst-case bounds that are near-minimax optimal.
arXiv Detail & Related papers (2022-06-10T07:44:56Z) - Reinforcement Learning with a Terminator [80.34572413850186]
We learn the parameters of the TerMDP and leverage the structure of the estimation problem to provide state-wise confidence bounds.
We use these to construct a provably-efficient algorithm, which accounts for termination, and bound its regret.
arXiv Detail & Related papers (2022-05-30T18:40:28Z) - False Correlation Reduction for Offline Reinforcement Learning [115.11954432080749]
We propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm.
We empirically show that SCORE achieves SoTA performance with 3.1x acceleration on various tasks in a standard benchmark (D4RL).
arXiv Detail & Related papers (2021-10-24T15:34:03Z) - Regret Analysis in Deterministic Reinforcement Learning [78.31410227443102]
We study the problem of regret minimization, which is central to the analysis and design of optimal learning algorithms.
We present logarithmic problem-specific regret lower bounds that explicitly depend on the system parameter.
arXiv Detail & Related papers (2021-06-27T23:41:57Z) - LSDAT: Low-Rank and Sparse Decomposition for Decision-based Adversarial Attack [74.5144793386864]
LSDAT crafts perturbations in the low-dimensional subspace formed by the sparse component of the input sample and that of an adversarial sample.
LSD works directly in the image pixel domain to guarantee that non-$\ell$ constraints, such as sparsity, are satisfied.
arXiv Detail & Related papers (2021-03-19T13:10:47Z) - A new interpretable unsupervised anomaly detection method based on residual explanation [47.187609203210705]
We present RXP, a new interpretability method that addresses the limitations of autoencoder-based anomaly detection (AE-based AD) in large-scale systems.
It stands out for its implementation simplicity, low computational cost and deterministic behavior.
In an experiment using data from a real heavy-haul railway line, the proposed method achieved superior performance compared to SHAP.
arXiv Detail & Related papers (2021-03-14T15:35:45Z) - Sequential- and Parallel- Constrained Max-value Entropy Search via Information Lower Bound [9.09466320810472]
We focus on an information-theoretic approach called Max-value Entropy Search (MES).
We propose a novel constrained BO method called Constrained Max-value Entropy Search via Information lower BOund (CMES-IBO).
arXiv Detail & Related papers (2021-02-19T08:10:51Z)