$\mathbf{(N,K)}$-Puzzle: A Cost-Efficient Testbed for Benchmarking
Reinforcement Learning Algorithms in Generative Language Model
- URL: http://arxiv.org/abs/2403.07191v1
- Date: Mon, 11 Mar 2024 22:24:14 GMT
- Title: $\mathbf{(N,K)}$-Puzzle: A Cost-Efficient Testbed for Benchmarking
Reinforcement Learning Algorithms in Generative Language Model
- Authors: Yufeng Zhang, Liyu Chen, Boyi Liu, Yingxiang Yang, Qiwen Cui, Yunzhe
Tao, Hongxia Yang
- Abstract summary: We present a generalized version of the 24-Puzzle: the $(N,K)$-Puzzle, which challenges language models to reach a target value $K$ with $N$ integers.
We evaluate the effectiveness of established RL algorithms such as Proximal Policy Optimization (PPO), alongside novel approaches like Identity Policy Optimization (IPO) and Direct Policy Optimization (DPO).
- Score: 50.636423457653066
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Recent advances in reinforcement learning (RL) algorithms aim to enhance the
performance of language models at scale. Yet, there is a noticeable absence of
a cost-effective and standardized testbed tailored to evaluating and comparing
these algorithms. To bridge this gap, we present a generalized version of the
24-Puzzle: the $(N,K)$-Puzzle, which challenges language models to reach a
target value $K$ with $N$ integers. We evaluate the effectiveness of
established RL algorithms such as Proximal Policy Optimization (PPO), alongside
novel approaches like Identity Policy Optimization (IPO) and Direct Policy
Optimization (DPO).
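As a concrete illustration of the task, here is a minimal Python sketch of an $(N,K)$-Puzzle instance and a rule-based solution check, assuming (as in the classic 24-Puzzle, where $N=4$ and $K=24$) that a solution is an arithmetic expression using each of the $N$ integers exactly once and evaluating to $K$. The helper names are illustrative, not the paper's evaluation harness.

```python
# Illustrative (N, K)-Puzzle helpers; not the paper's evaluation code.
# Assumption: a solution is an arithmetic expression that uses each of the
# N given integers exactly once and evaluates to the target value K.
import random
import re


def make_instance(n: int, low: int = 1, high: int = 13) -> list[int]:
    """Sample the N integers of a puzzle instance."""
    return [random.randint(low, high) for _ in range(n)]


def check_solution(numbers: list[int], k: int, expression: str) -> bool:
    """Reward-style check: the expression must use each number exactly once and equal K."""
    used = sorted(int(tok) for tok in re.findall(r"\d+", expression))
    if used != sorted(numbers):
        return False
    try:
        value = eval(expression, {"__builtins__": {}}, {})  # toy checker only
    except (SyntaxError, ZeroDivisionError, NameError, TypeError):
        return False
    return abs(value - k) < 1e-9


if __name__ == "__main__":
    print(make_instance(n=4))  # the classic 24-Puzzle corresponds to N=4, K=24
    print(check_solution([4, 6, 2, 2], 24, "(4 + 2) * (6 - 2)"))  # True
```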
Related papers
- $f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization [91.43730624072226]
$f$-PO is a novel framework that generalizes and extends existing approaches.
We conduct experiments on state-of-the-art language models using benchmark datasets.
arXiv Detail & Related papers (2024-10-29T02:11:45Z)
- e-COP: Episodic Constrained Optimization of Policies [12.854752753529151]
We present the first policy optimization algorithm for constrained Reinforcement Learning (RL) in episodic (finite horizon) settings.
We show that our algorithm has similar or better performance than SoTA (non-episodic) algorithms adapted for the episodic setting.
arXiv Detail & Related papers (2024-06-13T20:12:09Z)
- Stochastic Q-learning for Large Discrete Action Spaces [79.1700188160944]
In complex environments with discrete action spaces, effective decision-making is critical in reinforcement learning (RL).
We present value-based RL approaches which, as opposed to optimizing over the entire set of $n$ actions, only consider a variable set of actions, possibly as small as $\mathcal{O}(\log(n))$.
The presented value-based RL methods include, among others, Stochastic Q-learning, StochDQN, and StochDDQN, all of which integrate this approach for both value-function updates and action selection.
arXiv Detail & Related papers (2024-05-16T17:58:44Z)
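Below is a minimal tabular sketch of the stochastic-maximization idea summarized above: both action selection and the bootstrap target take a max over a random candidate set of roughly $\mathcal{O}(\log(n))$ actions rather than all $n$. The class is illustrative; the paper's exact subset construction and its StochDQN/StochDDQN variants differ in detail.

```python
# Tabular sketch of Q-learning with stochastic maximization over a large
# discrete action space. Illustrative only, not the paper's algorithm.
import math
import random
from collections import defaultdict


class StochasticQLearner:
    def __init__(self, n_actions: int, alpha: float = 0.1, gamma: float = 0.99):
        self.n_actions = n_actions
        self.alpha, self.gamma = alpha, gamma
        self.q = defaultdict(float)                    # (state, action) -> value
        self.subset_size = max(1, math.ceil(math.log(n_actions)))

    def _candidates(self, extra: int | None = None) -> list[int]:
        """Random subset of about O(log n) actions, optionally keeping one known action."""
        cands = random.sample(range(self.n_actions), self.subset_size)
        if extra is not None:
            cands.append(extra)
        return cands

    def act(self, state, prev_action: int | None = None, eps: float = 0.1) -> int:
        if random.random() < eps:
            return random.randrange(self.n_actions)
        return max(self._candidates(prev_action), key=lambda a: self.q[(state, a)])

    def update(self, s, a, r, s_next) -> None:
        # Stochastic max in the bootstrap target instead of a full scan over n actions.
        target = r + self.gamma * max(self.q[(s_next, b)] for b in self._candidates())
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])
```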
- Improving Sample Efficiency of Model-Free Algorithms for Zero-Sum Markov Games [66.2085181793014]
We show that a model-free stage-based Q-learning algorithm can enjoy the same optimality in the $H$ dependence as model-based algorithms.
Our algorithm features a key novel design of updating the reference value functions as the pair of optimistic and pessimistic value functions.
arXiv Detail & Related papers (2023-08-17T08:34:58Z)
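For intuition, here is a generic sketch of maintaining a pair of optimistic and pessimistic tabular Q estimates with Hoeffding-style exploration bonuses, the standard building block that the summary above refers to; it does not reproduce the paper's stage-based algorithm or its reference-value update rule.

```python
# Generic optimistic/pessimistic tabular Q-update with a visit-count bonus.
# Not the paper's algorithm; a common pattern in model-free finite-horizon RL.
import math


def update_bounds(q_up, q_lo, counts, s, a, r, v_up_next, v_lo_next, H, c=1.0):
    counts[(s, a)] = counts.get((s, a), 0) + 1
    t = counts[(s, a)]
    lr = (H + 1) / (H + t)                 # step size typical of UCB-style Q-learning
    bonus = c * math.sqrt(H ** 3 / t)      # exploration bonus shrinking with visits
    q_up[(s, a)] = (1 - lr) * q_up.get((s, a), H) + lr * (r + v_up_next + bonus)
    q_lo[(s, a)] = (1 - lr) * q_lo.get((s, a), 0.0) + lr * (r + v_lo_next - bonus)
```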
- Improved Algorithms for Neural Active Learning [74.89097665112621]
We improve the theoretical and empirical performance of neural-network (NN)-based active learning algorithms for the non-parametric streaming setting.
We introduce two regret metrics based on minimizing the population loss, which are more suitable for active learning than the metric used in state-of-the-art (SOTA) related work.
arXiv Detail & Related papers (2022-10-02T05:03:38Z)
- An Empirical Study of Derivative-Free-Optimization Algorithms for Targeted Black-Box Attacks in Deep Neural Networks [8.368543987898732]
This paper considers four pre-existing state-of-the-art DFO-based algorithms and introduces a new algorithm built on BOBYQA.
We compare these algorithms in a variety of settings according to the fraction of images that they successfully misclassify.
Experiments disclose how the likelihood of finding an adversarial example depends on both the algorithm used and the setting of the attack.
arXiv Detail & Related papers (2020-12-03T13:32:20Z)
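As a toy sketch, a targeted black-box attack can be posed as a derivative-free optimization problem: minimize a margin loss over a bounded perturbation using only model queries. SciPy's Powell method stands in for BOBYQA here, and `query_logits` is a hypothetical callable for the attacked model; the paper's BOBYQA-based algorithm and comparison protocol are more involved.

```python
# Toy targeted black-box attack via derivative-free optimization (no gradients).
# Powell is a stand-in for BOBYQA; query_logits is a user-supplied black box.
import numpy as np
from scipy.optimize import minimize


def targeted_attack(query_logits, x, target_class, eps=0.1, max_queries=500):
    def loss(delta_flat):
        delta = delta_flat.reshape(x.shape)
        logits = query_logits(np.clip(x + delta, 0.0, 1.0))
        other = np.delete(logits, target_class)
        # Negative once the target class outscores every other class.
        return float(other.max() - logits[target_class])

    res = minimize(loss, np.zeros(x.size), method="Powell",
                   bounds=[(-eps, eps)] * x.size,
                   options={"maxfev": max_queries})
    adv = np.clip(x + res.x.reshape(x.shape), 0.0, 1.0)
    return adv, res.fun < 0  # adversarial input and success flag
```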
- Provably Efficient Exploration in Policy Optimization [117.09887790160406]
This paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO).
OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T})$ regret.
To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
arXiv Detail & Related papers (2019-12-12T08:40:02Z)