$\mathbf{(N,K)}$-Puzzle: A Cost-Efficient Testbed for Benchmarking
Reinforcement Learning Algorithms in Generative Language Model
- URL: http://arxiv.org/abs/2403.07191v1
- Date: Mon, 11 Mar 2024 22:24:14 GMT
- Title: $\mathbf{(N,K)}$-Puzzle: A Cost-Efficient Testbed for Benchmarking
Reinforcement Learning Algorithms in Generative Language Model
- Authors: Yufeng Zhang, Liyu Chen, Boyi Liu, Yingxiang Yang, Qiwen Cui, Yunzhe
Tao, Hongxia Yang
- Abstract summary: We present a generalized version of the 24-Puzzle: the $(N,K)$-Puzzle, which challenges language models to reach a target value $K$ with $N$ integers.
We evaluate the effectiveness of established RL algorithms such as Proximal Policy Optimization (PPO), alongside novel approaches like Identity Policy Optimization (IPO) and Direct Policy Optimization (DPO).
- Score: 50.636423457653066
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Recent advances in reinforcement learning (RL) algorithms aim to enhance the
performance of language models at scale. Yet, there is a noticeable absence of
a cost-effective and standardized testbed tailored to evaluating and comparing
these algorithms. To bridge this gap, we present a generalized version of the
24-Puzzle: the $(N,K)$-Puzzle, which challenges language models to reach a
target value $K$ with $N$ integers. We evaluate the effectiveness of
established RL algorithms such as Proximal Policy Optimization (PPO), alongside
novel approaches like Identity Policy Optimization (IPO) and Direct Policy
Optimization (DPO).
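As a concrete illustration of the task, here is a minimal Python sketch of an $(N,K)$-Puzzle instance and a rule-based solution check, assuming (as in the classic 24-Puzzle, where $N=4$ and $K=24$) that a solution is an arithmetic expression using each of the $N$ integers exactly once and evaluating to $K$. The helper names are illustrative, not the paper's evaluation harness.

```python
# Illustrative (N, K)-Puzzle helpers; not the paper's evaluation code.
# Assumption: a solution is an arithmetic expression that uses each of the
# N given integers exactly once and evaluates to the target value K.
import random
import re


def make_instance(n: int, low: int = 1, high: int = 13) -> list[int]:
    """Sample the N integers of a puzzle instance."""
    return [random.randint(low, high) for _ in range(n)]


def check_solution(numbers: list[int], k: int, expression: str) -> bool:
    """Reward-style check: the expression must use each number exactly once and equal K."""
    used = sorted(int(tok) for tok in re.findall(r"\d+", expression))
    if used != sorted(numbers):
        return False
    try:
        value = eval(expression, {"__builtins__": {}}, {})  # toy checker only
    except (SyntaxError, ZeroDivisionError, NameError, TypeError):
        return False
    return abs(value - k) < 1e-9


if __name__ == "__main__":
    print(make_instance(n=4))  # the classic 24-Puzzle corresponds to N=4, K=24
    print(check_solution([4, 6, 2, 2], 24, "(4 + 2) * (6 - 2)"))  # True
```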
Related papers
- $f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization [91.43730624072226]
$f$-PO is a novel framework that generalizes and extends existing approaches.
We conduct experiments on state-of-the-art language models using benchmark datasets.
arXiv Detail & Related papers (2024-10-29T02:11:45Z)
- e-COP: Episodic Constrained Optimization of Policies [12.854752753529151]
We present the first policy optimization algorithm for constrained Reinforcement Learning (RL) in episodic (finite horizon) settings.
We show that our algorithm has similar or better performance than SoTA (non-episodic) algorithms adapted for the episodic setting.
arXiv Detail & Related papers (2024-06-13T20:12:09Z)
- Stochastic Q-learning for Large Discrete Action Spaces [79.1700188160944]
In complex environments with discrete action spaces, effective decision-making is critical in reinforcement learning (RL).
We present value-based RL approaches which, as opposed to optimizing over the entire set of $n$ actions, only consider a variable set of actions, possibly as small as $\mathcal{O}(\log(n))$.
The presented value-based RL methods include, among others, Stochastic Q-learning, StochDQN, and StochDDQN, all of which integrate this approach for both value-function updates and action selection.
arXiv Detail & Related papers (2024-05-16T17:58:44Z)
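Below is a minimal tabular sketch of the stochastic-maximization idea summarized above: both action selection and the bootstrap target take a max over a random candidate set of roughly $\mathcal{O}(\log(n))$ actions rather than all $n$. The class is illustrative; the paper's exact subset construction and its StochDQN/StochDDQN variants differ in detail.

```python
# Tabular sketch of Q-learning with stochastic maximization over a large
# discrete action space. Illustrative only, not the paper's algorithm.
import math
import random
from collections import defaultdict


class StochasticQLearner:
    def __init__(self, n_actions: int, alpha: float = 0.1, gamma: float = 0.99):
        self.n_actions = n_actions
        self.alpha, self.gamma = alpha, gamma
        self.q = defaultdict(float)                    # (state, action) -> value
        self.subset_size = max(1, math.ceil(math.log(n_actions)))

    def _candidates(self, extra: int | None = None) -> list[int]:
        """Random subset of about O(log n) actions, optionally keeping one known action."""
        cands = random.sample(range(self.n_actions), self.subset_size)
        if extra is not None:
            cands.append(extra)
        return cands

    def act(self, state, prev_action: int | None = None, eps: float = 0.1) -> int:
        if random.random() < eps:
            return random.randrange(self.n_actions)
        return max(self._candidates(prev_action), key=lambda a: self.q[(state, a)])

    def update(self, s, a, r, s_next) -> None:
        # Stochastic max in the bootstrap target instead of a full scan over n actions.
        target = r + self.gamma * max(self.q[(s_next, b)] for b in self._candidates())
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])
```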
- Improving Sample Efficiency of Model-Free Algorithms for Zero-Sum Markov Games [66.2085181793014]
We show that a model-free stage-based Q-learning algorithm can enjoy the same optimality in the $H$ dependence as model-based algorithms.
Our algorithm features a key novel design of updating the reference value functions as the pair of optimistic and pessimistic value functions.
arXiv Detail & Related papers (2023-08-17T08:34:58Z)
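For intuition, here is a generic sketch of maintaining a pair of optimistic and pessimistic tabular Q estimates with Hoeffding-style exploration bonuses, the standard building block that the summary above refers to; it does not reproduce the paper's stage-based algorithm or its reference-value update rule.

```python
# Generic optimistic/pessimistic tabular Q-update with a visit-count bonus.
# Not the paper's algorithm; a common pattern in model-free finite-horizon RL.
import math


def update_bounds(q_up, q_lo, counts, s, a, r, v_up_next, v_lo_next, H, c=1.0):
    counts[(s, a)] = counts.get((s, a), 0) + 1
    t = counts[(s, a)]
    lr = (H + 1) / (H + t)                 # step size typical of UCB-style Q-learning
    bonus = c * math.sqrt(H ** 3 / t)      # exploration bonus shrinking with visits
    q_up[(s, a)] = (1 - lr) * q_up.get((s, a), H) + lr * (r + v_up_next + bonus)
    q_lo[(s, a)] = (1 - lr) * q_lo.get((s, a), 0.0) + lr * (r + v_lo_next - bonus)
```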
- Improved Algorithms for Neural Active Learning [74.89097665112621]
We improve the theoretical and empirical performance of neural-network (NN)-based active learning algorithms for the non-parametric streaming setting.
We introduce two regret metrics based on minimizing the population loss, which are more suitable for active learning than the metric used in state-of-the-art (SOTA) related work.
arXiv Detail & Related papers (2022-10-02T05:03:38Z)
- An Empirical Study of Derivative-Free-Optimization Algorithms for Targeted Black-Box Attacks in Deep Neural Networks [8.368543987898732]
This paper considers four pre-existing state-of-the-art DFO-based algorithms and introduces a new algorithm built on BOBYQA.
We compare these algorithms in a variety of settings according to the fraction of images that they successfully misclassify.
Experiments disclose how the likelihood of finding an adversarial example depends on both the algorithm used and the setting of the attack.
arXiv Detail & Related papers (2020-12-03T13:32:20Z)
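As a toy sketch, a targeted black-box attack can be posed as a derivative-free optimization problem: minimize a margin loss over a bounded perturbation using only model queries. SciPy's Powell method stands in for BOBYQA here, and `query_logits` is a hypothetical callable for the attacked model; the paper's BOBYQA-based algorithm and comparison protocol are more involved.

```python
# Toy targeted black-box attack via derivative-free optimization (no gradients).
# Powell is a stand-in for BOBYQA; query_logits is a user-supplied black box.
import numpy as np
from scipy.optimize import minimize


def targeted_attack(query_logits, x, target_class, eps=0.1, max_queries=500):
    def loss(delta_flat):
        delta = delta_flat.reshape(x.shape)
        logits = query_logits(np.clip(x + delta, 0.0, 1.0))
        other = np.delete(logits, target_class)
        # Negative once the target class outscores every other class.
        return float(other.max() - logits[target_class])

    res = minimize(loss, np.zeros(x.size), method="Powell",
                   bounds=[(-eps, eps)] * x.size,
                   options={"maxfev": max_queries})
    adv = np.clip(x + res.x.reshape(x.shape), 0.0, 1.0)
    return adv, res.fun < 0  # adversarial input and success flag
```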
- Provably Efficient Exploration in Policy Optimization [117.09887790160406]
This paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO).
OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T})$ regret.
To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
arXiv Detail & Related papers (2019-12-12T08:40:02Z)