Reasoning without Regret
- URL: http://arxiv.org/abs/2504.09777v1
- Date: Mon, 14 Apr 2025 00:34:20 GMT
- Title: Reasoning without Regret
- Authors: Tarun Chitra
- Abstract summary: We introduce Backwards Adaptive Reward Shaping (BARS), a no-regret framework that converts sparse outcome-based rewards into effective procedure-based signals.
Our analysis, based on generic chaining, continuous scaling limits, and non-linear Feynman-Kac bounds, connects recent outcome-based methods' empirical successes with the benefits of intermediate supervision.
- Score: 4.07926531936425
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chain-of-thought reasoning enables large language models to solve multi-step tasks by framing problem solving as sequential decision problems. Outcome-based rewards, which provide feedback only on final answers, show impressive success but face challenges with credit assignment and slow convergence. In contrast, procedure-based rewards offer efficient step-level feedback, but typically require costly human supervision. We introduce \emph{Backwards Adaptive Reward Shaping} (BARS), a no-regret framework that converts sparse outcome-based rewards into effective procedure-based signals. BARS uses sparse rewards generated from terminal-state priors and cover trees to scale rewards while preventing exploitation. With Bellman contraction and $(\Delta, \epsilon)$-gap rewards, our backward Euler solver achieves $\epsilon$-accuracy in $O\left((R_{\max}/\Delta)\log(1/\epsilon)\right)$ iterations with $O(\log T)$ dynamic regret over $T$ rounds. Our analysis, based on generic chaining, continuous scaling limits, and non-linear Feynman-Kac bounds, connects recent outcome-based methods' empirical successes with the benefits of intermediate supervision. Together, these results give the first rigorous no-regret algorithm for outcome reward shaping and a theoretical foundation for the empirical success of DeepSeek's R1.
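To make the abstract's core mechanism concrete, the sketch below propagates a sparse terminal reward backwards through a toy chain MDP with ordinary value iteration and then exposes dense per-step signals via potential-based shaping. It is a minimal illustration, not the paper's BARS solver: the cover-tree terminal-state priors, the backward-Euler discretization, and the $(\Delta, \epsilon)$-gap machinery are omitted, and the chain MDP, discount factor, and reward value are assumptions made for the example.

```python
import numpy as np

# Minimal sketch (not the paper's BARS implementation): propagate a sparse
# terminal reward backwards through a small deterministic chain MDP with
# value iteration, then expose dense per-step signals via potential-based
# shaping. The MDP layout, gamma, and reward value are illustrative assumptions.

n_states = 6          # states 0..5; state 5 is terminal
gamma = 0.9           # discount factor
r_terminal = 1.0      # sparse outcome reward, paid only on reaching the terminal state

def step(s):
    """Deterministic toy dynamics: advance one state until the terminal state."""
    return min(s + 1, n_states - 1)

def reward(s, s_next):
    """Outcome-based reward: nonzero only on the transition into the terminal state."""
    return r_terminal if (s_next == n_states - 1 and s != n_states - 1) else 0.0

# Backward value iteration. The Bellman operator is a gamma-contraction, so
# ||V_k - V*|| <= gamma^k * ||V_0 - V*||: eps-accuracy after O(log(1/eps)) sweeps
# (the paper's bound additionally tracks the reward gap Delta and R_max).
V = np.zeros(n_states)
eps = 1e-8
while True:
    V_new = np.array([
        reward(s, step(s)) + gamma * V[step(s)] if s < n_states - 1 else 0.0
        for s in range(n_states)
    ])
    done = np.max(np.abs(V_new - V)) < eps
    V = V_new
    if done:
        break

# Dense procedure-level signal via potential-based shaping with potential V:
# r_shaped(s, s') = r(s, s') + gamma * V[s'] - V[s], which preserves optimal policies.
# With the exact V* these shaped rewards vanish along optimal transitions; with an
# eps-accurate V they give small, informative step-level feedback.
for s in range(n_states - 1):
    s_next = step(s)
    r_shaped = reward(s, s_next) + gamma * V[s_next] - V[s]
    print(f"state {s}: V = {V[s]:.4f}, shaped step reward = {r_shaped:+.4f}")
```

Because the Bellman operator is a $\gamma$-contraction, the number of sweeps needed for $\epsilon$-accuracy grows only logarithmically in $1/\epsilon$, mirroring the $\log(1/\epsilon)$ factor in the abstract's iteration bound.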
Related papers
- Optimistically Optimistic Exploration for Provably Efficient Infinite-Horizon Reinforcement and Imitation Learning [13.429541377715296]
We propose the first computationally efficient algorithm achieving near-optimal regret guarantees in infinite-horizon discounted linear Markov decision processes.
We show that, combined with a regularized approximate dynamic-programming scheme, the resulting algorithm achieves a regret of order $\tilde{\mathcal{O}}\left(\sqrt{d^3 (1 - \gamma)^{-7/2} T}\right)$, where $T$ is the total number of sample transitions, $\gamma \in (0,1)$ is the discount factor, and $d$ is the feature dimensionality.
arXiv Detail & Related papers (2025-02-19T17:32:35Z) - Walking the Values in Bayesian Inverse Reinforcement Learning [66.68997022043075]
A key challenge in Bayesian IRL is bridging the computational gap between the hypothesis space of possible rewards and the likelihood.
We propose ValueWalk - a new Markov chain Monte Carlo method based on this insight.
arXiv Detail & Related papers (2024-07-15T17:59:52Z) - Cost Aware Best Arm Identification [13.380383930882784]
We call it Cost Aware Best Arm Identification (CABAI).
We propose a simple algorithm called Chernoff Overlap (CO), based on a square-root rule.
Our results show that (i) ignoring the heterogeneous action cost results in sub-optimality in practice, and (ii) simple algorithms can deliver near-optimal performance over a wide range of problems.
arXiv Detail & Related papers (2024-02-26T16:27:08Z) - STARC: A General Framework For Quantifying Differences Between Reward Functions [52.69620361363209]
We provide a class of pseudometrics on the space of all reward functions that we call STARC metrics.
We show that STARC metrics induce both an upper and a lower bound on worst-case regret.
We also identify a number of issues with reward metrics proposed by earlier works.
arXiv Detail & Related papers (2023-09-26T20:31:19Z) - Towards Theoretical Understanding of Inverse Reinforcement Learning [45.3190496371625]
Inverse reinforcement learning (IRL) denotes a powerful family of algorithms for recovering a reward function justifying the behavior demonstrated by an expert agent.
In this paper, we make a step towards closing the theory gap of IRL in the case of finite-horizon problems with a generative model.
arXiv Detail & Related papers (2023-04-25T16:21:10Z) - Best of Both Worlds Policy Optimization [33.13041034490332]
We show that by properly designing the regularizer, the exploration bonus and the learning rates, one can achieve a more favorable polylog$(T)$ regret when the losses are adversarial.
This is the first time a gap-dependent polylog$(T)$ regret bound is shown for policy optimization.
arXiv Detail & Related papers (2023-02-18T19:46:11Z) - Provably Efficient Offline Reinforcement Learning with Trajectory-Wise
Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
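The redistribution step described above can be illustrated with a short least-squares sketch: assuming per-step rewards are linear in known features, a ridge regression of trajectory returns on summed step features recovers a parameter that redistributes each return into per-step proxy rewards. This is only an illustration of the idea under a linear-reward assumption; the feature map, dimensions, and simulated data are hypothetical, and PARTED's pessimistic value-iteration stage is not shown.

```python
import numpy as np

# Sketch of least-squares reward redistribution under a linear per-step reward
# assumption: only trajectory-level returns are observed, and a ridge regression
# on summed step features recovers per-step proxy rewards. All data are simulated.

rng = np.random.default_rng(0)
d, H, n_traj = 4, 5, 200            # feature dim, horizon, number of trajectories
theta_true = rng.normal(size=d)     # ground-truth step-reward parameter (simulation only)

Phi_sum = np.zeros((n_traj, d))     # row i = sum of step features along trajectory i
returns = np.zeros(n_traj)          # observed trajectory-wise returns
for i in range(n_traj):
    feats = rng.normal(size=(H, d))                               # phi(s_t, a_t) for t = 1..H
    Phi_sum[i] = feats.sum(axis=0)
    returns[i] = (feats @ theta_true).sum() + 0.1 * rng.normal()  # noisy total return

# Ridge least squares: choose theta so that sum_t phi_t^T theta matches each return.
lam = 1.0
A = Phi_sum.T @ Phi_sum + lam * np.eye(d)
theta_hat = np.linalg.solve(A, Phi_sum.T @ returns)

# Per-step proxy reward for any state-action feature vector phi(s, a):
phi_new = rng.normal(size=d)
print("recovered theta:", np.round(theta_hat, 3))
print("true theta     :", np.round(theta_true, 3))
print("proxy reward for one sample step:", round(float(phi_new @ theta_hat), 3))
```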
arXiv Detail & Related papers (2022-06-13T19:11:22Z) - A Robust Phased Elimination Algorithm for Corruption-Tolerant Gaussian
Process Bandits [118.22458816174144]
We propose a novel robust elimination-type algorithm that runs in epochs, combines exploration with infrequent switching to select a small subset of actions, and plays each action for multiple time instants.
Our algorithm, GP Robust Phased Elimination (RGP-PE), successfully balances robustness to corruptions with exploration and exploitation.
We perform the first empirical study of robustness in the corrupted GP bandit setting, and show that our algorithm is robust against a variety of adversarial attacks.
arXiv Detail & Related papers (2022-02-03T21:19:36Z) - Anti-Concentrated Confidence Bonuses for Scalable Exploration [57.91943847134011]
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off.
We introduce anti-concentrated confidence bounds for efficiently approximating the elliptical bonus.
We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic rewards on Atari benchmarks.
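For context, the quantity being approximated here is the standard elliptical exploration bonus from linear bandits and RL, $\beta\sqrt{\phi^\top A^{-1}\phi}$ with $A$ the regularized Gram matrix of observed features. The sketch below computes this exact bonus on synthetic features; it is not the paper's anti-concentrated approximation scheme, and $d$, $\lambda$, and $\beta$ are illustrative assumptions.

```python
import numpy as np

# Sketch of the exact elliptical bonus that anti-concentrated confidence bounds
# are described as approximating: beta * sqrt(phi^T A^{-1} phi), where A is the
# regularized Gram matrix of previously observed features. This is the standard
# linear-bandit-style bonus, not the paper's approximation; d, lambda, and beta
# are illustrative assumptions.

d, lam, beta = 8, 1.0, 1.0
rng = np.random.default_rng(1)

# Pretend past data only explored the first half of the feature coordinates.
A = lam * np.eye(d)
for _ in range(200):
    phi = np.zeros(d)
    phi[: d // 2] = rng.normal(size=d // 2)
    A += np.outer(phi, phi)

def elliptical_bonus(phi):
    """Exploration bonus: large for feature directions the data has rarely covered."""
    return beta * np.sqrt(phi @ np.linalg.solve(A, phi))

seen_direction = np.eye(d)[0]        # lies in the well-explored subspace
unseen_direction = np.eye(d)[-1]     # lies in the unexplored subspace
print("bonus, well-explored direction:", round(float(elliptical_bonus(seen_direction)), 4))
print("bonus, unexplored direction:   ", round(float(elliptical_bonus(unseen_direction)), 4))
```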
arXiv Detail & Related papers (2021-10-21T15:25:15Z) - Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free
Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
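The "early-settled reference" idea can be sketched in a simplified tabular form: the TD target is split into a reference value that is frozen once a state has been visited often enough (and can therefore be averaged at a low-variance $1/n$ rate) plus a small advantage correction tracked at a faster rate. The sketch below is a loose illustration only; it drops the paper's episodic setting, UCB bonuses, and learning-rate schedule, and the toy MDP, thresholds, and rates are assumptions.

```python
import numpy as np

# Heavily simplified sketch of a reference-advantage update with early settling:
# the target value at the next state is decomposed into a (mostly frozen)
# reference part, averaged at rate 1/n, and a small advantage correction tracked
# at a faster rate. UCB bonuses and the paper's learning-rate schedule are omitted.

n_states, n_actions, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # random dynamics
R = rng.uniform(size=(n_states, n_actions))                       # random rewards

V = np.zeros(n_states)
V_ref = np.zeros(n_states)                    # reference values, frozen once "settled"
settled = np.zeros(n_states, dtype=bool)
ref_avg = np.zeros((n_states, n_actions))     # 1/n average of V_ref at next states
adv_avg = np.zeros((n_states, n_actions))     # faster estimate of V - V_ref at next states
count = np.zeros((n_states, n_actions), dtype=int)
state_visits = np.zeros(n_states, dtype=int)
SETTLE_AFTER = 500                            # hypothetical settling threshold

s = 0
for t in range(50000):
    greedy = int(np.argmax(R[s] + gamma * (ref_avg[s] + adv_avg[s])))
    a = int(rng.integers(n_actions)) if rng.random() < 0.2 else greedy
    s_next = int(rng.choice(n_states, p=P[s, a]))
    count[s, a] += 1
    state_visits[s_next] += 1

    # Low-variance part: 1/n running average of the (mostly frozen) reference value.
    ref_avg[s, a] += (V_ref[s_next] - ref_avg[s, a]) / count[s, a]
    # Advantage part: faster-moving estimate of the small correction V - V_ref.
    adv_avg[s, a] += 0.05 * ((V[s_next] - V_ref[s_next]) - adv_avg[s, a])

    V[s] = float(np.max(R[s] + gamma * (ref_avg[s] + adv_avg[s])))

    # "Early settling": freeze the reference once a state has been visited enough.
    if not settled[s_next] and state_visits[s_next] >= SETTLE_AFTER:
        V_ref[s_next] = V[s_next]
        settled[s_next] = True

    s = s_next

print("settled:", settled.astype(int))
print("V      :", np.round(V, 3))
```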
arXiv Detail & Related papers (2021-10-09T21:13:48Z)