What is the objective of reasoning with reinforcement learning?
- URL: http://arxiv.org/abs/2510.13651v1
- Date: Wed, 15 Oct 2025 15:13:38 GMT
- Title: What is the objective of reasoning with reinforcement learning?
- Authors: Damek Davis, Benjamin Recht
- Abstract summary: We show that several popular algorithms for reinforcement learning in large language models with binary rewards can be viewed as gradient ascent on a monotone transform of the probability of a correct answer given a prompt.
- Score: 7.728587479013023
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We show that several popular algorithms for reinforcement learning in large language models with binary rewards can be viewed as stochastic gradient ascent on a monotone transform of the probability of a correct answer given a prompt. In particular, the transformation associated with rejection sampling algorithms is the logarithm and that associated with the GRPO algorithm is the arcsine of the square root.
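For readability, the abstract's claim can be written out as explicit objectives. This is a sketch under assumed notation: the prompt distribution $\mathcal{D}$ and the model's probability $p_\theta(\text{correct} \mid x)$ of producing a correct answer for prompt $x$ are not symbols taken from the paper.

```latex
% Sketch of the objectives described in the abstract.
% The symbols \theta, \mathcal{D}, and p_\theta are assumed notation.
\[
  J_{\mathrm{RS}}(\theta)
    = \mathbb{E}_{x \sim \mathcal{D}}\bigl[\, \log p_\theta(\mathrm{correct} \mid x) \,\bigr],
  \qquad
  J_{\mathrm{GRPO}}(\theta)
    = \mathbb{E}_{x \sim \mathcal{D}}\bigl[\, \arcsin \sqrt{p_\theta(\mathrm{correct} \mid x)} \,\bigr].
\]
```

Both transforms are monotone increasing on $[0, 1]$, so the two families of algorithms perform stochastic gradient ascent on different rescalings of the same underlying success probability.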
Related papers
- Latent Chain-of-Thought for Visual Reasoning [53.541579327424046]
Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). We reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. We empirically demonstrate that the proposed method enhances state-of-the-art LVLMs on seven reasoning benchmarks.
arXiv Detail & Related papers (2025-10-27T23:10:06Z)
- The Gradient of Algebraic Model Counting [9.742948699856427]
We show that the very same semiring perspective of algebraic model counting also applies to learning. We show how cancellation and ordering properties of a semiring can be exploited for more memory-efficient backpropagation.
arXiv Detail & Related papers (2025-02-25T17:57:55Z)
- The ODE Method for Stochastic Approximation and Reinforcement Learning with Markovian Noise [17.493808856903303]
One fundamental challenge in analyzing a stochastic approximation algorithm is to establish its stability. We extend the celebrated Borkar-Meyn theorem for stability from the martingale difference noise setting to the Markovian noise setting.
arXiv Detail & Related papers (2024-01-15T17:20:17Z)
- GFN-SR: Symbolic Regression with Generative Flow Networks [0.9208007322096533]
In recent years, deep symbolic regression (DSR) has emerged as a popular method in the field.
We propose an alternative framework (GFN-SR) to approach SR with deep learning.
GFN-SR is capable of generating a diverse set of best-fitting expressions.
arXiv Detail & Related papers (2023-12-01T07:38:05Z)
- Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when done right -- by which we mean using specific insights from the optimisation and kernel communities -- gradient descent is highly effective.
We introduce a stochastic dual descent algorithm, explain its design in an intuitive manner and illustrate the design choices.
Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
arXiv Detail & Related papers (2023-10-31T16:15:13Z)
- Generalizing Backpropagation for Gradient-Based Interpretability [103.2998254573497]
We show that the gradient of a model is a special case of a more general formulation using semirings.
This observation allows us to generalize the backpropagation algorithm to efficiently compute other interpretable statistics.
arXiv Detail & Related papers (2023-07-06T15:19:53Z)
- Stochastic Unrolled Federated Learning [85.6993263983062]
We introduce UnRolled Federated learning (SURF), a method that expands algorithm unrolling to federated learning.
Our proposed method tackles two challenges of this expansion, namely the need to feed whole datasets to the unrolled optimizers and the decentralized nature of federated learning.
arXiv Detail & Related papers (2023-05-24T17:26:22Z)
- Minimizing the Outage Probability in a Markov Decision Process [0.0]
We propose an algorithm which makes it possible to optimize an alternative objective: the probability that the gain is greater than a given value.
The proposed algorithm can be seen as an extension of the value iteration algorithm.
arXiv Detail & Related papers (2023-02-28T16:26:23Z)
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
- Bayesian Recurrent Units and the Forward-Backward Algorithm [91.39701446828144]
Using Bayes's theorem, we derive a unit-wise recurrence as well as a backward recursion similar to the forward-backward algorithm.
The resulting Bayesian recurrent units can be integrated as recurrent neural networks within deep learning frameworks.
Experiments on speech recognition indicate that adding the derived units at the end of state-of-the-art recurrent architectures can improve the performance at a very low cost in terms of trainable parameters.
arXiv Detail & Related papers (2022-07-21T14:00:52Z)
- Learning Non-Vacuous Generalization Bounds from Optimization [8.294831479902658]
We present a simple yet non-vacuous generalization bound from the optimization perspective.
We achieve this goal by leveraging the fact that the hypothesis set accessed by gradient algorithms is essentially fractal-like.
Numerical studies demonstrate that our approach is able to yield plausible generalization guarantees for modern neural networks.
arXiv Detail & Related papers (2022-06-09T08:59:46Z)
- Random-reshuffled SARAH does not need a full gradient computations [61.85897464405715]
The StochAstic Recursive grAdient algoRitHm (SARAH) is a variance-reduced variant of the Stochastic Gradient Descent (SGD) algorithm.
In this paper, we remove the necessity of a full gradient computation.
The aggregated gradients serve as an estimate of a full gradient in the SARAH algorithm; a minimal sketch of the basic SARAH recursion appears after this list.
arXiv Detail & Related papers (2021-11-26T06:00:44Z)
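As a reading aid for the SARAH entry above, the following is a minimal sketch of the classic SARAH recursion, which restarts each outer loop from a full gradient; that full-gradient pass is exactly the computation the random-reshuffled variant described above avoids. The objective, step size, and epoch length here are illustrative assumptions, not details from the paper.

```python
import numpy as np

def sarah(grad_i, w0, n, step=0.01, epochs=5, inner_steps=None):
    """Classic SARAH for a finite-sum objective (1/n) * sum_i f_i(w).

    grad_i(w, i) returns the gradient of the i-th component f_i at w.
    Each outer loop begins with a full-gradient computation; the
    random-reshuffled variant discussed above replaces this pass with
    aggregated stochastic gradients.
    """
    w = np.asarray(w0, dtype=float)
    m = inner_steps or n
    for _ in range(epochs):
        # Full-gradient restart (the step the referenced paper removes).
        v = np.mean([grad_i(w, i) for i in range(n)], axis=0)
        w_prev, w = w.copy(), w - step * v
        for _ in range(m):
            i = np.random.randint(n)
            # SARAH recursion: v_t = grad_i(w_t) - grad_i(w_{t-1}) + v_{t-1}
            v = grad_i(w, i) - grad_i(w_prev, i) + v
            w_prev, w = w.copy(), w - step * v
    return w

# Toy usage: least squares (1/n) * sum_i (a_i @ w - b_i)**2 on random data.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(100, 5)), rng.normal(size=100)
grad = lambda w, i: 2.0 * (A[i] @ w - b[i]) * A[i]
w_hat = sarah(grad, np.zeros(5), n=100)
```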