Related papers: $\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space

URL: http://arxiv.org/abs/2603.04948v1
Date: Thu, 05 Mar 2026 08:42:54 GMT
Title: $\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space
Authors: Peihao Wang, Ruisi Cai, Zhen Wang, Hongyuan Mei, Qiang Liu, Pan Li, Zhangyang Wang
Abstract summary: $\nabla$-Reasoner is an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop. $\nabla$-Reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark.
Score: 71.23672814629448
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Scaling inference-time compute for Large Language Models (LLMs) has unlocked unprecedented reasoning capabilities. However, existing inference-time scaling methods typically rely on inefficient and suboptimal discrete search algorithms or trial-and-error prompting to improve the online policy. In this paper, we propose $\nabla$-Reasoner, an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop to refine the policy on the fly. Our core component, Differentiable Textual Optimization (DTO), leverages gradient signals from both the LLM's likelihood and a reward model to refine textual representations. $\nabla$-Reasoner further incorporates rejection sampling and an acceleration design to robustify and speed up decoding. Theoretically, we show that performing inference-time gradient descent in the sample space to maximize reward is dual to aligning an LLM policy via KL-regularized reinforcement learning. Empirically, $\nabla$-Reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark, while reducing the number of model calls by approximately 10-40% compared to strong baselines. Overall, our work introduces a paradigm shift from zeroth-order search to first-order optimization at test time, offering a cost-effective path to amplify LLM reasoning.
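
To make the abstract's idea of first-order refinement over token logits more concrete, here is a minimal toy sketch. It is not the paper's DTO implementation. It assumes a single token position, a hand-written log-likelihood vector standing in for the LLM, and a hand-written reward vector standing in for the reward model. The names `softmax`, `objective`, `refine_logits`, and the parameters `weight`, `lr`, and `steps` are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def objective(logits, log_lik, reward, weight):
    """Expected (log-likelihood + weight * reward) under softmax(logits)."""
    probs = softmax(logits)
    return sum(p * (l + weight * r) for p, l, r in zip(probs, log_lik, reward))


def refine_logits(logits, log_lik, reward, weight=1.0, lr=0.5, steps=200):
    """Plain gradient ascent on the toy objective with respect to the logits."""
    logits = list(logits)
    # Per-token score combining the two signals (toy stand-ins, an assumption).
    scores = [l + weight * r for l, r in zip(log_lik, reward)]
    for _ in range(steps):
        probs = softmax(logits)
        mean_score = sum(p * s for p, s in zip(probs, scores))
        # Analytic gradient: d/dz_k sum_v p_v * s_v = p_k * (s_k - mean_score)
        grad = [p * (s - mean_score) for p, s in zip(probs, scores)]
        logits = [z + lr * g for z, g in zip(logits, grad)]
    return logits


# Toy vocabulary of three tokens; all values are made up for illustration.
log_lik = [math.log(0.6), math.log(0.3), math.log(0.1)]
reward = [0.0, 2.0, 0.5]

initial = list(log_lik)  # start from the model's own distribution
refined = refine_logits(initial, log_lik, reward)

print("objective before:", objective(initial, log_lik, reward, 1.0))
print("objective after: ", objective(refined, log_lik, reward, 1.0))
print("greedy token after refinement:",
      max(range(len(refined)), key=lambda i: refined[i]))
```

In this toy setting the update moves probability mass toward the token whose combined likelihood and reward score is highest. A real system would presumably obtain both gradient signals by backpropagating through the language model and the reward model rather than from fixed vectors.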

Related papers

Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search [29.662680998899294]
We introduce Gome, an MLE agent that operationalizes gradient-based optimization. Gome achieves a state-of-the-art 35.1% any-medal rate on MLE-Bench with a restricted 12-hour budget on a single V100 GPU.
arXiv Detail & Related papers (2026-03-02T10:22:47Z)
Zero-Order Optimization for LLM Fine-Tuning via Learnable Direction Sampling [40.94400211806987]
We propose a policy-driven ZO framework that treats the sampling distribution over perturbation directions as a learnable policy. We show that learned sampling improves the quality of gradient information and relaxes the explicit dependence on $d$ in convergence bounds. Our results suggest that adaptive direction sampling is a promising route to make ZO fine-tuning viable at scale.
arXiv Detail & Related papers (2026-02-14T08:01:41Z)
Predicting and improving test-time scaling laws via reward tail-guided search [11.49701649103495]
Test-time scaling has emerged as a critical avenue for enhancing the reasoning capabilities of Large Language Models. We propose new methodologies to predict and improve scaling properties via tail-guided search. By estimating the tail distribution of rewards, our method predicts the scaling law of LLMs without the need for exhaustive evaluations.
arXiv Detail & Related papers (2026-02-01T23:40:25Z)
Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model [98.35868970993232]
Diffusion language models (DLMs) are emerging as a powerful and promising alternative to the dominant autoregressive paradigm. We introduce efficient Sampling with Adaptive acceleration and Backtracking Enhanced Remasking (i.e., Saber) to achieve better inference speed and output quality in code generation.
arXiv Detail & Related papers (2025-10-20T23:38:12Z)
Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models? [65.18157595903124]
This work investigates iterative approximate evaluation for arbitrary prompts. It introduces Model Predictive Prompt Selection (MoPPS), a Bayesian risk-predictive framework. MoPPS reliably predicts prompt difficulty and accelerates training with significantly reduced rollouts.
arXiv Detail & Related papers (2025-07-07T03:20:52Z)
$\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts [55.231201692232894]
$\texttt{SPECS}$ is a latency-aware test-time scaling method inspired by speculative decoding. Our results show that $\texttt{SPECS}$ matches or surpasses beam search accuracy while reducing latency by up to $\sim$19.1%.
arXiv Detail & Related papers (2025-06-15T05:50:05Z)
OAT-Rephrase: Optimization-Aware Training Data Rephrasing for Zeroth-Order LLM Fine-Tuning [25.76983801886268]
This paper introduces OAT-Rephrase, an Optimization-Aware Training data rephrasing strategy. We show that OAT-Rephrase consistently improves MeZO fine-tuning performance. Our findings suggest that optimization-aware rephrasing serves as a reusable and low-overhead enhancement for zeroth-order tuning regimes.
arXiv Detail & Related papers (2025-06-10T02:53:04Z)
Accelerating RL for LLM Reasoning with Optimal Advantage Regression [52.0792918455501]
We propose a novel two-stage policy optimization framework that directly approximates the optimal advantage function. $A^*$-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks. It reduces training time by up to 2$\times$ and peak memory usage by over 30% compared to PPO, GRPO, and REBEL.
arXiv Detail & Related papers (2025-05-27T03:58:50Z)
Reinforced Latent Reasoning for LLM-based Recommendation [92.56166822197919]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities in complex problem-solving tasks. Existing methods typically rely on fine-tuning with explicit chain-of-thought (CoT) data. In this work, we explore an alternative approach that shifts from explicit CoT reasoning to compact, information-dense latent reasoning.
arXiv Detail & Related papers (2025-05-25T11:03:45Z)
SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs [48.28847964704554]
Chain-of-Thought (CoT) reasoning enables Large Language Models (LLMs) to solve complex reasoning tasks. We propose a novel approach for continuous-space reasoning that does not require modifying the LLM.
arXiv Detail & Related papers (2025-02-17T18:52:29Z)
Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [63.10833446782114]
As language models grow in size, memory demands for backpropagation increase. Zeroth-order (ZO) optimization methods offer a memory-efficient alternative. In this paper, we propose Subspace Zero-order optimization to address the challenges posed by high-dimensional perturbations.
arXiv Detail & Related papers (2024-10-11T17:01:43Z)
Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models. We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses. We observe a remarkable speedup of 1.73$\times$ to 1.96$\times$, significantly surpassing standard speculative decoding.
arXiv Detail & Related papers (2024-07-23T06:21:24Z)
SHOT: Suppressing the Hessian along the Optimization Trajectory for Gradient-Based Meta-Learning [28.26143547479141]
We introduce an algorithm called SHOT (Suppressing the Hessian along the Optimization Trajectory). SHOT does not increase the computational complexity of the baseline model much. We confirm our hypothesis empirically and demonstrate that SHOT outperforms the corresponding baseline.
arXiv Detail & Related papers (2023-10-04T11:43:08Z)

This list is automatically generated from the titles and abstracts of the papers on this site.