RL Grokking Recipe: How Does RL Unlock and Transfer New Algorithms in LLMs?
- URL: http://arxiv.org/abs/2509.21016v2
- Date: Fri, 03 Oct 2025 23:19:39 GMT
- Title: RL Grokking Recipe: How Does RL Unlock and Transfer New Algorithms in LLMs?
- Authors: Yiyou Sun, Yuhan Cao, Pohao Huang, Haoyue Bai, Hannaneh Hajishirzi, Nouha Dziri, Dawn Song
- Abstract summary: We introduce DELTA-Code, a benchmark of synthetic coding problem families designed to probe two fundamental aspects: learnability and transferrability. Our experiments reveal a striking grokking phase transition: after an extended period with near-zero reward, RL-trained models abruptly climb to near-perfect accuracy. To enable learnability on previously unsolvable problem families, we explore key training ingredients such as staged warm-up with dense rewards, experience replay, curriculum training, and verification-in-the-loop.
- Score: 92.4931695205957
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It remains an open question whether LLMs can acquire or generalize genuinely new reasoning strategies, beyond the sharpened skills encoded in their parameters during pre-training or post-training. To address this question, we introduce DELTA-Code -- Distributional Evaluation of Learnability and Transferrability in Algorithmic Coding -- a controlled benchmark of synthetic coding problem families designed to probe two fundamental aspects: learnability -- can LLMs, through reinforcement learning (RL), solve problem families on which pretrained models fail even given a large number of attempts (pass@K = 0)? -- and transferrability -- once a problem family is learned, can the skill transfer systematically to out-of-distribution (OOD) test sets? Unlike prior public coding datasets, DELTA isolates reasoning skills through templated problem generators and introduces fully OOD problem families that demand novel strategies rather than tool invocation or memorized patterns. Our experiments reveal a striking grokking phase transition: after an extended period with near-zero reward, RL-trained models abruptly climb to near-perfect accuracy. To enable learnability on previously unsolvable problem families, we explore key training ingredients such as staged warm-up with dense rewards, experience replay, curriculum training, and verification-in-the-loop. Beyond learnability, we use DELTA to evaluate transferability or generalization along exploratory, compositional, and transformative axes, as well as cross-family transfer. Results show solid gains within families and for recomposed skills, but persistent weaknesses in transformative cases. DELTA thus offers a clean testbed for probing the limits of RL-driven reasoning and for understanding how models can move beyond existing priors to acquire new algorithmic skills.
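The staged warm-up with dense rewards is concrete enough to sketch. Below is a minimal illustration, not the paper's code: during warm-up the reward is the fraction of unit tests a candidate program passes, and afterwards it collapses to the sparse all-or-nothing signal. Encoding the candidate as a Python callable and the `warmup_steps` threshold are assumptions made for the example.

```python
def staged_reward(program, tests, step, warmup_steps=2000):
    """Dense partial credit during warm-up, sparse pass/fail afterwards."""
    passed = sum(program(x) == y for x, y in tests)
    frac = passed / len(tests)
    return frac if step < warmup_steps else float(frac == 1.0)

# Toy usage: a candidate "program" is any callable under test.
tests = [([3, 1, 2], [1, 2, 3]), ([5], [5])]
print(staged_reward(sorted, tests, step=10))       # 1.0 (all tests pass)
print(staged_reward(lambda x: x, tests, step=10))  # 0.5 (partial credit)
```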
Related papers
- Tailored Primitive Initialization is the Secret Key to Reinforcement Learning [61.29280885291581]
Reinforcement learning (RL) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). We argue that initializing LLMs with diverse, high-quality reasoning primitives is essential for achieving stable and sample-efficient RL training. We propose Tailor, a finetuning pipeline that automatically discovers and curates novel reasoning primitives.
arXiv Detail & Related papers (2025-11-16T03:12:40Z)
- From $f(x)$ and $g(x)$ to $f(g(x))$: LLMs Learn New Skills in RL by Composing Old Ones [68.68686526804909]
We show that LLMs can acquire genuinely new skills during RL by composing existing ones. Our experiments show that a compositional skill acquired on a source task transfers to a different target task. This transfer happens even without compositional training on the target, requiring only prior knowledge of the target's atomic skills.
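As a toy illustration of the compositional setup (our construction, not the paper's), one can generate f(g(x)) tasks by chaining two atomic string skills; the skill set and prompt format here are invented for the sketch.

```python
import random

# Atomic string skills; a composed task applies one after the other.
ATOMIC = {
    "reverse":  lambda s: s[::-1],
    "upper":    lambda s: s.upper(),
    "swapcase": lambda s: s.swapcase(),
}

def make_composed_example(x: str, rng=random):
    """Return a prompt asking for f(g(x)) and the ground-truth answer."""
    f_name, g_name = rng.sample(list(ATOMIC), 2)
    prompt = f"Apply {g_name}, then {f_name}, to: {x!r}"
    return prompt, ATOMIC[f_name](ATOMIC[g_name](x))

print(make_composed_example("hello"))
```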
arXiv Detail & Related papers (2025-09-29T17:44:27Z)
- Learning Distinguishable Representations in Deep Q-Networks for Linear Transfer [0.9558392439655014]
We propose a novel deep Q-learning approach that introduces a regularization term to reduce positive correlations between feature representations of states. We demonstrate the efficacy of our approach in improving transfer learning performance and thereby reducing computational overhead.
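A hedged sketch of what such a regularizer could look like, assuming the penalty acts on positive pairwise cosine similarities within a batch of state features; the exact form in the paper may differ.

```python
import torch

def positive_correlation_penalty(features: torch.Tensor) -> torch.Tensor:
    """Mean positive cosine similarity between distinct state features.

    features: (batch, dim) penultimate-layer representations of a batch of
    states. Add `lam * penalty` to the usual TD loss (illustrative usage).
    """
    f = torch.nn.functional.normalize(features, dim=1)
    sim = f @ f.t()                                  # pairwise cosine sims
    mask = ~torch.eye(f.size(0), dtype=torch.bool, device=f.device)
    return torch.clamp(sim[mask], min=0).mean()      # penalize positive pairs only
```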
arXiv Detail & Related papers (2025-09-29T15:44:35Z)
- RL for Reasoning by Adaptively Revealing Rationales [36.50924054394857]
Supervised fine-tuning (SFT) relies on dense ground-truth labels, which become increasingly costly as sequence length grows. We address this with adaptive backtracking (AdaBack), a per-sample curriculum learning algorithm that reveals only a partial prefix of the target output during training. We show that our adaptive curriculum over partial answers reliably solves problems that are otherwise intractable.
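The per-sample prefix curriculum can be sketched as a tiny stateful object; the initial reveal ratio and step size below are illustrative assumptions, not the paper's settings.

```python
class PrefixCurriculum:
    """Per-sample curriculum: reveal a prefix of the target rationale,
    shrinking it on success and growing it back on failure."""

    def __init__(self, target: str, step: int = 8):
        self.target = target
        self.k = len(target) // 2   # start by revealing half (an assumption)
        self.step = step

    def revealed_prefix(self) -> str:
        return self.target[:self.k]

    def update(self, solved: bool) -> None:
        if solved:
            self.k = max(0, self.k - self.step)                # less help next time
        else:
            self.k = min(len(self.target), self.k + self.step)  # more help
```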
arXiv Detail & Related papers (2025-06-22T17:46:14Z)
- Reinforcement Learning Teachers of Test Time Scaling [21.551446057221185]
A key use case of reasoning LMs is to act as teachers for distilling new students and cold-starting future RL iterations. We introduce a new framework that avoids RL's exploration challenge by training a new class of Reinforcement-Learned Teachers (RLTs). RLTs are prompted with both the question and solution to each problem, and tasked to simply "connect the dots" with detailed explanations tailored for their students.
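A minimal sketch of the teacher-side input this setup implies; the exact prompt wording is an assumption.

```python
def teacher_prompt(question: str, solution: str) -> str:
    """Teacher input in the spirit of RLTs: question AND solution are given,
    so the task is explanation rather than exploration."""
    return (
        f"Question:\n{question}\n\n"
        f"Verified solution:\n{solution}\n\n"
        "Explain step by step how to get from the question to this solution, "
        "in a way a student could follow and reproduce."
    )
```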
arXiv Detail & Related papers (2025-06-10T02:53:24Z)
- Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions [28.962415274754537]
Large language model (LLM) reasoning has shown that sophisticated behaviors such as planning and self-reflection can emerge through reinforcement learning (RL). We introduce a novel training approach, ReLIFT (Reinforcement Learning Interleaved with Online Fine-Tuning). In ReLIFT, the model is primarily trained using RL, but when it encounters challenging questions, high-quality solutions are collected for fine-tuning, and the training process alternates between the two.
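A hedged sketch of the interleaving logic as described, with `sample`, `rl_update`, `sft_update`, and `fetch_solution` as hypothetical hooks standing in for the trainer and the solution source.

```python
def relift_style_loop(questions, sample, rl_update, sft_update,
                      fetch_solution, n_rollouts=8, sft_every=100):
    """Interleave RL with fine-tuning on questions RL cannot crack.

    `sample(q)` returns (completion, correct); the other callables are
    hypothetical hooks, not any library's API.
    """
    buffer = []
    for i, q in enumerate(questions, start=1):
        rollouts = [sample(q) for _ in range(n_rollouts)]
        if any(correct for _, correct in rollouts):
            rl_update(q, rollouts)                  # ordinary RL step
        else:
            buffer.append((q, fetch_solution(q)))   # too hard: queue for SFT
        if buffer and i % sft_every == 0:
            sft_update(buffer)                      # interleaved fine-tuning
            buffer.clear()
```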
arXiv Detail & Related papers (2025-06-09T08:11:20Z)
- Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [93.00629872970364]
Reinforcement learning (RL) has become the dominant paradigm for improving the performance of language models on complex reasoning tasks. We introduce SPARKLE, a fine-grained analytic framework to dissect the effects of RL across three key dimensions. We study whether difficult problems -- those yielding no RL signals and mixed-quality reasoning traces -- can still be effectively used for training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z)
- R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [87.30285670315334]
R1-Searcher is a novel two-stage outcome-based RL approach designed to enhance the search capabilities of Large Language Models. Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start. Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini.
arXiv Detail & Related papers (2025-03-07T17:14:44Z)
- Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning [23.99454995087634]
We explore the potential of rule-based reinforcement learning in large reasoning models. We use synthetic logic puzzles as training data due to their controllable complexity and straightforward answer verification. Our 7B model develops advanced reasoning skills -- such as reflection, verification, and summarization -- that are absent from the logic corpus.
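Rule-based rewards of this kind are easy to sketch: a format check plus exact-match verification against the puzzle's known solution. The tag names and reward values below are assumptions, not Logic-RL's exact rule set.

```python
import re

def rule_based_reward(completion: str, gold: str) -> float:
    """Format check (answer must sit in <answer> tags) + exact match."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if m is None:
        return -1.0                     # malformed output: format penalty
    return 1.0 if m.group(1).strip() == gold.strip() else -0.5

print(rule_based_reward("thinking... <answer>Knight</answer>", "Knight"))  # 1.0
```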
arXiv Detail & Related papers (2025-02-20T17:49:26Z)
- Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning [62.984693936073974]
Value-based reinforcement learning can learn effective policies for a wide range of multi-turn problems. Current value-based RL methods have proven particularly challenging to scale to the setting of large language models. We propose a novel offline RL algorithm that addresses these drawbacks, casting Q-learning as a modified supervised fine-tuning problem.
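One way to read "Q-learning as a modified supervised fine-tuning problem" is a value-weighted token cross-entropy; the sketch below is an interpretation of the summary, not the paper's objective.

```python
import torch
import torch.nn.functional as F

def value_weighted_sft_loss(logits, tokens, targets):
    """Cross-entropy on chosen tokens, weighted by Bellman-style targets,
    so token probabilities can double as value estimates.

    logits: (T, vocab); tokens: (T,) chosen token ids; targets: (T,) in [0, 1].
    """
    logp = F.log_softmax(logits, dim=-1)
    chosen = logp[torch.arange(tokens.size(0)), tokens]
    return -(targets * chosen).mean()
```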
arXiv Detail & Related papers (2024-11-07T21:36:52Z)
- RLSF: Fine-tuning LLMs via Symbolic Feedback [11.407319705797242]
Large Language Models (LLMs) have transformed AI but often struggle with tasks that require domain-specific reasoning and logical alignment. Traditional fine-tuning methods do not leverage the vast amount of symbolic domain knowledge available to us. We introduce Reinforcement Learning via Symbolic Feedback (RLSF), a novel fine-tuning paradigm.
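A stand-in illustration of symbolic feedback, using Python's own compiler as the symbolic checker; the real RLSF pipeline uses domain-specific reasoning tools, so this checker choice and reward shape are purely illustrative.

```python
def symbolic_feedback(code: str) -> dict:
    """Fine-grained reward from a symbolic checker: a scalar reward plus a
    localized error message, rather than a bare pass/fail bit."""
    try:
        compile(code, "<llm-output>", "exec")
        return {"reward": 1.0, "error": None}
    except SyntaxError as e:
        return {"reward": -1.0, "error": f"line {e.lineno}: {e.msg}"}

print(symbolic_feedback("def f(:\n    pass"))  # negative reward + error location
```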
arXiv Detail & Related papers (2024-05-26T18:49:59Z)
- Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as reward.
We introduce a new RL formulation for text generation from the soft Q-learning perspective.
We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
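Treating the LM's logits as token-level Q-values suggests a soft Bellman regression; the sketch below, with shapes and reward placement as assumptions, illustrates that view rather than the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def soft_q_loss(q_t, q_next, action, reward, gamma=1.0):
    """Regress the chosen token's logit toward a soft Bellman target,
    reading logits as token-level Q-values.

    q_t, q_next: (vocab,) logits at steps t and t+1; action: chosen token id.
    """
    target = reward + gamma * torch.logsumexp(q_next, dim=-1)
    return F.mse_loss(q_t[action], target.detach())
```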
arXiv Detail & Related papers (2021-06-14T18:48:40Z)