Teaching Models to Teach Themselves: Reasoning at the Edge of Learnability
- URL: http://arxiv.org/abs/2601.18778v1
- Date: Mon, 26 Jan 2026 18:46:56 GMT
- Title: Teaching Models to Teach Themselves: Reasoning at the Edge of Learnability
- Authors: Shobhita Sundaram, John Quan, Ariel Kwiatkowski, Kartik Ahuja, Yann Ollivier, Julia Kempe
- Abstract summary: We show that it is possible to realize bi-level meta-RL that unlocks learning under sparse, binary rewards by sharpening a latent capacity of pretrained models to generate useful stepping stones. Our results suggest that the ability to generate useful stepping stones does not require the ability to actually solve the hard problems.
- Score: 25.507069397981194
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Can a model learn to escape its own learning plateau? Reinforcement learning methods for finetuning large reasoning models stall on datasets with low initial success rates, and thus little training signal. We investigate a fundamental question: can a pretrained LLM leverage latent knowledge to generate an automated curriculum for problems it cannot solve? To explore this, we design SOAR, a self-improvement framework that surfaces these pedagogical signals through meta-RL. A teacher copy of the model proposes synthetic problems for a student copy and is rewarded with the student's improvement on a small subset of hard problems. Critically, SOAR grounds the curriculum in measured student progress rather than intrinsic proxy rewards. Our study on the hardest subsets of mathematical benchmarks (0/128 success) reveals three core findings. First, we show that it is possible to realize bi-level meta-RL that unlocks learning under sparse, binary rewards by sharpening a latent capacity of pretrained models to generate useful stepping stones. Second, grounded rewards outperform the intrinsic reward schemes used in prior LLM self-play, reliably avoiding the instability and diversity-collapse modes they typically exhibit. Third, analyzing the generated questions reveals that structural quality and well-posedness are more critical for learning progress than solution correctness. Our results suggest that the ability to generate useful stepping stones does not require the preexisting ability to actually solve the hard problems, paving a principled path to escape reasoning plateaus without additional curated data.
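The bi-level structure described in the abstract is easier to see in code. The following is a minimal sketch, not the paper's implementation: `propose`, `finetune`, `evaluate`, and `reinforce` are hypothetical callables standing in for the teacher's problem generation, the student's RL update, pass-rate evaluation on the held-out hard set, and the teacher's meta-update.

```python
import copy

def soar_round(model, hard_set, propose, finetune, evaluate, reinforce, n=64):
    """One SOAR-style round (illustrative; the callables are hypothetical
    stand-ins, not the paper's actual generation/RL/evaluation procedures)."""
    teacher = model                      # teacher copy proposes the curriculum
    student = copy.deepcopy(model)       # student copy trains on it
    baseline = evaluate(student, hard_set)   # measured pass rate before training

    # Inner level: the teacher writes synthetic stepping-stone problems and
    # the student is RL-finetuned on them under its usual sparse binary reward.
    problems = propose(teacher, n)
    student = finetune(student, problems)

    # Outer level: the teacher's reward is grounded in measured student
    # progress on the held-out hard subset, not an intrinsic proxy score.
    teacher_reward = evaluate(student, hard_set) - baseline
    teacher = reinforce(teacher, problems, teacher_reward)
    return teacher, student, teacher_reward
```

The key design choice the abstract emphasizes is the last step: the teacher is paid only for measured student improvement, which is what distinguishes grounded rewards from the intrinsic proxies used in prior self-play schemes.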
Related papers
- BEAGLE: Behavior-Enforced Agent for Grounded Learner Emulation [16.147318846582298]
Simulating student learning behaviors in open-ended problem-solving environments holds potential for education research. However, collecting authentic data is challenging due to privacy concerns and the high cost of longitudinal studies. We present BEAGLE, a neuro-symbolic framework that addresses this bias by incorporating Self-Regulated Learning (SRL) theory into a novel architecture.
arXiv Detail & Related papers (2026-02-06T08:05:15Z)
- Stabilizing Reinforcement Learning for Honesty Alignment in Language Models on Deductive Reasoning [27.42733470720954]
We propose a reinforcement learning method that injects ground truth trajectories into rollouts, preventing early training collapse. Our results demonstrate that this method stabilizes learning and significantly improves overall reasoning performance.
arXiv Detail & Related papers (2025-11-12T11:34:19Z)
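The injection mechanism summarized above is simple to sketch. This is a hedged illustration, not the paper's exact recipe: the injection fraction and uniform sampling are assumptions.

```python
import random

def build_rollout_batch(policy_rollouts, ground_truth_trajs, inject_frac=0.25):
    """Replace a fraction of on-policy rollouts with ground-truth trajectories
    so that early batches always contain some positive-reward examples,
    preventing the all-zero-reward collapse described above. (Illustrative
    sketch; fraction and sampling scheme are assumptions.)"""
    n_inject = min(int(len(policy_rollouts) * inject_frac),
                   len(ground_truth_trajs))
    injected = random.sample(ground_truth_trajs, n_inject)
    kept = policy_rollouts[: len(policy_rollouts) - n_inject]
    batch = kept + injected
    random.shuffle(batch)  # mix sources so every gradient step sees both
    return batch
```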
- The Path of Self-Evolving Large Language Models: Achieving Data-Efficient Learning via Intrinsic Feedback [51.144727949988436]
Reinforcement learning (RL) has demonstrated potential to enhance the reasoning capabilities of large language models (LLMs). In this work, we explore improving LLMs through RL with minimal data. To minimize data dependency, we introduce two novel mechanisms grounded in self-awareness.
arXiv Detail & Related papers (2025-10-03T06:32:10Z)
- Nudging the Boundaries of LLM Reasoning [77.26972440427285]
Current online reinforcement learning algorithms cannot learn from problems that are "unsolvable" to the model. We propose NuRL, a "nudging" method that aims to push the upper bound of LLM reasoning using self-generated hints. NuRL achieves consistent improvements across 6 benchmarks and 3 models, while remaining complementary to test-time scaling.
arXiv Detail & Related papers (2025-09-30T02:01:40Z)
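A minimal sketch of the nudging idea, under stated assumptions: the helpers `sample`, `is_correct`, and `generate_hint`, and the prompt format, are hypothetical stand-ins, not NuRL's actual interface.

```python
def maybe_nudge(problem, sample, is_correct, generate_hint, k=8):
    """If the model never solves a problem in k attempts, prepend a
    self-generated hint so RL receives some learning signal.
    (Illustrative; helpers and format are assumptions.)"""
    attempts = [sample(problem) for _ in range(k)]
    if any(is_correct(a, problem) for a in attempts):
        return problem                      # solvable as-is: train normally
    hint = generate_hint(problem)           # hint from the model itself
    return f"Hint: {hint}\n\n{problem}"     # nudge an otherwise unsolvable item
```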
- RL Grokking Recipe: How Does RL Unlock and Transfer New Algorithms in LLMs? [92.4931695205957]
We introduce DELTA-Code, a benchmark of synthetic coding problem families designed to probe two fundamental aspects: learnability and transferability. Our experiments reveal a striking grokking phase transition: after an extended period with near-zero reward, RL-trained models abruptly climb to near-perfect accuracy. To enable learnability on previously unsolvable problem families, we explore key training ingredients such as staged warm-up with dense rewards, experience replay, curriculum training, and verification-in-the-loop.
arXiv Detail & Related papers (2025-09-25T11:20:56Z)
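One of the listed ingredients, experience replay, can be sketched concisely. Everything below is an illustrative reading, not the paper's implementation; the capacity and replay fraction are assumptions.

```python
import collections
import random

class SuccessReplayBuffer:
    """Hoard the rare successful rollouts found during the long near-zero-
    reward phase and replay them alongside fresh samples, so the learner
    does not have to rediscover them. (Illustrative sketch.)"""

    def __init__(self, capacity=512):
        self.buffer = collections.deque(maxlen=capacity)

    def add(self, rollout, reward):
        if reward > 0:                  # keep only successful trajectories
            self.buffer.append(rollout)

    def mix_into(self, fresh_rollouts, replay_frac=0.25):
        n = min(int(len(fresh_rollouts) * replay_frac), len(self.buffer))
        return fresh_rollouts + random.sample(list(self.buffer), n)
```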
- Know When to Explore: Difficulty-Aware Certainty as a Guide for LLM Reinforcement Learning [37.20632079882874]
We introduce Difficulty-Aware Certainty-guided Exploration (DACE), which balances the exploration-exploitation trade-off based on the policy's success rate. Experiments on challenging mathematical reasoning benchmarks (AIME, MATH) show that DACE significantly outperforms strong baselines.
arXiv Detail & Related papers (2025-08-29T08:57:54Z)
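One plausible reading of "balancing exploration and exploitation by success rate" is a temperature schedule; the linear rule and constants below are assumptions for illustration, not DACE's actual mechanism.

```python
def sampling_temperature(success_rate, t_exploit=0.7, t_explore=1.2):
    """Harder problems (low success rate) get a higher sampling temperature
    (more exploration); easy problems get a lower one (more exploitation).
    (Illustrative; the rule and constants are assumptions.)"""
    difficulty = 1.0 - success_rate          # 0 = easy, 1 = hard
    return t_exploit + difficulty * (t_explore - t_exploit)
```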
- Can Large Models Teach Student Models to Solve Mathematical Problems Like Human Beings? A Reasoning Distillation Method via Multi-LoRA Interaction [6.695255921627406]
Large Language Models (LLMs) have strong mathematical reasoning abilities but rely on hundreds of billions of parameters. Existing methods typically leverage LLMs to generate massive amounts of data for cramming training. We propose a novel method based on multi-LoRA Interaction for mathematical reasoning Distillation (LoRID). LoRID achieves state-of-the-art performance, especially on the GSM8K dataset.
arXiv Detail & Related papers (2025-08-18T15:56:10Z)
- Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning [58.62311540316617]
We aim to improve the reasoning capabilities of language models via reinforcement learning (RL). We propose to schedule tasks from easy to hard (E2H), allowing LLMs to build reasoning skills gradually. E2H Reasoner significantly improves the reasoning ability of small LLMs (1.5B to 3B).
arXiv Detail & Related papers (2025-06-07T02:41:54Z)
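An easy-to-hard schedule reduces to sorting by a difficulty estimate and training in stages. The sketch below makes assumptions the paper does not necessarily share: a three-stage split of equal size and a user-supplied difficulty function.

```python
def e2h_schedule(tasks, difficulty, n_stages=3):
    """Sort tasks by an estimated difficulty and split them into stages to
    be trained in order, so the model builds reasoning skills gradually.
    (Illustrative; stage count and equal-size split are assumptions.)"""
    ordered = sorted(tasks, key=difficulty)
    stage_size = max(1, len(ordered) // n_stages)
    stages = [ordered[i * stage_size:(i + 1) * stage_size]
              for i in range(n_stages - 1)]
    stages.append(ordered[(n_stages - 1) * stage_size:])  # last stage takes the rest
    return stages

# e.g. stages = e2h_schedule(problems, difficulty=lambda p: p["fail_rate"]),
# then run RL finetuning on stages[0], stages[1], ... in order.
```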
- S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z)
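The inference-time behavior being taught can be sketched as a verify-then-revise loop. Note the hedge: `generate` and `verify` are hypothetical callables, and S$^2$R trains these behaviors with RL rather than prescribing this exact control flow.

```python
def solve_with_self_correction(problem, generate, verify, max_rounds=3):
    """Self-verify/self-correct loop at inference time. (Illustrative;
    the callables and round limit are assumptions.)"""
    answer = generate(problem, previous_attempt=None)
    for _ in range(max_rounds):
        if verify(problem, answer):      # the model checks its own answer
            return answer
        # the model revises, conditioned on its failed attempt
        answer = generate(problem, previous_attempt=answer)
    return answer
```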
- LLM-based Cognitive Models of Students with Misconceptions [55.29525439159345]
This paper investigates whether Large Language Models (LLMs) can be instruction-tuned to meet this dual requirement.
We introduce MalAlgoPy, a novel Python library that generates datasets reflecting authentic student solution patterns.
Our insights enhance our understanding of AI-based student models and pave the way for effective adaptive learning systems.
arXiv Detail & Related papers (2024-10-16T06:51:09Z)
- Unleash Model Potential: Bootstrapped Meta Self-supervised Learning [12.57396771974944]
A long-term goal of machine learning is to learn general visual representations from a small amount of data without supervision.
Self-supervised learning and meta-learning are two promising techniques to achieve this goal, but each captures the advantages only partially.
We propose a novel Bootstrapped Meta Self-Supervised Learning framework that aims to simulate the human learning process.
arXiv Detail & Related papers (2023-08-28T02:49:07Z)