Related papers: Can an Easy-to-Hard Curriculum Make Reasoning Emerge in Small Language Models? Evidence from a Four-Stage Curriculum on GPT-2

Can an Easy-to-Hard Curriculum Make Reasoning Emerge in Small Language Models? Evidence from a Four-Stage Curriculum on GPT-2

URL: http://arxiv.org/abs/2505.11643v1
Date: Fri, 16 May 2025 19:08:31 GMT
Title: Can an Easy-to-Hard Curriculum Make Reasoning Emerge in Small Language Models? Evidence from a Four-Stage Curriculum on GPT-2
Authors: Xiang Fu,
Abstract summary: We demonstrate that a developmentally ordered curriculum markedly improves reasoning transparency and sample-efficiency in small language models.<n>We identify challenges: final-answer success still lags a conventional run by about 30%, and our saliency probe under-detects verbal-knowledge heads in the hardest stage.
Score: 0.8423417997128777
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We demonstrate that a developmentally ordered curriculum markedly improves reasoning transparency and sample-efficiency in small language models (SLMs). Concretely, we train Cognivolve, a 124 M-parameter GPT-2 model, on a four-stage syllabus that ascends from lexical matching to multi-step symbolic inference and then evaluate it without any task-specific fine-tuning. Cognivolve reaches target accuracy in half the optimization steps of a single-phase baseline, activates an order-of-magnitude more gradient-salient reasoning heads, and shifts those heads toward deeper layers, yielding higher-entropy attention that balances local and long-range context. The same curriculum applied out of order or with optimizer resets fails to reproduce these gains, confirming that progression--not extra compute--drives the effect. We also identify open challenges: final-answer success still lags a conventional run by about 30%, and our saliency probe under-detects verbal-knowledge heads in the hardest stage, suggesting directions for mixed-stage fine-tuning and probe expansion.

Related papers

Curriculum Learning for Efficient Chain-of-Thought Distillation via Structure-Aware Masking and GRPO [24.91321958525287]
Distilling Chain-of-Thought (CoT) reasoning from large language models into compact student models presents a fundamental challenge.<n>Existing approaches either compress reasoning into single-step, losing the interpretability that makes CoT valuable.<n>We present a three-stage curriculum learning framework that addresses this capacity mismatch through progressive skill acquisition.
arXiv Detail & Related papers (2026-02-05T05:27:11Z)
Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training [76.12556589212666]
We show that curriculum post-training avoids the exponential complexity bottleneck.<n>Under outcome-only reward signals, reinforcement learning finetuning achieves high accuracy with sample complexity.<n>We establish guarantees for test-time scaling, where curriculum-aware querying reduces both reward oracle calls and sampling cost from exponential to order.
arXiv Detail & Related papers (2025-11-10T18:29:54Z)
Learning a Dense Reasoning Reward Model from Expert Demonstration via Inverse Reinforcement Learning [50.20267980386502]
We learn a dense, token-level reward model for process supervision directly from expert demonstrations.<n>The learned reasoning reward serves two complementary roles: (i) it provides step-level feedback to optimise a reasoning policy during training; and (ii) it functions at inference as a critic to rerank sampled traces under fixed compute budgets.
arXiv Detail & Related papers (2025-10-02T09:55:26Z)
Light-IF: Endowing LLMs with Generalizable Reasoning via Preview and Self-Checking for Complex Instruction Following [10.119219532863767]
lazy reasoning during the thinking stage is the primary factor contributing to poor instruction adherence.<n>We propose a comprehensive framework designed to enable rigorous reasoning processes involving preview and self-checking.<n>Our Light-IF-32B model surpasses both larger open-source models such as DeepSeek-R1 and closed-source models like Doubao-1.6.
arXiv Detail & Related papers (2025-08-05T07:42:00Z)
Unlocking the Potential of Difficulty Prior in RL-based Multimodal Reasoning [69.64809103333839]
We investigate how explicitly modeling problem's difficulty prior information shapes the effectiveness of reinforcement learning based fine-tuning for multimodal reasoning.<n>Our approach demonstrates significant performances across various multi-modal mathematical reasoning benchmarks with only 2K+0.6K two-stage training data.
arXiv Detail & Related papers (2025-05-19T15:43:10Z)
Improving Large Language Model Planning with Action Sequence Similarity [50.52049888490524]
In this work, we explore how to improve the model planning capability through in-context learning (ICL)<n>We propose GRASE-DC: a two-stage pipeline that first re-samples high AS exemplars and then curates the selected exemplars.<n>Our experimental result confirms that GRASE-DC achieves significant performance improvement on various planning tasks.
arXiv Detail & Related papers (2025-05-02T05:16:17Z)
Corrections Meet Explanations: A Unified Framework for Explainable Grammatical Error Correction [29.583603444317855]
We introduce EXGEC, a unified explainable GEC framework that integrates explanation and correction tasks in a generative manner.<n>Results on various NLP models (BART, T5, and Llama3) show that EXGEC models surpass single-task baselines in both tasks.
arXiv Detail & Related papers (2025-02-21T07:42:33Z)
Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback [94.25162866972077]
Step-KTO is a training framework that combines process-level and outcome-level binary feedback.<n>Our experiments show that Step-KTO significantly improves both final answer accuracy and the quality of intermediate reasoning steps.
arXiv Detail & Related papers (2025-01-18T15:38:03Z)
Multi-Step Reasoning in Korean and the Emergent Mirage [0.0]
We introduce HRMCR (HAE-RAE Multi-Step Commonsense Reasoning), a benchmark designed to evaluate large language models' ability to perform multi-step reasoning in culturally specific contexts.<n>The questions are automatically generated via templates and algorithms, requiring LLMs to integrate Korean cultural knowledge into sequential reasoning steps.<n>Our experiments reveal that models trained on fewer than (2 cdot 1025) training FLOPs struggle to solve any questions, showing near-zero performance.
arXiv Detail & Related papers (2025-01-10T05:07:27Z)
A Systematic Examination of Preference Learning through the Lens of Instruction-Following [83.71180850955679]
We use a novel synthetic data generation pipeline to generate 48,000 instruction unique-following prompts.<n>With our synthetic prompts, we use two preference dataset curation methods - rejection sampling (RS) and Monte Carlo Tree Search (MCTS)<n>Experiments reveal that shared prefixes in preference pairs, as generated by MCTS, provide marginal but consistent improvements.<n>High-contrast preference pairs generally outperform low-contrast pairs; however, combining both often yields the best performance.
arXiv Detail & Related papers (2024-12-18T15:38:39Z)
Improving Discrete Optimisation Via Decoupled Straight-Through Gumbel-Softmax [4.427325225595673]
We show that our approach significantly enhances the original ST-GS through extensive experiments across multiple tasks and datasets. Our findings contribute to the ongoing effort to improve discrete optimization in deep learning.
arXiv Detail & Related papers (2024-10-17T08:44:57Z)
Unconditional Truthfulness: Learning Unconditional Uncertainty of Large Language Models [104.55763564037831]
We train a regression model that leverages attention maps, probabilities on the current generation step, and recurrently computed uncertainty scores from previously generated tokens.<n>Our evaluation shows that the proposed method is highly effective for selective generation, achieving substantial improvements over rivaling unsupervised and supervised approaches.
arXiv Detail & Related papers (2024-08-20T09:42:26Z)
Visual Prompt Tuning in Null Space for Continual Learning [51.96411454304625]
Existing prompt-tuning methods have demonstrated impressive performances in continual learning (CL) This paper aims to learn each task by tuning the prompts in the direction orthogonal to the subspace spanned by previous tasks' features. In practice, an effective null-space-based approximation solution has been proposed to implement the prompt gradient projection.
arXiv Detail & Related papers (2024-06-09T05:57:40Z)
Ladder-of-Thought: Using Knowledge as Steps to Elevate Stance Detection [73.31406286956535]
We introduce the Ladder-of-Thought (LoT) for the stance detection task. LoT directs the small LMs to assimilate high-quality external knowledge, refining the intermediate rationales produced. Our empirical evaluations underscore LoT's efficacy, marking a 16% improvement over GPT-3.5 and a 10% enhancement compared to GPT-3.5 with CoT on stance detection task.
arXiv Detail & Related papers (2023-08-31T14:31:48Z)
Adversarial Fine-Tuning of Language Models: An Iterative Optimisation Approach for the Generation and Detection of Problematic Content [0.0]
We tackle the emerging challenge of unintended harmful content generation in Large Language Models (LLMs) Our two-pronged approach employs an adversarial model, fine-tuned to generate potentially harmful prompts, and a judge model, iteratively optimised to discern these prompts. We show that a rudimentary model textttada can achieve 13% higher accuracy on the hold-out test set than GPT-4 after only a few rounds of this process.
arXiv Detail & Related papers (2023-08-26T05:20:58Z)
Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM)-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines. In this paper, we propose a different explanation: pre-trains succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics. Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.