The Valley of Code Reasoning: Scaling Knowledge Distillation of Large Language Models
- URL: http://arxiv.org/abs/2510.06101v1
- Date: Tue, 07 Oct 2025 16:32:09 GMT
- Title: The Valley of Code Reasoning: Scaling Knowledge Distillation of Large Language Models
- Authors: Muyu He, Muhammad Ali Shafique, Anand Kumar, Tsach Mackey, Nazneen Rajani,
- Abstract summary: We study the scaling trend of distilling competitive coding skills on two small non-reasoning LLMs.<n>We find that, surprisingly, the correctness of outputs in training data makes no difference to distillation outcomes.
- Score: 40.900241869345976
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Distilling the thinking traces of a Large Language Model (LLM) with reasoning capabilities into a smaller model has been proven effective. Yet, there is a scarcity of work done on how model performances scale with the quantity of distillation data. In this work, we study the scaling trend of distilling competitive coding skills on two small non-reasoning LLMs. We validate the hypothesis that there is a $\textit{valley of code reasoning}$: downstream performance on competitive coding first drops as data quantity increases, then it steadily increases in a sharper-than-log-linear fashion. Having identified the trend, we further fine-tune the models at two different distillation stages on the same data to ground conclusions on their respective learning phases. We learn that across stages in the low and medium-low data regimes, small models benefit significantly from easier coding questions than from harder ones. We also find that, surprisingly, the correctness of outputs in training data makes no difference to distillation outcomes. Our work represents a step forward in understanding the training dynamics of code reasoning distillation outside intuition
Related papers
- Extracting alignment data in open models [50.81383232591576]
We show that it is possible to extract significant amounts of alignment training data from a post-trained model.<n>This data is useful to steer the model to improve certain capabilities such as long-context reasoning, safety, instruction following, and maths.<n>We find that models readily regurgitate training data that was used in post-training phases such as SFT or RL.
arXiv Detail & Related papers (2025-10-21T12:06:00Z) - TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression [55.37723860832064]
We propose a dynamic ratio-based training pipeline that does not rely on sophisticated data annotations.<n>We validate our approach across models on DeepSeek-R1-Distill-7B and DeepSeek-R1-Distill-14B and on a diverse set of benchmarks with varying difficulty levels.
arXiv Detail & Related papers (2025-06-03T09:23:41Z) - Honey, I Shrunk the Language Model: Impact of Knowledge Distillation Methods on Performance and Explainability [3.224880576815583]
High computational and storage demands of Large Language Models limit their deployment in resource-constrained environments.<n>Previous research has introduced several distillation methods for both generating training data and for training the student model.<n>Despite their relevance, the effects of state-of-the-art distillation methods on model performance and explainability have not been thoroughly investigated.
arXiv Detail & Related papers (2025-04-22T17:32:48Z) - OpenCodeReasoning: Advancing Data Distillation for Competitive Coding [61.15402517835137]
We build a supervised fine-tuning (SFT) dataset to achieve state-of-the-art coding capability results in models of various sizes.<n>Our models use only SFT to achieve 61.8% on LiveCodeBench and 24.6% on CodeContests, surpassing alternatives trained with reinforcement learning.
arXiv Detail & Related papers (2025-04-02T17:50:31Z) - S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference.<n>Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z) - PaD: Program-aided Distillation Can Teach Small Models Reasoning Better than Chain-of-thought Fine-tuning [20.59775450213501]
We propose Program-aided Distillation (PaD), which introduces reasoning programs to suppress the errors in distilled data.
We evaluate PaD on arithmetic reasoning, symbolic reasoning, and general ability.
arXiv Detail & Related papers (2023-05-23T10:11:56Z) - Distilling Step-by-Step! Outperforming Larger Language Models with Less
Training Data and Smaller Model Sizes [91.58845026796149]
We introduce Distilling step-by-step, a new mechanism that trains small models that outperform large language models.
We present three findings across 4 NLP benchmarks.
arXiv Detail & Related papers (2023-05-03T17:50:56Z) - Data Contamination: From Memorization to Exploitation [5.997909991352044]
It is not clear to what extent models exploit contaminated data for downstream tasks.
We pretrain BERT models on joint corpora of Wikipedia and labeled downstream datasets, and fine-tune them on the relevant task.
Experiments with two models and three downstream tasks show that exploitation exists in some cases, but in others the models memorize the contaminated data, but do not exploit it.
arXiv Detail & Related papers (2022-03-15T20:37:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.