Enhancing Reasoning Capabilities in SLMs with Reward Guided Dataset Distillation
- URL: http://arxiv.org/abs/2507.00054v1
- Date: Wed, 25 Jun 2025 20:07:47 GMT
- Title: Enhancing Reasoning Capabilities in SLMs with Reward Guided Dataset Distillation
- Authors: Shreyansh Padarha
- Abstract summary: We propose AdvDistill, a reward-guided dataset distillation framework. We utilise multiple generations (responses) from a teacher for each prompt and assign rewards based on rule-based verifiers. These varying and normally distributed rewards serve as weights when training student models.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The push to compress and impart the proficiency of Large Language Models (LLMs) into more deployable and efficient Small Language Models (SLMs) has benefited from improvements in knowledge distillation (KD) techniques. These techniques allow a smaller student model to learn from a more capable and larger teacher model's responses. However, distillation often revolves around the student model merely copying the teacher's in-distribution responses, limiting its generalisability. This limitation is amplified on reasoning tasks and can be computationally expensive. In this study, we propose AdvDistill, a reward-guided dataset distillation framework. We utilise multiple generations (responses) from a teacher for each prompt and assign rewards based on rule-based verifiers. These varying and normally distributed rewards serve as weights when training student models. Our methods and their subsequent behavioural analysis demonstrate a significant improvement in student model performance for mathematical and complex reasoning tasks, showcasing the efficacy and benefits of incorporating a rewarding mechanism in dataset distillation processes.
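A minimal sketch of the central idea described in the abstract, assuming a PyTorch training loop: each teacher response is scored by a rule-based verifier, the rewards for a prompt's generations are normalised, and the normalised rewards weight the per-response cross-entropy loss used to train the student. The function name, signature, and normalisation scheme below are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def reward_weighted_distillation_loss(student_logits, target_ids, rewards, pad_token_id=0):
    """Reward-weighted token-level cross-entropy over teacher-generated responses.

    student_logits: (batch, seq_len, vocab) student predictions for each teacher response.
    target_ids:     (batch, seq_len) token ids of the teacher responses.
    rewards:        (batch,) verifier scores for each response, e.g. 1.0 for a
                    correct final answer and 0.0 otherwise (hypothetical scheme).
    """
    # Per-token cross-entropy, ignoring padding positions.
    ce = F.cross_entropy(
        student_logits.transpose(1, 2),  # (batch, vocab, seq_len)
        target_ids,
        ignore_index=pad_token_id,
        reduction="none",
    )  # (batch, seq_len)

    # Average the loss over the non-padded tokens of each response.
    mask = (target_ids != pad_token_id).float()
    per_response_loss = (ce * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

    # Normalise rewards across the generations so that better-scored responses
    # contribute proportionally larger gradients to the student update.
    weights = rewards / rewards.sum().clamp(min=1e-8)

    return (weights * per_response_loss).sum()
```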
Related papers
- Honey, I Shrunk the Language Model: Impact of Knowledge Distillation Methods on Performance and Explainability [3.224880576815583]
High computational and storage demands of Large Language Models limit their deployment in resource-constrained environments. Previous research has introduced several distillation methods for both generating training data and for training the student model. Despite their relevance, the effects of state-of-the-art distillation methods on model performance and explainability have not been thoroughly investigated.
arXiv Detail & Related papers (2025-04-22T17:32:48Z) - Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
A self-distillation (SSD) training strategy is introduced for filtering and weighting teacher representations to distill from task-relevant representations only. Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-04-19T14:08:56Z) - UNDO: Understanding Distillation as Optimization [9.100811514331498]
We introduce UNDO: UNderstanding Distillation as Optimization, an iterative framework. Each iteration directly targets the student's learning deficiencies, motivating the teacher to provide tailored and enhanced rationales. Empirical evaluations on various challenging mathematical and commonsense reasoning tasks demonstrate that our iterative distillation method, UNDO, significantly outperforms standard one-step distillation methods.
arXiv Detail & Related papers (2025-04-03T12:18:51Z) - Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones? [58.80794196076336]
Distilling large language models (LLMs) typically involves transferring the teacher model's responses through supervised fine-tuning (SFT). We propose a novel distillation pipeline that transfers both responses and rewards. Our method generates pseudo-rewards through a self-supervised mechanism that leverages the inherent structure of both teacher and student responses.
arXiv Detail & Related papers (2025-02-26T20:50:11Z) - Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z) - Distilling Robustness into Natural Language Inference Models with Domain-Targeted Augmentation [12.512147282842175]
We investigate two complementary methods for improving the robustness of the resulting student models on out-of-distribution domains.
The first approach augments the distillation with generated unlabelled examples that match the target distribution.
The second method upsamples data points among the training set that are similar to the target distribution.
arXiv Detail & Related papers (2023-05-22T14:37:05Z) - EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z) - Learning to Augment for Data-Scarce Domain BERT Knowledge Distillation [55.34995029082051]
We propose a method to learn to augment for data-scarce domain BERT knowledge distillation.
We show that the proposed method significantly outperforms state-of-the-art baselines on four different tasks.
arXiv Detail & Related papers (2021-01-20T13:07:39Z) - Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to a teacher model throughout the entire distillation process.
Most existing methods allocate an equal weight to every teacher model.
In this paper, we observe that, due to the complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models.
arXiv Detail & Related papers (2020-12-11T08:56:39Z)