Small Models Struggle to Learn from Strong Reasoners
- URL: http://arxiv.org/abs/2502.12143v1
- Date: Mon, 17 Feb 2025 18:56:15 GMT
- Title: Small Models Struggle to Learn from Strong Reasoners
- Authors: Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, Radha Poovendran,
- Abstract summary: Small models do not consistently benefit from long chain-of-thought reasoning or distillation from larger models.
We propose Mix Distillation, a strategy that balances reasoning complexity by combining long and short CoT examples or reasoning from both larger and smaller models.
Our experiments demonstrate that Mix Distillation significantly improves small model reasoning performance compared to training on either data alone.
- Score: 14.895026967556088
- License:
- Abstract: Large language models (LLMs) excel in complex reasoning tasks, and distilling their reasoning capabilities into smaller models has shown promise. However, we uncover an interesting phenomenon, which we term the Small Model Learnability Gap: small models ($\leq$3B parameters) do not consistently benefit from long chain-of-thought (CoT) reasoning or distillation from larger models. Instead, they perform better when fine-tuned on shorter, simpler reasoning chains that better align with their intrinsic learning capacity. To address this, we propose Mix Distillation, a simple yet effective strategy that balances reasoning complexity by combining long and short CoT examples or reasoning from both larger and smaller models. Our experiments demonstrate that Mix Distillation significantly improves small model reasoning performance compared to training on either data alone. These findings highlight the limitations of direct strong model distillation and underscore the importance of adapting reasoning complexity for effective reasoning capability transfer.
Related papers
- Self-Enhanced Reasoning Training: Activating Latent Reasoning in Small Models for Enhanced Reasoning Distillation [34.48068572413349]
We show that small models can generate high-quality reasoning paths during sampling, even without chain-of-thought prompting.
We propose Self-Enhanced Reasoning Training (SERT), which activates and leverages latent reasoning capabilities in small models.
Experiments using OpenAI's GPT-3.5 as the teacher model and GPT-2 models as the student models demonstrate that SERT enhances the reasoning abilities of small models.
arXiv Detail & Related papers (2025-02-18T11:02:47Z) - When More is Less: Understanding Chain-of-Thought Length in LLMs [53.77747102201451]
Chain-of-thought (CoT) reasoning enhances the multi-step reasoning capabilities of large language models (LLMs)
However, for most models and tasks, does an increase in CoT length consistently lead to improved reasoning accuracy?
In this paper, we observe a nuanced relationship: as the number of reasoning steps increases, performance initially improves but eventually decreases.
arXiv Detail & Related papers (2025-02-11T05:28:59Z) - Training Language Models to Reason Efficiently [14.390800014819439]
We use reinforcement learning to train large reasoning models to reason efficiently.
Our method incentivizes models to minimize unnecessary computational overhead while maintaining accuracy.
Experiments on two open-weight large reasoning models demonstrate significant reductions in inference cost while preserving most of the accuracy.
arXiv Detail & Related papers (2025-02-06T19:18:16Z) - AS-ES Learning: Towards Efficient CoT Learning in Small Models [35.225382243612174]
Chain-of-Thought (CoT) serves as a critical emerging ability in Large Language Models (LLMs)
We propose a new training paradigm AS-ES (Abstractive Segments - Extractive Segments) learning, which exploits the inherent information in CoT for iterative generation.
Experiments show that our methods surpass the direct seq2seq training on CoT-extensive tasks like MWP and PET summarization, without data augmentation or altering the model itself.
arXiv Detail & Related papers (2024-03-04T12:13:59Z) - Large Language Models Are Reasoning Teachers [9.290757451344673]
Fine-tune-CoT is a method that generates reasoning samples from very large teacher models to fine-tune smaller models.
We find that Fine-tune-CoT enables substantial reasoning capability in small models, far outperforming prompt-based baselines and even the teacher model in many tasks.
arXiv Detail & Related papers (2022-12-20T08:24:45Z) - Distilling Reasoning Capabilities into Smaller Language Models [83.66051257039763]
Step-by-step reasoning approaches like chain of thought (CoT) have proved to be very effective in inducing reasoning capabilities in large language models.
However, the success of the CoT approach is fundamentally tied to the model size, and billion parameter-scale models are often needed to get CoT to work.
We propose a knowledge distillation approach that leverages the step-by-step CoT reasoning capabilities of larger models and distills these abilities into smaller models.
arXiv Detail & Related papers (2022-12-01T00:39:56Z) - Predicting on the Edge: Identifying Where a Larger Model Does Better [61.793778186198864]
We show that large models have the largest improvement on examples where the small model is most uncertain.
We show that a switcher model which defers examples to a larger model when a small model is uncertain can achieve striking improvements in performance and resource usage.
arXiv Detail & Related papers (2022-02-15T18:53:14Z) - When in Doubt, Summon the Titans: Efficient Inference with Large Models [80.2673230098021]
We propose a two-stage framework based on distillation that realizes the modelling benefits of large models.
We use the large teacher models to guide the lightweight student models to only make correct predictions on a subset of "easy" examples.
Our proposed use of distillation to only handle easy instances allows for a more aggressive trade-off in the student size, thereby reducing the amortized cost of inference.
arXiv Detail & Related papers (2021-10-19T22:56:49Z) - On the Efficacy of Small Self-Supervised Contrastive Models without
Distillation Signals [44.209171209780365]
Small models perform quite poorly under the paradigm of self-supervised contrastive learning.
Existing methods usually adopt a large off-the-shelf model to transfer knowledge to the small one via knowledge distillation.
Despite their effectiveness, distillation-based methods may not be suitable for some resource-restricted scenarios.
arXiv Detail & Related papers (2021-07-30T17:10:05Z) - MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that under reasonable conditions MixKD gives rise to a smaller gap between the error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z) - When Ensembling Smaller Models is More Efficient than Single Large
Models [52.38997176317532]
We show that ensembles can outperform single models with both higher accuracy and requiring fewer total FLOPs to compute.
This presents an interesting observation that output diversity in ensembling can often be more efficient than training larger models.
arXiv Detail & Related papers (2020-05-01T18:56:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.