Distilling Reasoning Capabilities into Smaller Language Models
- URL: http://arxiv.org/abs/2212.00193v2
- Date: Thu, 18 May 2023 04:44:51 GMT
- Title: Distilling Reasoning Capabilities into Smaller Language Models
- Authors: Kumar Shridhar, Alessandro Stolfo, Mrinmaya Sachan
- Abstract summary: Step-by-step reasoning approaches like chain of thought (CoT) have proved to be very effective in inducing reasoning capabilities in large language models.
However, the success of the CoT approach is fundamentally tied to the model size, and billion parameter-scale models are often needed to get CoT to work.
We propose a knowledge distillation approach that leverages the step-by-step CoT reasoning capabilities of larger models and distills these abilities into smaller models.
- Score: 83.66051257039763
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Step-by-step reasoning approaches like chain of thought (CoT) have proved to be very effective in inducing reasoning capabilities in large language models. However, the success of the CoT approach is fundamentally tied to model size, and billion-parameter-scale models are often needed to get CoT to work. In this paper, we propose a knowledge distillation approach that leverages the step-by-step CoT reasoning capabilities of larger models and distills these abilities into smaller models. Specifically, we propose an alternative reasoning scheme, Socratic CoT, that learns a decomposition of the original problem into a sequence of subproblems and uses it to guide the intermediate reasoning steps. We use Socratic CoT to train a combination of two small distilled models: a problem decomposer and a subproblem solver. In practice, given a new problem, the two distilled models work in sync to decompose and solve complex problems. On multiple reasoning datasets (GSM8K, StrategyQA, and SVAMP), our proposed distillation strategies boost the performance of smaller models by over 70% compared to the baselines. Finally, we investigate when Socratic CoT is an effective alternative to CoT, demonstrating cases where a much smaller model (GPT-2 large) can outperform a 10X larger model (GPT-3 6B). Our code is available here: https://github.com/kumar-shridhar/Distiiling-LM
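The decomposer-solver pipeline described in the abstract is straightforward to express in code. Below is a minimal inference-time sketch, assuming two small causal LMs have already been fine-tuned as decomposer and solver; the checkpoint paths, prompt templates, and one-subquestion-per-line output format are illustrative assumptions, not the repository's exact interface.

```python
# Minimal sketch of Socratic CoT inference with two distilled models.
# Checkpoint paths and prompt formats are hypothetical placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

def load(name):
    return AutoTokenizer.from_pretrained(name), AutoModelForCausalLM.from_pretrained(name)

def generate(tok, model, prompt, max_new_tokens=64):
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Return only the newly generated continuation, not the prompt.
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Hypothetical fine-tuned checkpoints (e.g. small GPT-2 variants).
dec_tok, decomposer = load("path/to/distilled-decomposer")
sol_tok, solver = load("path/to/distilled-solver")

problem = ("Jenny has 3 boxes with 12 pencils each. "
           "She gives away 10 pencils. How many pencils are left?")

# 1) The decomposer proposes a sequence of subquestions, one per line.
raw = generate(dec_tok, decomposer, f"Problem: {problem}\nSubquestions:\n")
subquestions = [q for q in raw.strip().splitlines() if q]

# 2) The solver answers each subquestion in turn, conditioning on the
#    answers accumulated so far -- the "in sync" interplay of the two models.
context = f"Problem: {problem}\n"
for sq in subquestions:
    answer = generate(sol_tok, solver, context + f"Q: {sq}\nA:").strip()
    context += f"Q: {sq}\nA: {answer}\n"

print(context)  # the answer to the last subquestion is the final solution
```

The key design choice is that the solver conditions each step on the subanswers produced so far, so later reasoning steps can build on intermediate results.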
Related papers
- SIKeD: Self-guided Iterative Knowledge Distillation for mathematical reasoning [49.29200323760457]
Large Language Models (LLMs) can transfer their reasoning skills to smaller models.
Smaller models, however, are not expressive enough to fit the LLM's distribution over all strategies when distilled.
This reliance on one strategy poses a challenge for smaller models when attempting to solve reasoning tasks that may be difficult with their preferred strategy.
arXiv Detail & Related papers (2024-10-24T09:29:18Z)
- Truncated Consistency Models [57.50243901368328]
Training consistency models requires learning to map all intermediate points along probability flow (PF) ODE trajectories to their corresponding endpoints.
We empirically find that this training paradigm limits the one-step generation performance of consistency models.
We propose a new parameterization of the consistency function and a two-stage training procedure that prevents the truncated-time training from collapsing to a trivial solution.
arXiv Detail & Related papers (2024-10-18T22:38:08Z)
- Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models [63.36637269634553]
We present a novel method of further improving performance by requiring models to compare multiple reasoning chains.
We find that instruction tuning on Divergent Chain-of-Thought (DCoT) datasets boosts the performance of even smaller, and therefore more accessible, language models.
arXiv Detail & Related papers (2024-07-03T15:01:18Z)
- ChainLM: Empowering Large Language Models with Improved Chain-of-Thought Prompting [124.69672273754144]
Chain-of-Thought (CoT) prompting can enhance the reasoning capabilities of large language models (LLMs).
Existing CoT approaches usually focus on simpler reasoning tasks and thus result in low-quality and inconsistent CoT prompts.
We introduce CoTGenius, a novel framework designed for the automatic generation of superior CoT prompts.
arXiv Detail & Related papers (2024-03-21T11:34:26Z)
- Divide-or-Conquer? Which Part Should You Distill Your LLM? [38.62667131299918]
We devise a strategy that breaks down reasoning tasks into a problem decomposition phase and a problem solving phase.
We show that this two-stage strategy outperforms a single-stage solution.
arXiv Detail & Related papers (2024-02-22T22:28:46Z)
- First-Step Advantage: Importance of Starting Right in Multi-Step Math Reasoning [11.75364271481855]
Language models can solve complex reasoning tasks better by learning to generate rationales for their predictions.
We observe that smaller models in particular, when corrected, can solve tasks that they would otherwise have struggled with.
We propose QuestCoT, where a smaller model first asks itself how to start, before proceeding with a chain of reasoning.
arXiv Detail & Related papers (2023-11-14T06:45:31Z)
- Large Language Models Are Reasoning Teachers [9.290757451344673]
Fine-tune-CoT is a method that generates reasoning samples from very large teacher models to fine-tune smaller models.
We find that Fine-tune-CoT enables substantial reasoning capability in small models, far outperforming prompt-based baselines and even the teacher model in many tasks (a sketch of this data-generation recipe appears after this list).
arXiv Detail & Related papers (2022-12-20T08:24:45Z)
- Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average 1.52x speedup across six different models over state-of-the-art out-of-core methods.
Our data-parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g., Megatron-LM and Turing-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z)
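As noted in the "Large Language Models Are Reasoning Teachers" entry above, the core recipe behind Fine-tune-CoT (and, with an added decomposition step, Socratic CoT distillation) is to sample rationales from a large teacher, keep those that reach the gold answer, and fine-tune a small student on the survivors. The sketch below illustrates the data-generation step under stated assumptions: `teacher_generate`, the prompt template, and the answer-parsing heuristic are all placeholders, not any paper's exact implementation.

```python
# Minimal sketch of teacher-rationale data generation for distillation.
# The teacher call, prompts, and answer parsing are illustrative assumptions.
import json

def final_answer(rationale: str) -> str:
    # Assumes rationales end with "... The answer is <x>."
    return rationale.rsplit("The answer is", 1)[-1].strip(" .")

def build_distillation_set(examples, teacher_generate, samples_per_question=4):
    """Keep only teacher rationales whose final answer matches the gold label."""
    dataset = []
    for ex in examples:
        prompt = f"Q: {ex['question']}\nA: Let's think step by step."
        for _ in range(samples_per_question):
            rationale = teacher_generate(prompt)
            if final_answer(rationale) == ex["answer"]:
                dataset.append({"prompt": f"Q: {ex['question']}\nA:",
                                "completion": " " + rationale})
    return dataset

if __name__ == "__main__":
    # Dummy teacher for demonstration; replace with a real large-model API call.
    dummy_teacher = lambda prompt: "3 * 12 = 36; 36 - 10 = 26. The answer is 26."
    examples = [{"question": "Jenny has 3 boxes with 12 pencils each. "
                             "She gives away 10. How many are left?",
                 "answer": "26"}]
    with open("distillation_data.jsonl", "w") as f:
        for row in build_distillation_set(examples, dummy_teacher):
            f.write(json.dumps(row) + "\n")
```

The answer-matching filter is what turns the teacher's noisy samples into a clean fine-tuning set; the resulting (prompt, completion) pairs can be fed to any standard causal-LM fine-tuning loop.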
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.