Distilling Reasoning Capabilities into Smaller Language Models
- URL: http://arxiv.org/abs/2212.00193v2
- Date: Thu, 18 May 2023 04:44:51 GMT
- Title: Distilling Reasoning Capabilities into Smaller Language Models
- Authors: Kumar Shridhar, Alessandro Stolfo, Mrinmaya Sachan
- Abstract summary: Step-by-step reasoning approaches like chain of thought (CoT) have proved to be very effective in inducing reasoning capabilities in large language models.
However, the success of the CoT approach is fundamentally tied to the model size, and billion parameter-scale models are often needed to get CoT to work.
We propose a knowledge distillation approach that leverages the step-by-step CoT reasoning capabilities of larger models and distills these abilities into smaller models.
- Score: 83.66051257039763
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Step-by-step reasoning approaches like chain of thought (CoT) have proved to
be very effective in inducing reasoning capabilities in large language models.
However, the success of the CoT approach is fundamentally tied to the model
size, and billion parameter-scale models are often needed to get CoT to work.
In this paper, we propose a knowledge distillation approach that leverages the
step-by-step CoT reasoning capabilities of larger models and distills these
abilities into smaller models.
In this work, we propose an alternative reasoning scheme, Socratic CoT, that
learns a decomposition of the original problem into a sequence of subproblems
and uses it to guide the intermediate reasoning steps. We use Socratic CoT to
train a combination of two small distilled models: a problem decomposer and a
subproblem solver. In practice, given a new problem, the two distilled models
work in sync to decompose and solve complex problems. On multiple reasoning
datasets (GSM8K, StrategyQA, and SVAMP), our proposed distillation strategies
boost the performance of smaller models by over 70% compared to the baselines.
Finally, we investigate when Socratic CoT is an effective alternative to CoT,
demonstrating cases where a much smaller model (GPT-2 large) can outperform a
10X larger model (GPT-3 6B). Our code is available here:
https://github.com/kumar-shridhar/Distiiling-LM
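The two-model setup described in the abstract (a problem decomposer feeding a subproblem solver) can be pictured with a minimal sketch. Everything below is an illustrative assumption rather than the authors' implementation: the names `socratic_solve`, `toy_decompose`, and `toy_solve` are hypothetical, and the two toy functions are hard-coded stand-ins for the distilled models, which in practice would be fine-tuned language models.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper's code):
# a decomposer maps a problem to an ordered list of subquestions;
# a solver answers one subquestion given the problem and earlier (question, answer) pairs.
Decomposer = Callable[[str], List[str]]
Solver = Callable[[str, List[Tuple[str, str]], str], str]


def socratic_solve(
    problem: str, decompose: Decomposer, solve: Solver
) -> List[Tuple[str, str]]:
    """Answer subquestions in order, passing earlier answers on as context."""
    history: List[Tuple[str, str]] = []
    for subquestion in decompose(problem):
        answer = solve(problem, history, subquestion)
        history.append((subquestion, answer))
    return history


PROBLEM = "Ann has 5 apples, buys 3 more, then gives away 2. How many are left?"


def toy_decompose(problem: str) -> List[str]:
    # Stand-in for a distilled decomposer model; the output is hard-coded.
    return [
        "How many apples does Ann have after buying more?",
        "How many apples are left after she gives some away?",
    ]


def toy_solve(problem: str, history: List[Tuple[str, str]], subquestion: str) -> str:
    # Stand-in for a distilled solver model; a lookup, not real reasoning.
    if "buying" in subquestion:
        return str(5 + 3)
    previous = int(history[-1][1])  # reuse the answer to the previous subquestion
    return str(previous - 2)


if __name__ == "__main__":
    for question, answer in socratic_solve(PROBLEM, toy_decompose, toy_solve):
        print(f"{question} -> {answer}")
```

In this sketch the answer to the last subquestion is treated as the final answer; whether the actual system aggregates answers this way is an assumption.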
Related papers
- Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models [63.36637269634553]
We present a novel method of further improving performance by requiring models to compare multiple reasoning chains.
We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, language models.
arXiv Detail & Related papers (2024-07-03T15:01:18Z)
- From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step [47.608403357284026]
In this paper, we investigate if models can be taught to internalize explicit chain-of-thought (CoT) steps.
We propose a simple yet effective method for internalizing CoT steps, starting with a model trained for explicit CoT reasoning.
Our method proves effective on larger language models, such as Mistral 7B, achieving over 50% accuracy on GSM8K without producing any intermediate steps.
arXiv Detail & Related papers (2024-05-23T17:54:14Z)
- ChainLM: Empowering Large Language Models with Improved Chain-of-Thought Prompting [124.69672273754144]
Chain-of-Thought (CoT) prompting can enhance the reasoning capabilities of large language models (LLMs).
Existing CoT approaches usually focus on simpler reasoning tasks and thus result in low-quality and inconsistent CoT prompts.
We introduce CoTGenius, a novel framework designed for the automatic generation of superior CoT prompts.
arXiv Detail & Related papers (2024-03-21T11:34:26Z)
- Divide-or-Conquer? Which Part Should You Distill Your LLM? [40.563633582127316]
We devise a similar strategy that breaks down reasoning tasks into a problem decomposition phase and a problem solving phase.
We show that the strategy is able to outperform a single stage solution.
arXiv Detail & Related papers (2024-02-22T22:28:46Z)
- First-Step Advantage: Importance of Starting Right in Multi-Step Math Reasoning [11.75364271481855]
Language models can solve complex reasoning tasks better by learning to generate rationales for their predictions.
We observe that smaller models in particular, when corrected, can solve a task that they would otherwise have struggled with.
We propose QuestCoT, where a smaller model first asks itself how to start, before proceeding with a chain of reasoning.
arXiv Detail & Related papers (2023-11-14T06:45:31Z)
- Guiding Language Model Math Reasoning with Planning Tokens [128.57605860640948]
We introduce planning tokens at the start of each reasoning step, serving as a guide for the model, and add their embeddings to the model parameters.
Our approach requires a negligible increase in trainable parameters (just 0.001%) and can be applied through either full fine-tuning or a more parameter-efficient scheme.
arXiv Detail & Related papers (2023-10-09T13:29:37Z)
- Large Language Models Are Reasoning Teachers [9.290757451344673]
Fine-tune-CoT is a method that generates reasoning samples from very large teacher models to fine-tune smaller models.
We find that Fine-tune-CoT enables substantial reasoning capability in small models, far outperforming prompt-based baselines and even the teacher model in many tasks.
arXiv Detail & Related papers (2022-12-20T08:24:45Z)
- MiniALBERT: Model Distillation via Parameter-Efficient Recursive Transformers [12.432191400869002]
MiniALBERT is a technique for converting the knowledge of fully parameterised LMs (such as BERT) into a compact recursive student.
We test our proposed models on a number of general and biomedical NLP tasks to demonstrate their viability and compare them with the state-of-the-art and other existing compact models.
arXiv Detail & Related papers (2022-10-12T17:23:21Z)
- Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.