Symbolic Chain-of-Thought Distillation: Small Models Can Also "Think" Step-by-Step
- URL: http://arxiv.org/abs/2306.14050v2
- Date: Mon, 15 Apr 2024 21:58:27 GMT
- Title: Symbolic Chain-of-Thought Distillation: Small Models Can Also "Think" Step-by-Step
- Authors: Liunian Harold Li, Jack Hessel, Youngjae Yu, Xiang Ren, Kai-Wei Chang, Yejin Choi,
- Abstract summary: Chain-of-thought prompting primes large language models to verbalize rationalization for their predictions.
We show that orders-of-magnitude smaller models can still benefit from chain-of-thought prompting.
We introduce Chain-of-Thought Distillation (SCoTD), a method to train a smaller student model on rationalizations sampled from a significantly larger teacher model.
- Score: 133.60124577507727
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chain-of-thought prompting (e.g., "Let's think step-by-step") primes large language models to verbalize rationalization for their predictions. While chain-of-thought can lead to dramatic performance gains, benefits appear to emerge only for sufficiently large models (beyond 50B parameters). We show that orders-of-magnitude smaller models (125M -- 1.3B parameters) can still benefit from chain-of-thought prompting. To achieve this, we introduce Symbolic Chain-of-Thought Distillation (SCoTD), a method to train a smaller student model on rationalizations sampled from a significantly larger teacher model. Experiments across several commonsense benchmarks show that: 1) SCoTD enhances the performance of the student model in both supervised and few-shot settings, and especially for challenge sets; 2) sampling many reasoning chains per instance from the teacher is paramount; and 3) after distillation, student chain-of-thoughts are judged by humans as comparable to the teacher, despite orders of magnitude fewer parameters. We test several hypotheses regarding what properties of chain-of-thought samples are important, e.g., diversity vs. teacher likelihood vs. open-endedness. We release our corpus of chain-of-thought samples and code.
Related papers
- Contrastive Chain-of-Thought Prompting [74.10511560147293]
We propose contrastive chain of thought to enhance language model reasoning.
Compared to the conventional chain of thought, our approach provides both valid and invalid reasoning demonstrations.
Our experiments on reasoning benchmarks demonstrate that contrastive chain of thought can serve as a general enhancement of chain-of-thought prompting.
arXiv Detail & Related papers (2023-11-15T18:54:01Z) - SCOTT: Self-Consistent Chain-of-Thought Distillation [68.40232422158569]
Large language models (LMs) generate free-text rationales for their predictions via chain-of-thought prompting.
We propose a faithful knowledge distillation method to learn a small, self-consistent CoT model from a teacher model that is orders of magnitude larger.
To ensure faithful distillation, we use the teacher-generated rationales to learn a student LM with a counterfactual reasoning objective.
arXiv Detail & Related papers (2023-05-03T03:47:00Z) - Synthetic Prompting: Generating Chain-of-Thought Demonstrations for
Large Language Models [121.54462976635743]
Large language models can perform various reasoning tasks by using chain-of-thought prompting, which guides them to find answers through step-by-step demonstrations.
We introduce Synthetic prompting, a method that leverages a few handcrafted examples to prompt the model to generate more examples by itself.
We evaluate our method on numerical, symbolic, and algorithmic reasoning tasks, and show that it outperforms existing prompting techniques.
arXiv Detail & Related papers (2023-02-01T17:33:12Z) - Large Language Models Are Reasoning Teachers [9.290757451344673]
Fine-tune-CoT is a method that generates reasoning samples from very large teacher models to fine-tune smaller models.
We find that Fine-tune-CoT enables substantial reasoning capability in small models, far outperforming prompt-based baselines and even the teacher model in many tasks.
arXiv Detail & Related papers (2022-12-20T08:24:45Z) - Reasoning Circuits: Few-shot Multihop Question Generation with
Structured Rationales [11.068901022944015]
Chain-of-thought rationale generation has been shown to improve performance on multi-step reasoning tasks.
We introduce a new framework for applying chain-of-thought inspired structured rationale generation to multi-hop question generation under a very low supervision regime.
arXiv Detail & Related papers (2022-11-15T19:36:06Z) - Automatic Chain of Thought Prompting in Large Language Models [20.54898481696753]
Large language models (LLMs) can perform complex reasoning by generating intermediate reasoning steps.
"Let's think step by step" prompt generates reasoning chains for demonstrations one by one.
We propose an automatic CoT prompting method: Auto-CoT.
arXiv Detail & Related papers (2022-10-07T12:28:21Z) - Text and Patterns: For Effective Chain of Thought, It Takes Two to Tango [11.344587937052697]
This work initiates the preliminary steps towards a deeper understanding of reasoning mechanisms in large language models.
Our work centers around querying the model while controlling for all but one of the components in a prompt: symbols, patterns, and text.
We posit that text imbues patterns with commonsense knowledge and meaning.
arXiv Detail & Related papers (2022-09-16T02:54:00Z) - Chaos is a Ladder: A New Theoretical Understanding of Contrastive
Learning via Augmentation Overlap [64.60460828425502]
We propose a new guarantee on the downstream performance of contrastive learning.
Our new theory hinges on the insight that the support of different intra-class samples will become more overlapped under aggressive data augmentations.
We propose an unsupervised model selection metric ARC that aligns well with downstream accuracy.
arXiv Detail & Related papers (2022-03-25T05:36:26Z) - Mnemonics Training: Multi-Class Incremental Learning without Forgetting [131.1065577648532]
Multi-Class Incremental Learning (MCIL) aims to learn new concepts by incrementally updating a model trained on previous concepts.
This paper proposes a novel and automatic framework we call mnemonics, where we parameterize exemplars and make them optimizable in an end-to-end manner.
We conduct extensive experiments on three MCIL benchmarks, CIFAR-100, ImageNet-Subset and ImageNet, and show that using mnemonics exemplars can surpass the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-02-24T12:55:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.