Distilling Mathematical Reasoning Capabilities into Small Language
Models
- URL: http://arxiv.org/abs/2401.11864v4
- Date: Thu, 1 Feb 2024 18:16:04 GMT
- Title: Distilling Mathematical Reasoning Capabilities into Small Language
Models
- Authors: Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang
- Abstract summary: This work addresses the challenge of democratizing advanced Large Language Models (LLMs) by compressing their mathematical reasoning capabilities into sub-billion parameter Small Language Models (SLMs)
We introduce Equation-of-Thought Distillation (EoTD), a novel technique that encapsulates the reasoning process into equation-based representations to construct an EoTD dataset for fine-tuning SLMs.
- Score: 23.354025348567077
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work addresses the challenge of democratizing advanced Large Language
Models (LLMs) by compressing their mathematical reasoning capabilities into
sub-billion parameter Small Language Models (SLMs) without compromising
performance. We introduce Equation-of-Thought Distillation (EoTD), a novel
technique that encapsulates the reasoning process into equation-based
representations to construct an EoTD dataset for fine-tuning SLMs.
Additionally, we propose the Ensemble Thoughts Distillation (ETD) framework to
enhance the reasoning performance of SLMs. This involves creating a reasoning
dataset with multiple thought processes, including Chain-of-Thought (CoT),
Program-of-Thought (PoT), and Equation-of-Thought (EoT), and using it for
fine-tuning. Our experimental findings demonstrate that EoTD significantly
boosts the reasoning abilities of SLMs, while ETD enables these models to
achieve state-of-the-art reasoning performance.
Related papers
- Key-Point-Driven Mathematical Reasoning Distillation of Large Language Model [15.542737858152053]
We propose Key-Point-Driven Mathematical Reasoning Distillation (KPDD) to mitigate misunderstanding errors.
KPDD enhances the reasoning performance of SLMs by breaking down the problem-solving process into three stages.
Experiments show KPDD-CoT significantly improves reasoning abilities, while KPDD-PoT achieves state-of-the-art performance in mathematical reasoning tasks.
arXiv Detail & Related papers (2024-07-14T11:41:03Z) - Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models [63.36637269634553]
We present a novel method of further improving performance by requiring models to compare multiple reasoning chains.
We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, language models.
arXiv Detail & Related papers (2024-07-03T15:01:18Z) - Enabling Natural Zero-Shot Prompting on Encoder Models via Statement-Tuning [55.265138447400744]
Statement-Tuning is a technique that models discriminative tasks as a set of finite statements and trains an model to discriminate between the potential statements to determine the label.
Experimental results demonstrate that Statement Tuning achieves competitive performance compared to state-of-the-art LLMs with significantly fewer parameters.
arXiv Detail & Related papers (2024-04-19T14:05:03Z) - Dual Instruction Tuning with Large Language Models for Mathematical Reasoning [26.00472810721806]
We propose a dual instruction tuning strategy to model mathematical reasoning from both forward and reverse directions.
This involves introducing the Intermediate Reasoning State Prediction task (forward reasoning) and the Instruction Reconstruction task (reverse reasoning) to enhance the LLMs' understanding and execution of instructions.
Comprehensive experiments validate the effectiveness and domain generalization of the dual instruction tuning strategy across various mathematical reasoning tasks.
arXiv Detail & Related papers (2024-03-27T06:43:58Z) - Enhancing Large Language Model with Decomposed Reasoning for Emotion
Cause Pair Extraction [13.245873138716044]
Emotion-Cause Pair Extraction (ECPE) involves extracting clause pairs representing emotions and their causes in a document.
Inspired by recent work, we explore leveraging large language model (LLM) to address ECPE task without additional training.
We introduce chain-of-thought to mimic human cognitive process and propose the Decomposed Emotion-Cause Chain (DECC) framework.
arXiv Detail & Related papers (2024-01-31T10:20:01Z) - Guiding Language Model Math Reasoning with Planning Tokens [128.57605860640948]
We introduce planning tokens at the start of each reasoning step, serving as a guide for the model, and add their embeddings to the model parameters.
Our approach requires a negligible increase in trainable parameters (just 0.001%) and can be applied through either full fine-tuning or a more parameter-efficient scheme.
arXiv Detail & Related papers (2023-10-09T13:29:37Z) - No Train Still Gain. Unleash Mathematical Reasoning of Large Language
Models with Monte Carlo Tree Search Guided by Energy Function [3.0299876288833345]
Large language models (LLMs) demonstrate impressive language understanding and contextual learning abilities.
LLMs often struggle to generate correct reasoning steps and answers despite having high probabilities for the solutions.
We propose a method that incorporates Monte Carlo Tree Search (MCTS) and a lightweight energy function to rank decision steps.
arXiv Detail & Related papers (2023-09-01T13:10:54Z) - Sci-CoT: Leveraging Large Language Models for Enhanced Knowledge
Distillation in Small Models for Scientific QA [5.117094291273979]
Large Language Models (LLMs) have shown outstanding performance across wide range of downstream tasks.
We propose Sci-CoT, a two-stage framework that separates the processes of generating rationales and inferring answers.
Our 80-million parameter model is able to exceed the performance of BLOOM-176B in the ARC-Easy dataset under the few shot setting.
arXiv Detail & Related papers (2023-08-09T03:18:07Z) - MinT: Boosting Generalization in Mathematical Reasoning via Multi-View
Fine-Tuning [53.90744622542961]
Reasoning in mathematical domains remains a significant challenge for small language models (LMs)
We introduce a new method that exploits existing mathematical problem datasets with diverse annotation styles.
Experimental results show that our strategy enables a LLaMA-7B model to outperform prior approaches.
arXiv Detail & Related papers (2023-07-16T05:41:53Z) - Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models [74.40196814292426]
We propose Graph-of-Thought (GoT) reasoning, which models human thought processes not only as a chain but also as a graph.
GoT captures the non-sequential nature of human thinking and allows for a more realistic modeling of thought processes.
We evaluate GoT's performance on a text-only reasoning task and a multimodal reasoning task.
arXiv Detail & Related papers (2023-05-26T02:15:09Z) - Masked Language Modeling and the Distributional Hypothesis: Order Word
Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM)-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: pre-trains succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.