Specializing Smaller Language Models towards Multi-Step Reasoning
- URL: http://arxiv.org/abs/2301.12726v1
- Date: Mon, 30 Jan 2023 08:51:19 GMT
- Title: Specializing Smaller Language Models towards Multi-Step Reasoning
- Authors: Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal and Tushar Khot
- Abstract summary: We show that such abilities can be distilled down from GPT-3.5 ($\ge$ 175B) to T5 variants ($\le$ 11B).
We propose model specialization to concentrate the model's ability on a target task.
- Score: 56.78474185485288
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The surprising ability of Large Language Models (LLMs) to perform well on
complex reasoning with only few-shot chain-of-thought prompts is believed to
emerge only in very large-scale models (100+ billion parameters). We show that
such abilities can, in fact, be distilled down from GPT-3.5 ($\ge$ 175B) to T5
variants ($\le$ 11B). We propose model specialization to concentrate the
model's ability on a target task. The hypothesis is that large models
(commonly viewed as larger than 100B) have strong modeling power, but are
spread across a large spectrum of tasks. Small models (commonly viewed as
smaller than 10B) have limited capacity, but if we concentrate that capacity
on a specific target task, they can achieve a decent improvement in
performance. We
use multi-step math reasoning as our testbed because it is a very typical
emergent ability. We show two important aspects of model abilities: (1)
there exists a complex balance/tradeoff between language models'
multi-dimensional abilities; (2) by paying the price of decreased generic
ability, we can clearly lift the scaling curve of models smaller than 10B
towards a specialized multi-step math reasoning ability. We further give
comprehensive discussions about important design choices for better
generalization, including the tuning data format, the start model checkpoint,
and a new model selection method. We hope our practice and findings can
serve as an important step towards specialized smaller models in the new
research paradigm set by LLMs.
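
To make the recipe concrete, below is a minimal sketch of the specialization step, assuming chain-of-thought training data has already been sampled from a large teacher such as GPT-3.5. The dataset example, prompt format, student size (t5-base), and hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the specialization recipe: fine-tune a T5 student on
# chain-of-thought (CoT) targets produced by a much larger teacher.
# Data, prompt format, and hyperparameters are illustrative placeholders.
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

# Each example pairs a math question with a teacher-generated reasoning chain
# that ends in the final answer (grade-school-math style example).
distilled_data = [
    {
        "question": "Natalia sold 48 clips in April and half as many in May. How many in total?",
        "cot": "April: 48 clips. May: 48 / 2 = 24 clips. Total: 48 + 24 = 72. The answer is 72.",
    },
]

def collate(batch):
    inputs = tokenizer(
        ["Question: " + ex["question"] for ex in batch],
        padding=True, truncation=True, return_tensors="pt",
    )
    labels = tokenizer(
        [ex["cot"] for ex in batch],
        padding=True, truncation=True, return_tensors="pt",
    ).input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    inputs["labels"] = labels
    return inputs

loader = DataLoader(distilled_data, batch_size=8, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

model.train()
for epoch in range(3):
    for batch in loader:
        loss = model(**batch).loss  # standard seq2seq cross-entropy on the CoT target
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

At inference time the specialized student is prompted with the same "Question: ..." format and decodes the full reasoning chain before producing the final answer.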
Related papers
- What Matters for Model Merging at Scale? [94.26607564817786]
Model merging aims to combine multiple expert models into a more capable single model.
Previous studies have primarily focused on merging a few small models.
This study systematically evaluates the utility of model merging at scale.
arXiv Detail & Related papers (2024-10-04T17:17:19Z)
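
The entry above describes the goal of model merging rather than a specific algorithm. As a point of reference only, the sketch below shows the simplest baseline, uniform parameter averaging of same-architecture expert checkpoints, with hypothetical checkpoint names; it is not the procedure evaluated in that paper.

```python
# Hedged illustration of the simplest merging baseline: uniform averaging of
# parameters across expert checkpoints that share one architecture.
# Checkpoint names are hypothetical placeholders.
import torch
from transformers import AutoModelForSeq2SeqLM

expert_names = ["org/expert-math", "org/expert-code", "org/expert-summarization"]
experts = [AutoModelForSeq2SeqLM.from_pretrained(name) for name in expert_names]

merged = AutoModelForSeq2SeqLM.from_pretrained(expert_names[0])
state = merged.state_dict()
for key, value in state.items():
    if value.is_floating_point():  # average weights; leave integer buffers untouched
        state[key] = torch.stack([e.state_dict()[key] for e in experts], dim=0).mean(dim=0)
merged.load_state_dict(state)
merged.save_pretrained("merged-expert")  # hypothetical output path
```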
- LLAVADI: What Matters For Multimodal Large Language Models Distillation [77.73964744238519]
In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch.
Our studies involve training strategies, model choices, and distillation algorithms in the knowledge distillation process.
By evaluating on different benchmarks and with a proper strategy, even a 2.7B small-scale model can perform on par with larger models of 7B or 13B parameters.
arXiv Detail & Related papers (2024-07-28T06:10:47Z)
- Large Language Model Pruning [0.0]
We suggest a model pruning technique specifically focused on LLMs.
The proposed methodology emphasizes the explainability of deep learning models.
We also explore the difference between pruning on large-scale models vs. pruning on small-scale models.
arXiv Detail & Related papers (2024-05-24T18:22:15Z)
- Emergent Abilities in Reduced-Scale Generative Language Models [10.51168925267033]
Large language models can solve new tasks without task-specific fine-tuning.
This ability is considered an emergent ability and is primarily seen in large language models with billions of parameters.
This study investigates if such emergent properties are strictly tied to model size or can be demonstrated by smaller models trained on reduced-scale data.
arXiv Detail & Related papers (2024-04-02T18:00:28Z)
- When Do We Not Need Larger Vision Models? [55.957626371697785]
Scaling up the size of vision models has been the de facto standard to obtain more powerful visual representations.
We demonstrate the power of Scaling on Scales (S$^2$), whereby a pre-trained and frozen smaller vision model can outperform larger models.
We release a Python package that can apply S$^2$ to any vision model with one line of code.
arXiv Detail & Related papers (2024-03-19T17:58:39Z)
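
The S$^2$ idea above can be illustrated with a generic multi-scale sketch: run a frozen backbone on several rescaled copies of the input and concatenate the pooled features. This is a conceptual illustration using a torchvision ResNet-50 as a stand-in backbone; it does not reproduce the released package's API or the paper's exact feature-merging scheme.

```python
# Hedged sketch of the multi-scale idea behind Scaling on Scales (S^2):
# run a frozen backbone at several input resolutions and concatenate the
# pooled features. Conceptual only; not the released package.
import torch
import torch.nn.functional as F
import torchvision

backbone = torchvision.models.resnet50(weights="DEFAULT")
backbone.fc = torch.nn.Identity()  # keep the 2048-d pooled features
backbone.eval()                    # the backbone stays frozen

def s2_features(images: torch.Tensor, scales=(224, 448, 672)) -> torch.Tensor:
    """images: (B, 3, H, W) batch; returns (B, 2048 * len(scales)) features."""
    feats = []
    with torch.no_grad():
        for size in scales:
            resized = F.interpolate(images, size=(size, size), mode="bilinear",
                                    align_corners=False)
            feats.append(backbone(resized))
    return torch.cat(feats, dim=-1)

x = torch.randn(2, 3, 224, 224)
print(s2_features(x).shape)  # torch.Size([2, 6144])
```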
- Large Language Models Are Reasoning Teachers [9.290757451344673]
Fine-tune-CoT is a method that generates reasoning samples from very large teacher models to fine-tune smaller models.
We find that Fine-tune-CoT enables substantial reasoning capability in small models, far outperforming prompt-based baselines and even the teacher model in many tasks.
arXiv Detail & Related papers (2022-12-20T08:24:45Z)
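
The teacher-side half of this recipe can be sketched as follows: sample several step-by-step solutions per question, keep only those whose final answer matches the reference, and use the surviving (question, rationale) pairs as fine-tuning targets. The `teacher_generate` callable and the answer-extraction heuristic below are hypothetical stand-ins, not the paper's exact pipeline.

```python
# Hedged sketch of teacher-side data generation for reasoning distillation:
# sample several chains of thought per question and keep only those whose
# final answer matches the reference. `teacher_generate` is a hypothetical
# stand-in for the large teacher model's sampling API.
import re
from typing import Callable, Dict, List

def extract_answer(cot: str) -> str:
    """Pull the last number out of a reasoning chain (simplistic heuristic)."""
    numbers = re.findall(r"-?\d+\.?\d*", cot)
    return numbers[-1] if numbers else ""

def build_distillation_set(
    questions: List[Dict[str, str]],               # [{"question": ..., "answer": ...}]
    teacher_generate: Callable[[str, int], List[str]],
    samples_per_question: int = 8,
) -> List[Dict[str, str]]:
    dataset = []
    for item in questions:
        prompt = f"Q: {item['question']}\nA: Let's think step by step."
        for cot in teacher_generate(prompt, samples_per_question):
            if extract_answer(cot) == item["answer"]:  # keep correct rationales only
                dataset.append({"question": item["question"], "cot": cot})
    return dataset
```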
- What Language Model to Train if You Have One Million GPU Hours? [54.32062236748831]
We study different modeling practices and their impact on zero-shot generalization.
We also study the performance of a multilingual model and how it compares to the English-only one.
All our models and code are open-sourced at https://huggingface.co/bigscience.
arXiv Detail & Related papers (2022-10-27T13:43:27Z)
- Exploring Sparse Expert Models and Beyond [51.90860155810848]
Mixture-of-Experts (MoE) models can achieve promising results with an outrageously large number of parameters but constant computation cost.
We propose a simple method called expert prototyping that splits experts into different prototypes and applies $k$ top-$1$ routing.
This strategy improves model quality while maintaining constant computational cost, and our further exploration of extremely large-scale models shows that it is more effective for training larger models.
arXiv Detail & Related papers (2021-05-31T16:12:44Z)
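
A rough reading of the expert-prototyping idea: partition the experts into k groups, let each group route every token to its single best expert, and sum the k selected outputs, so compute stays near k expert calls per token. The sketch below is an illustration under that reading, not the paper's implementation.

```python
# Hedged sketch of k top-1 routing over expert prototypes: experts are split
# into k groups, each group picks its single best expert per token, and the
# k selected outputs are summed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypedMoE(nn.Module):
    def __init__(self, d_model: int, num_experts: int, k: int, d_ff: int = 2048):
        super().__init__()
        assert num_experts % k == 0, "experts must split evenly into k prototypes"
        self.k, self.group_size = k, num_experts // k
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        # one top-1 router per prototype group
        self.routers = nn.ModuleList([nn.Linear(d_model, self.group_size) for _ in range(k)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); output has the same shape
        out = torch.zeros_like(x)
        for g, router in enumerate(self.routers):
            gate = F.softmax(router(x), dim=-1)      # (num_tokens, group_size)
            top_val, top_idx = gate.max(dim=-1)      # top-1 expert within this group
            for local in range(self.group_size):
                mask = top_idx == local
                if mask.any():
                    expert = self.experts[g * self.group_size + local]
                    out[mask] += top_val[mask].unsqueeze(-1) * expert(x[mask])
        return out

layer = PrototypedMoE(d_model=512, num_experts=16, k=4)
tokens = torch.randn(10, 512)
print(layer(tokens).shape)  # torch.Size([10, 512])
```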
This list is automatically generated from the titles and abstracts of the papers on this site.