Lion: Adversarial Distillation of Proprietary Large Language Models
- URL: http://arxiv.org/abs/2305.12870v2
- Date: Sat, 14 Oct 2023 02:21:24 GMT
- Title: Lion: Adversarial Distillation of Proprietary Large Language Models
- Authors: Yuxin Jiang, Chunkit Chan, Mingyang Chen, Wei Wang
- Abstract summary: We propose a novel adversarial distillation framework for a more efficient knowledge transfer.
We successfully transfer knowledge from ChatGPT to a student model (named Lion) using only 70k training examples.
- Score: 16.245052771463044
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The practice of transferring knowledge from a sophisticated, proprietary
large language model (LLM) to a compact, open-source LLM has garnered
considerable attention. Previous works have focused on unidirectional
knowledge distillation, aligning the responses of the student model with
those of the teacher model on a set of instructions. Nevertheless, they
overlooked the possibility of incorporating any reciprocal
"feedback"--identifying challenging instructions where the student model's
performance falls short--to boost the student model's proficiency iteratively.
To this end, we propose a novel adversarial distillation framework for a more
efficient knowledge transfer. Leveraging the versatile role adaptability of
LLMs, we prompt the teacher model to identify "hard" instructions and generate
new "hard" instructions for the student model, creating a three-stage
adversarial loop of imitation, discrimination, and generation. By applying this
adversarial framework, we successfully transfer knowledge from ChatGPT to a
student model (named Lion), using only 70k training examples. Our results show
that Lion-13B not only achieves comparable open-ended generation capabilities
to ChatGPT but surpasses conventional state-of-the-art (SOTA) instruction-tuned
models like Vicuna-13B by 55.4% in challenging zero-shot reasoning benchmarks
such as BIG-Bench Hard (BBH) and 16.7% on AGIEval. Code and model can be found
at https://github.com/YJiangcm/Lion.
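The three-stage loop described in the abstract (imitation, discrimination, generation) can be sketched roughly as follows. This is a toy illustration only, not the authors' implementation: `ToyStudent`, `toy_teacher`, `toy_referee`, `toy_generator`, the train/cache pool split, and the `threshold` parameter are all hypothetical stand-ins. In practice the teacher, referee, and generator roles would be prompts sent to the proprietary LLM, and `fit` would be a fine-tuning run.

```python
from typing import Callable, Dict, List


class ToyStudent:
    """Hypothetical stand-in for a fine-tunable student model."""

    def __init__(self) -> None:
        self.topics = set()

    def fit(self, pairs: Dict[str, str]) -> None:
        # "Training" here only records which topics the student has seen.
        for instruction in pairs:
            self.topics.add(instruction.split(":")[0])

    def answer(self, instruction: str) -> str:
        # The toy student matches the teacher only on topics it has seen.
        if instruction.split(":")[0] in self.topics:
            return instruction.upper()
        return "I don't know"


def toy_teacher(instruction: str) -> str:
    # Assumption: the teacher's "reference response" is the upper-cased text.
    return instruction.upper()


def toy_referee(instruction: str, teacher_out: str, student_out: str) -> float:
    # Assumption: gap score of 0.0 when the responses agree, 1.0 otherwise.
    return 0.0 if teacher_out == student_out else 1.0


def toy_generator(hard: List[str]) -> List[str]:
    # Assumption: new instructions stay on the topic of each hard instruction.
    return [f"{h.split(':')[0]}: new hard task {i}" for i, h in enumerate(hard)]


def adversarial_distillation(
    train_pool: List[str],
    cache_pool: List[str],
    student: ToyStudent,
    teacher: Callable[[str], str],
    referee: Callable[[str, str, str], float],
    generator: Callable[[List[str]], List[str]],
    rounds: int = 2,
    threshold: float = 0.5,
) -> List[List[str]]:
    train_pool = list(train_pool)  # copy so the caller's list is untouched
    history = []
    for _ in range(rounds):
        # 1. Imitation: align the student with teacher responses.
        student.fit({inst: teacher(inst) for inst in train_pool})
        # 2. Discrimination: flag instructions where the student falls short.
        hard = [
            inst for inst in cache_pool
            if referee(inst, teacher(inst), student.answer(inst)) > threshold
        ]
        history.append(hard)
        # 3. Generation: add new instructions resembling the hard ones.
        train_pool.extend(generator(hard))
    return history


student = ToyStudent()
history = adversarial_distillation(
    ["math: add 2 and 3"],
    ["math: multiply 4 by 5", "logic: negate a claim"],
    student, toy_teacher, toy_referee, toy_generator,
)
print(history)
```

In this toy run the first round flags the unseen "logic" instruction as hard, the generator adds a new "logic" instruction to the training pool, and the second round finds no hard instructions left. A real setup would need a graded referee score and a much larger pool.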
Related papers
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- LLAVADI: What Matters For Multimodal Large Language Models Distillation [77.73964744238519]
In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch.
Our studies involve training strategies, model choices, and distillation algorithms in the knowledge distillation process.
Evaluated on different benchmarks and with a proper strategy, even a small-scale 2.7B model can perform on par with larger models with 7B or 13B parameters.
arXiv Detail & Related papers (2024-07-28T06:10:47Z)
- Distillation Matters: Empowering Sequential Recommenders to Match the Performance of Large Language Model [12.6937643116018]
Large Language Models (LLMs) have been effectively utilized as recommenders, achieving impressive performance.
However, the high inference latency of LLMs significantly restricts their practical deployment.
This work investigates knowledge distillation from cumbersome LLM-based recommendation models to lightweight sequential models.
arXiv Detail & Related papers (2024-05-01T06:23:54Z)
- Information-Theoretic Distillation for Reference-less Summarization [67.51150817011617]
We present a novel framework to distill a powerful summarizer based on the information-theoretic objective for summarization.
We start off from Pythia-2.8B as the teacher model, which is not yet capable of summarization.
We arrive at a compact but powerful summarizer with only 568M parameters that performs competitively against ChatGPT.
arXiv Detail & Related papers (2024-03-20T17:42:08Z)
- Improving In-context Learning via Bidirectional Alignment [41.214003703218914]
Large language models (LLMs) have shown impressive few-shot generalization on many tasks via in-context learning (ICL).
We propose Bidirectional Alignment (BiAlign) to fully leverage the models' preferences for ICL examples to improve the ICL abilities of student models.
Specifically, we introduce the alignment of input preferences between student and teacher models by incorporating a novel ranking loss.
arXiv Detail & Related papers (2023-12-28T15:02:03Z)
- Teaching Language Models to Self-Improve through Interactive Demonstrations [83.9421355808174]
The self-improving ability of large language models has been shown to be absent in smaller models and difficult for them to learn.
We introduce TriPosT, a training algorithm that endows smaller models with such self-improvement ability.
We show that our approach can improve the performance of LLaMA-7b on math and reasoning tasks by up to 7.13%.
arXiv Detail & Related papers (2023-10-20T14:11:04Z)
- MiniLLM: Knowledge Distillation of Large Language Models [112.93051247165089]
Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs).
We propose a KD approach that distills LLMs into smaller language models.
Our method is scalable for different model families with 120M to 13B parameters.
arXiv Detail & Related papers (2023-06-14T14:44:03Z)
- Orca: Progressive Learning from Complex Explanation Traces of GPT-4 [22.526048553548726]
We develop Orca, a 13-billion parameter model that learns to imitate the reasoning process of large foundation models (LFMs).
Orca learns from rich signals from GPT-4 including explanation traces; step-by-step thought processes; and other complex instructions.
Orca surpasses conventional state-of-the-art instruction-tuned models such as Vicuna-13B by more than 100% in complex zero-shot reasoning benchmarks.
arXiv Detail & Related papers (2023-06-05T08:58:39Z)
- Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models [125.91897197446379]
We find that MoE models benefit more from instruction tuning than dense models.
Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks.
arXiv Detail & Related papers (2023-05-24T04:22:26Z)
- Explicit Knowledge Transfer for Weakly-Supervised Code Generation [14.758396460685017]
We propose explicit knowledge transfer (EKT) to transfer the code generation ability of an LLM to a smaller model.
EKT uses the few-shot capabilities of a teacher LLM to create NL-code pairs that we then filter for correctness and fine-tune the student on.
We find that EKT not only yields better performance than training with expert iteration, but also outperforms knowledge distillation.
arXiv Detail & Related papers (2022-11-30T04:51:26Z)
- Boosting Contrastive Learning with Relation Knowledge Distillation [12.14219750487548]
We propose a relation-wise contrastive paradigm with Relation Knowledge Distillation (ReKD).
We show that our method achieves significant improvements on multiple lightweight models.
arXiv Detail & Related papers (2021-12-08T08:49:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.