Orca: Progressive Learning from Complex Explanation Traces of GPT-4
- URL: http://arxiv.org/abs/2306.02707v1
- Date: Mon, 5 Jun 2023 08:58:39 GMT
- Title: Orca: Progressive Learning from Complex Explanation Traces of GPT-4
- Authors: Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal,
Hamid Palangi, Ahmed Awadallah
- Abstract summary: We develop Orca, a 13-billion parameter model that learns to imitate the reasoning process of LFMs.
Orca learns from rich signals from GPT-4, including explanation traces, step-by-step thought processes, and other complex instructions.
Orca surpasses conventional state-of-the-art instruction-tuned models such as Vicuna-13B by more than 100% in complex zero-shot reasoning benchmarks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent research has focused on enhancing the capability of smaller models
through imitation learning, drawing on the outputs generated by large
foundation models (LFMs). Several issues affect the quality of these models,
ranging from limited imitation signals from shallow LFM outputs and
small-scale, homogeneous training data to, most notably, a lack of rigorous
evaluation, which leads to overestimating the small model's capability, since
such models tend to imitate the style, but not the reasoning process, of
LFMs. To address these
challenges, we develop Orca (We are working with our legal team to publicly
release a diff of the model weights in accordance with LLaMA's release policy
to be published at https://aka.ms/orca-lm), a 13-billion parameter model that
learns to imitate the reasoning process of LFMs. Orca learns from rich signals
from GPT-4, including explanation traces, step-by-step thought processes, and
other complex instructions, guided by teacher assistance from ChatGPT. To
promote this progressive learning, we tap into large-scale and diverse
imitation data with judicious sampling and selection. Orca surpasses
conventional state-of-the-art instruction-tuned models such as Vicuna-13B by
more than 100% in complex zero-shot reasoning benchmarks like Big-Bench Hard
(BBH) and by 42% on AGIEval. Moreover, Orca reaches parity with ChatGPT on the
BBH benchmark and shows competitive performance (a 4-point gap with an
optimized system message) on professional and academic examinations like the
SAT, LSAT, GRE, and GMAT, all in zero-shot settings without CoT, while
trailing behind GPT-4. Our
research indicates that learning from step-by-step explanations, whether these
are generated by humans or more advanced AI models, is a promising direction to
improve model capabilities and skills.
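
The core of the approach is what the paper calls explanation tuning: pairing each query with a system message that elicits a step-by-step explanation from the teacher (GPT-4), then fine-tuning the student on the resulting <system, query, response> triples. Below is a minimal sketch of how such a training record might be assembled; the system message wording, field names, and file path are illustrative assumptions, not the paper's exact prompts or data format.

```python
import json

# Orca-style explanation tuning: augment <query, response> pairs with a
# system message that elicits an explanation trace from the teacher.
# These system messages are illustrative, not the paper's verbatim set.
SYSTEM_MESSAGES = [
    "You are a helpful assistant. Think step by step and justify your answer.",
    "Explain how you arrived at your answer as if teaching a student.",
]

def make_training_record(system_message: str, user_query: str,
                         teacher_response: str) -> dict:
    """Pack one explanation-tuning example. The student is fine-tuned to
    predict `response` given `system` + `query`; in standard supervised
    fine-tuning the loss is computed only on the response tokens."""
    return {
        "system": system_message,
        "query": user_query,
        "response": teacher_response,  # the teacher's step-by-step explanation
    }

record = make_training_record(
    SYSTEM_MESSAGES[0],
    "If a train travels 120 km in 2 hours, what is its average speed?",
    "Average speed is distance divided by time: 120 km / 2 h = 60 km/h. "
    "So the train's average speed is 60 km/h.",
)

# Hypothetical output file; one JSON record per line.
with open("orca_imitation_data.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```

Because the target includes the reasoning trace rather than just the final answer, the student is pushed to reproduce the process, not merely the style, of the teacher.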
Related papers
- GRIN: GRadient-INformed MoE
Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing.
We introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing.
Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data.
arXiv Detail & Related papers (2024-09-18T17:00:20Z)
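
GRIN's contribution is a better gradient estimator for the discrete expert-routing step in MoE training. The sketch below shows plain top-1 routing with a straight-through-style estimator, a common baseline that only stands in for the paper's sparse gradient estimator; all shapes and module choices are illustrative.

```python
import torch
import torch.nn.functional as F

def top1_route(logits: torch.Tensor) -> torch.Tensor:
    """Straight-through top-1 gating: hard one-hot on the forward pass,
    softmax gradient on the backward pass. A rough stand-in for GRIN's
    estimator, not the paper's actual method."""
    probs = F.softmax(logits, dim=-1)
    hard = F.one_hot(probs.argmax(dim=-1), logits.size(-1)).to(probs.dtype)
    return hard + probs - probs.detach()  # forward = hard, grad flows via probs

batch, d_model, n_experts = 4, 16, 8
x = torch.randn(batch, d_model)
router = torch.nn.Linear(d_model, n_experts)
experts = torch.nn.ModuleList(
    torch.nn.Linear(d_model, d_model) for _ in range(n_experts))

gates = top1_route(router(x))  # (batch, n_experts), one-hot per token
# For clarity every expert runs on every token here; a real MoE dispatches
# each token only to its selected expert(s), which is where the savings come from.
expert_out = torch.stack([e(x) for e in experts], dim=1)  # (batch, n_experts, d_model)
y = (gates.unsqueeze(-1) * expert_out).sum(dim=1)         # (batch, d_model)
y.sum().backward()  # gradients reach the router despite the hard argmax
```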
- Compact Language Models via Pruning and Knowledge Distillation
Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch.
Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch.
arXiv Detail & Related papers (2024-07-19T21:47:57Z)
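
The Minitron recipe prunes a pretrained model and then retrains it with knowledge distillation from the original. Below is a minimal sketch of the standard logit-distillation objective (temperature-scaled KL plus cross-entropy, after Hinton et al.); the temperature, mixing weight, and shapes are illustrative defaults, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      T: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """KL term against the teacher's softened distribution plus a standard
    cross-entropy term on the ground-truth labels."""
    soft_teacher = F.log_softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    # T*T rescaling keeps gradient magnitudes comparable across temperatures.
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean",
                  log_target=True) * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Toy shapes: (batch, vocab) logits from a pruned student and a frozen teacher.
student_logits = torch.randn(8, 32000, requires_grad=True)
teacher_logits = torch.randn(8, 32000)
labels = torch.randint(0, 32000, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```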
- How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts
We present MAD-Bench, a benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, counts of objects, and spatial relationships.
We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4V, Reka, and Gemini-Pro to open-source models such as LLaVA-NeXT and MiniCPM-Llama3.
While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of every other model in our experiments ranges from 9% to 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z)
- A Critical Evaluation of AI Feedback for Aligning Large Language Models
We show that simple supervised fine-tuning with GPT-4 as the teacher outperforms existing RLAIF pipelines.
More generally, we find that the gains from RLAIF vary substantially across base model families, test-time evaluation protocols, and critic models.
arXiv Detail & Related papers (2024-02-19T18:53:54Z)
- Evaluating and Enhancing Large Language Models for Conversational Reasoning on Knowledge Graphs
We evaluate the conversational reasoning capabilities of the current state-of-the-art large language model (GPT-4) on knowledge graphs (KGs).
We introduce LLM-ARK, a grounded KG reasoning agent designed to deliver precise and adaptable predictions on KG paths.
LLaMA-2-7B-ARK outperforms the current state-of-the-art model by 5.28 percentage points, with a performance rate of 36.39% on the target@1 evaluation metric.
arXiv Detail & Related papers (2023-12-18T15:23:06Z)
- Orca 2: Teaching Small Language Models How to Reason
Orca 1 learns from rich signals, such as explanation traces, allowing it to outperform conventional instruction-tuned models.
Orca 2 significantly surpasses models of similar size and attains performance levels similar to or better than those of models 5-10x larger.
arXiv Detail & Related papers (2023-11-18T11:44:52Z)
- Instructed to Bias: Instruction-Tuned Language Models Exhibit Emergent Cognitive Bias
Recent studies show that instruction tuning (IT) and reinforcement learning from human feedback (RLHF) improve the abilities of large language models (LMs) dramatically.
In this work, we investigate the effect of IT and RLHF on decision making and reasoning in LMs.
Our findings highlight the presence of these cognitive biases in various models from the GPT-3, Mistral, and T5 families.
arXiv Detail & Related papers (2023-08-01T01:39:25Z)
- The False Promise of Imitating Proprietary LLMs
An emerging method for cheaply improving a weaker language model is to finetune it on outputs from a stronger model, aiming to imitate the proprietary model's capabilities with an open-source one.
We first finetune a series of LMs that imitate ChatGPT using varying base model sizes.
We then evaluate the models using crowd raters and canonical NLP benchmarks.
arXiv Detail & Related papers (2023-05-25T05:00:12Z)
- Lion: Adversarial Distillation of Proprietary Large Language Models
We propose a novel adversarial distillation framework for a more efficient knowledge transfer.
We successfully transfer knowledge from ChatGPT to a student model (named Lion) using a mere 70k training examples.
arXiv Detail & Related papers (2023-05-22T09:49:16Z)
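
Lion's adversarial distillation alternates imitation (training on teacher demonstrations) with discrimination (finding instructions where the student still lags the teacher), so later rounds can target harder instructions. The loop below is a schematic under those assumptions; `teacher`, `student`, and `judge` are stand-in callables, not real API bindings.

```python
from typing import Callable

def adversarial_distillation_round(
    instructions: list[str],
    teacher: Callable[[str], str],            # e.g. a ChatGPT call (stub here)
    student: Callable[[str], str],            # the model being distilled
    judge: Callable[[str, str, str], float],  # scores student vs. teacher
    hard_threshold: float = 0.5,
) -> tuple[list[tuple[str, str]], list[str]]:
    """One imitation/discrimination round, loosely following Lion's loop:
    collect teacher demonstrations, then flag instructions where the
    student lags behind so the next round can focus on harder variants."""
    demonstrations, hard_instructions = [], []
    for inst in instructions:
        reference = teacher(inst)
        demonstrations.append((inst, reference))       # SFT data for the student
        score = judge(inst, student(inst), reference)  # 1.0 = matches teacher
        if score < hard_threshold:
            hard_instructions.append(inst)             # mine harder variants next
    return demonstrations, hard_instructions

# Toy usage with stand-in callables:
demos, hard = adversarial_distillation_round(
    ["Summarize the Orca paper in one sentence."],
    teacher=lambda p: "Orca fine-tunes a 13B model on GPT-4 explanation traces.",
    student=lambda p: "Orca is a model.",
    judge=lambda p, s, r: 0.3,  # toy judge deems the student response weak
)
```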