How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources
- URL: http://arxiv.org/abs/2306.04751v2
- Date: Mon, 30 Oct 2023 20:36:20 GMT
- Title: How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources
- Authors: Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot,
Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A. Smith, Iz
Beltagy, Hannaneh Hajishirzi
- Abstract summary: This work explores recent advances in instruction-tuning language models on a range of open instruction-following datasets.
We provide a large set of instruction-tuned models from 6.7B to 65B parameters in size, trained on 12 instruction datasets.
We evaluate them on their factual knowledge, reasoning, multilinguality, coding, and open-ended instruction following abilities.
- Score: 117.6496550359768
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work we explore recent advances in instruction-tuning language models
on a range of open instruction-following datasets. Despite recent claims that
open models can be on par with state-of-the-art proprietary models, these
claims are often accompanied by limited evaluation, making it difficult to
compare models across the board and determine the utility of various resources.
We provide a large set of instruction-tuned models from 6.7B to 65B parameters
in size, trained on 12 instruction datasets ranging from manually curated
(e.g., OpenAssistant) to synthetic and distilled (e.g., Alpaca) and
systematically evaluate them on their factual knowledge, reasoning,
multilinguality, coding, and open-ended instruction following abilities through
a collection of automatic, model-based, and human-based metrics. We further
introduce Tülu, our best performing instruction-tuned model suite finetuned
on a combination of high-quality open resources. Our experiments show that
different instruction-tuning datasets can uncover or enhance specific skills,
while no single dataset (or combination) provides the best performance across
all evaluations. Interestingly, we find that model and human preference-based
evaluations fail to reflect differences in model capabilities exposed by
benchmark-based evaluations, suggesting the need for the type of systematic
evaluation performed in this work. Our evaluations show that the best model in
any given evaluation reaches on average 87% of ChatGPT performance, and 73% of
GPT-4 performance, suggesting that further investment in building better base
models and instruction-tuning data is required to close the gap. We release our
instruction-tuned models, including a fully finetuned 65B Tülu, along with
our code, data, and evaluation framework at
https://github.com/allenai/open-instruct to facilitate future research.
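The released suite is trained with standard supervised finetuning on instruction-response pairs, with the loss computed only on response tokens. Below is a minimal sketch of that recipe using Hugging Face transformers; the tiny stand-in model and the chat template are illustrative assumptions, not the exact open-instruct pipeline.

```python
# Minimal instruction-tuning sketch: concatenate prompt and response,
# mask the prompt tokens so the loss covers only the response.
# The model and template are stand-ins, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "facebook/opt-125m"  # small stand-in; the paper trains 6.7B-65B models
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

def build_example(instruction: str, response: str):
    prompt = f"<|user|>\n{instruction}\n<|assistant|>\n"  # assumed template
    prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
    response_ids = tokenizer(response + tokenizer.eos_token,
                             add_special_tokens=False).input_ids
    input_ids = torch.tensor([prompt_ids + response_ids])
    # -100 marks positions ignored by the cross-entropy loss.
    labels = torch.tensor([[-100] * len(prompt_ids) + response_ids])
    return input_ids, labels

input_ids, labels = build_example(
    "Name two open instruction datasets.",
    "OpenAssistant and Alpaca.")
loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()  # one step of an ordinary fine-tuning loop
```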
Related papers
- Self-Judge: Selective Instruction Following with Alignment Self-Evaluation [27.69410513313001]
We study selective instruction following, whereby the system declines to execute instructions when the anticipated response quality is low.
We introduce Self-J, a novel self-training framework for developing judge models without needing human-annotated quality scores.
arXiv Detail & Related papers (2024-09-02T04:14:13Z)
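Self-J's selective execution can be pictured as a simple quality gate: draft a response, score it with a judge model, and decline below a threshold. The sketch below is a hypothetical illustration; the callables and the scoring scale are assumptions, not the paper's interface.

```python
# Hypothetical quality gate for selective instruction following.
# `generate` and `judge` are assumed callables (e.g., an LLM wrapper and
# a trained judge model); the 1-10 scale and threshold are illustrative.
def respond_selectively(instruction, generate, judge, threshold=6.0):
    draft = generate(instruction)      # candidate response
    score = judge(instruction, draft)  # anticipated quality, e.g. 1-10
    if score < threshold:
        # Decline rather than emit a likely low-quality answer.
        return "I'm not confident I can answer this well."
    return draft
```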
- Self-Taught Evaluators [77.92610887220594]
We present an approach that aims to improve evaluators without human annotations, using synthetic training data only.
Our Self-Taught Evaluator can improve a strong LLM from 75.4 to 88.3 on RewardBench.
arXiv Detail & Related papers (2024-08-05T17:57:02Z)
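One way to read the self-training recipe: build preference pairs whose label is known by construction (a normal answer versus a deliberately degraded one), have the current model judge them, keep only judgments that agree with the construction, and fine-tune on those traces. The skeleton below is an assumption-laden paraphrase, with every callable hypothetical.

```python
# Hypothetical self-training loop for an LLM-as-judge. All callables
# (generate, corrupt, judge, finetune) are assumed stand-ins.
def self_train_evaluator(model, prompts, generate, corrupt, judge,
                         finetune, rounds=3):
    for _ in range(rounds):
        traces = []
        for p in prompts:
            good = generate(model, p)        # ordinary answer
            bad = corrupt(model, p, good)    # deliberately worse answer
            verdict, reasoning = judge(model, p, good, bad)
            if verdict == "first":           # agrees with construction
                traces.append((p, good, bad, reasoning))
        model = finetune(model, traces)      # train on its own judgments
    return model
```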
- LIMIT: Less Is More for Instruction Tuning Across Evaluation Paradigms [2.249916681499244]
We finetune open-source MPT-7B and MPT-30B models on instruction finetuning datasets of various sizes ranging from 1k to 60k samples.
We find that subsets of 1k-6k instruction finetuning samples are sufficient to achieve good performance on both (1) traditional NLP benchmarks and (2) model-based evaluation.
arXiv Detail & Related papers (2023-11-22T03:37:01Z)
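The LIMIT result suggests a cheap ablation: fine-tune on small random subsets of the instruction data and compare against the full set. A trivial subsampling sketch, with illustrative field names:

```python
# Draw a small instruction-tuning subset, following the finding that
# 1k-6k examples can already perform well. Field names are illustrative.
import random

def sample_subset(dataset, k=1000, seed=42):
    rng = random.Random(seed)      # fixed seed for reproducibility
    return rng.sample(dataset, k)  # k examples without replacement

full = [{"instruction": f"task {i}", "response": f"answer {i}"}
        for i in range(60_000)]
small = sample_subset(full, k=1000)
print(len(small))  # 1000
```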
- Fantastic Gains and Where to Find Them: On the Existence and Prospect of General Knowledge Transfer between Any Pretrained Model [74.62272538148245]
We show that for arbitrary pairings of pretrained models, one model extracts significant data context unavailable in the other.
We investigate if it is possible to transfer such "complementary" knowledge from one model to another without performance degradation.
arXiv Detail & Related papers (2023-10-26T17:59:46Z)
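A rough way to picture "complementary" knowledge transfer is selective distillation: distill only on inputs where the teacher succeeds and the student fails, so existing student knowledge is not overwritten. The step below is a guess at the general shape, not the paper's actual transfer method.

```python
# Selective distillation sketch: KL loss only on samples where the
# teacher classifies correctly and the student does not. Details
# (temperature, masking rule) are assumptions.
import torch
import torch.nn.functional as F

def complementary_distill_step(student, teacher, x, y, optimizer, T=2.0):
    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)
    # Mask: teacher right, student wrong on this sample.
    mask = (t_logits.argmax(-1) == y) & (s_logits.argmax(-1) != y)
    if mask.any():
        loss = F.kl_div(F.log_softmax(s_logits[mask] / T, dim=-1),
                        F.softmax(t_logits[mask] / T, dim=-1),
                        reduction="batchmean") * T * T
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```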
- Harnessing the Power of David against Goliath: Exploring Instruction Data Generation without Using Closed-Source Models [32.41573520305861]
We explore alternative approaches to generate high-quality instruction data that do not rely on closed-source models.
Evaluation results from two benchmarks and the GPT-4 model demonstrate the effectiveness of our generated instruction data.
arXiv Detail & Related papers (2023-08-24T11:07:47Z)
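Generating instruction data without closed-source models can start from few-shot prompting an open model with seed instructions and harvesting its continuations. The sketch below uses GPT-2 purely as a stand-in; the prompt format and the lack of filtering are assumptions, not the paper's pipeline.

```python
# Bootstrap a new candidate instruction from seed instructions using an
# open-source model. GPT-2 is a stand-in; prompt format is assumed.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

seed = ["Summarize the following paragraph.",
        "Translate this sentence into French."]
prompt = ("Here are some task instructions:\n"
          + "\n".join(f"- {s}" for s in seed) + "\n-")

out = generator(prompt, max_new_tokens=30, do_sample=True,
                num_return_sequences=1)
candidate = out[0]["generated_text"][len(prompt):].split("\n")[0].strip()
print(candidate)  # candidate instruction; real pipelines filter these
```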
- INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models [39.46610170563634]
INSTRUCTEVAL is a more comprehensive evaluation suite designed specifically for instruction-tuned large language models.
We take a holistic approach to analyze various factors affecting model performance, including the pretraining foundation, instruction-tuning data, and training methods.
Our findings reveal that the quality of instruction data is the most crucial factor in scaling model performance.
arXiv Detail & Related papers (2023-06-07T20:12:29Z)
- Universal Domain Adaptation from Foundation Models: A Baseline Study [58.51162198585434]
We conduct empirical studies of state-of-the-art UniDA methods using foundation models.
We introduce CLIP distillation, a parameter-free method specifically designed to distill target knowledge from CLIP models.
Although simple, our method outperforms previous approaches in most benchmark tasks.
arXiv Detail & Related papers (2023-05-18T16:28:29Z)
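Parameter-free "CLIP distillation" can be pictured as taking CLIP's zero-shot image-text similarities on target data as soft targets. The sketch below shows the zero-shot scoring half with Hugging Face's CLIP; treating the probabilities as distillation targets is one reading, not necessarily the paper's exact formulation.

```python
# Zero-shot class probabilities from CLIP, usable as soft pseudo-labels
# for a target domain. The label set and blank image are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["a photo of a dog", "a photo of a cat"]  # assumed label set
image = Image.new("RGB", (224, 224))                # stand-in image

inputs = processor(text=classes, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape (1, num_classes)
soft_targets = logits.softmax(dim=-1)          # distillation targets
print(soft_targets)
```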
- Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
This creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space.
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
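The simplest instance of parameter-space merging is an elementwise average of state dicts from models fine-tuned off the same base; the paper's actual method weights parameters more carefully, so the uniform average below is only the baseline idea.

```python
# Baseline parameter-space merge: elementwise average of state dicts
# from same-architecture models fine-tuned from one initialization.
import torch

def merge_state_dicts(state_dicts):
    merged = {}
    for key in state_dicts[0]:
        merged[key] = torch.stack(
            [sd[key].float() for sd in state_dicts]).mean(dim=0)
    return merged

# Usage with any two same-architecture checkpoints:
#   fused = merge_state_dicts([model_a.state_dict(), model_b.state_dict()])
#   model_a.load_state_dict(fused)
```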
- Scaling Instruction-Finetuned Language Models [126.4789306516927]
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance.
We find that instruction finetuning dramatically improves performance on a variety of model classes.
arXiv Detail & Related papers (2022-10-20T16:58:32Z)
- Learning to Compare for Better Training and Evaluation of Open Domain Natural Language Generation Models [23.62054164511058]
We propose to evaluate natural language generation models by fine-tuning BERT to compare pairs of generated sentences.
While it can be trained in a fully self-supervised fashion, our model can be further fine-tuned with a small amount of human preference annotation.
arXiv Detail & Related papers (2020-02-12T15:52:21Z)
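The comparator reduces to a sequence-pair classifier: encode the two candidate generations as a BERT sentence pair and predict which one is better. A minimal sketch, where the pairing format and label convention are assumptions:

```python
# Pairwise comparator sketch: BERT as a sentence-pair classifier that
# predicts which of two generations is better. Label convention assumed.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=2)  # 0: first is better, 1: second is better

def compare(text_a, text_b):
    inputs = tokenizer(text_a, text_b, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return "first" if logits.argmax(-1).item() == 0 else "second"

print(compare("The cat sat on the mat.", "Cat mat the on sat."))
# Untrained head: output is arbitrary until fine-tuned on comparison data.
```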