LIMIT: Less Is More for Instruction Tuning Across Evaluation Paradigms
- URL: http://arxiv.org/abs/2311.13133v1
- Date: Wed, 22 Nov 2023 03:37:01 GMT
- Title: LIMIT: Less Is More for Instruction Tuning Across Evaluation Paradigms
- Authors: Aditi Jha, Sam Havens, Jeremy Dohmann, Alex Trott, Jacob Portes
- Abstract summary: We finetune open-source MPT-7B and MPT-30B models on instruction finetuning datasets of various sizes ranging from 1k to 60k samples.
We find that subsets of 1k-6k instruction finetuning samples are sufficient to achieve good performance on both (1) traditional NLP benchmarks and (2) model-based evaluation.
- Score: 2.249916681499244
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large Language Models are traditionally finetuned on large instruction
datasets. However, recent studies suggest that small, high-quality datasets can
suffice for general purpose instruction following. This lack of consensus
surrounding finetuning best practices is in part due to rapidly diverging
approaches to LLM evaluation. In this study, we ask whether a small amount of
diverse finetuning samples can improve performance on both traditional
perplexity-based NLP benchmarks, and on open-ended, model-based evaluation. We
finetune open-source MPT-7B and MPT-30B models on instruction finetuning
datasets of various sizes ranging from 1k to 60k samples. We find that subsets
of 1k-6k instruction finetuning samples are sufficient to achieve good
performance on both (1) traditional NLP benchmarks and (2) model-based
evaluation. Finally, we show that mixing textbook-style and open-ended QA
finetuning datasets optimizes performance on both evaluation paradigms.
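
The subset-and-mix setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the helper `mix_datasets`, the toy records, the 5,000/1,000 split, and the use of uniform random sampling are all assumptions made for the example (the abstract states only that subsets of 1k-6k samples were used and that textbook-style and open-ended QA data were mixed).

```python
import random


def mix_datasets(textbook, open_ended, n_textbook, n_open_ended, seed=0):
    """Hypothetical helper: draw a fixed number of examples from each
    source without replacement, then shuffle them into one finetuning set."""
    rng = random.Random(seed)  # local RNG so the mix is reproducible
    mixed = rng.sample(textbook, n_textbook) + rng.sample(open_ended, n_open_ended)
    rng.shuffle(mixed)
    return mixed


# Toy stand-ins for a large textbook-style dataset and a small
# open-ended QA dataset (sizes are illustrative assumptions).
textbook = [
    {"source": "textbook", "prompt": f"Q{i}", "response": f"A{i}"}
    for i in range(60_000)
]
open_ended = [
    {"source": "open_ended", "prompt": f"Ask{i}", "response": f"Reply{i}"}
    for i in range(1_000)
]

# A 6k-sample mix, at the upper end of the 1k-6k range in the abstract.
mixed = mix_datasets(textbook, open_ended, n_textbook=5_000, n_open_ended=1_000)
print(len(mixed))
```

Fixing the seed keeps the subset identical across runs, which makes it easier to compare models finetuned on the same mix under both evaluation paradigms.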
Related papers
- Align$^2$LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation [56.75665429851673]
This paper introduces a novel instruction curation algorithm, derived from two unique perspectives, human and LLM preference alignment.
Experiments demonstrate that we can maintain or even improve model performance by compressing synthetic multimodal instructions by up to 90%.
arXiv Detail & Related papers (2024-09-27T08:20:59Z)
- Data Efficient Evaluation of Large Language Models and Text-to-Image Models via Adaptive Sampling [3.7467864495337624]
SubLIME is a data-efficient evaluation framework for text-to-image models.
Our approach ensures statistically aligned model rankings compared to full datasets.
We leverage the HEIM leaderboard to cover 25 text-to-image models on 17 different benchmarks.
arXiv Detail & Related papers (2024-06-21T07:38:55Z)
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
- Instruction Mining: Instruction Data Selection for Tuning Large Language Models [18.378654454336136]
InstructMining is designed for automatically selecting premium instruction-following data for finetuning large language models.
We show that InstructMining achieves state-of-the-art performance on two of the most popular benchmarks: LLM-as-a-judge and Huggingface OpenLLM leaderboard.
arXiv Detail & Related papers (2023-07-12T16:37:31Z)
- How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources [117.6496550359768]
This work explores recent advances in instruction-tuning language models on a range of open instruction-following datasets.
We provide a large set of instruction-tuned models from 6.7B to 65B parameters in size, trained on 12 instruction datasets.
We evaluate them on their factual knowledge, reasoning, multilinguality, coding, and open-ended instruction following abilities.
arXiv Detail & Related papers (2023-06-07T19:59:23Z)
- Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models [125.91897197446379]
We find that MoE models benefit more from instruction tuning than dense models.
Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks.
arXiv Detail & Related papers (2023-05-24T04:22:26Z)
- Scaling Instruction-Finetuned Language Models [126.4789306516927]
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance.
We find that instruction finetuning dramatically improves performance on a variety of model classes.
arXiv Detail & Related papers (2022-10-20T16:58:32Z)
- Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language Transfer Learning [59.38343286807997]
We propose Model-Agnostic Multitask Fine-tuning (MAMF) for vision-language models on unseen tasks.
Compared with model-agnostic meta-learning (MAML), MAMF discards the bi-level optimization and uses only first-order gradients.
We show that MAMF consistently outperforms the classical fine-tuning method for few-shot transfer learning on five benchmark datasets.
arXiv Detail & Related papers (2022-03-09T17:26:53Z)
- SE3M: A Model for Software Effort Estimation Using Pre-trained Embedding Models [0.8287206589886881]
This paper proposes to evaluate the effectiveness of pre-trained embedding models.
Generic pre-trained models for both approaches went through a fine-tuning process.
Results were promising, indicating that pre-trained models can be used to estimate software effort based only on requirements texts.
arXiv Detail & Related papers (2020-06-30T14:15:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.