Stronger Models are NOT Stronger Teachers for Instruction Tuning
- URL: http://arxiv.org/abs/2411.07133v3
- Date: Wed, 26 Feb 2025 18:10:05 GMT
- Title: Stronger Models are NOT Stronger Teachers for Instruction Tuning
- Authors: Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Radha Poovendran
- Abstract summary: We show that larger and stronger models are not necessarily stronger teachers of smaller models. We thus develop a novel metric, named Compatibility-Adjusted Reward (CAR), to measure the effectiveness of response generators.
- Score: 12.87887398974395
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Instruction tuning has been widely adopted to ensure large language models (LLMs) follow user instructions effectively. The resulting instruction-following capabilities of LLMs rely heavily on the instruction datasets used for tuning. Recently, synthetic instruction datasets have emerged as an economically viable way to provide LLMs with diverse and high-quality instructions. However, existing approaches typically assume that larger or stronger models are stronger teachers for instruction tuning, and hence simply adopt these models as response generators for the synthetic instructions. In this paper, we challenge this commonly adopted assumption. Our extensive experiments across five base models and twenty response generators reveal that larger and stronger models are not necessarily stronger teachers of smaller models. We refer to this phenomenon as the Larger Models' Paradox. We observe that existing metrics cannot precisely predict the effectiveness of response generators because they ignore the compatibility between teachers and the base models being fine-tuned. We therefore develop a novel metric, named Compatibility-Adjusted Reward (CAR), to measure the effectiveness of response generators. Our experiments across five base models demonstrate that CAR outperforms almost all baselines.
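The exact formulation of CAR is given in the full paper; the sketch below is only a hypothetical illustration of the idea, assuming compatibility is proxied by the base model's average loss on a teacher's responses (the function name, penalty exponent, and all numbers are invented for the example):

```python
# Hypothetical CAR-style score: divide a teacher's mean reward by a
# compatibility penalty derived from the base model's average loss on that
# teacher's responses. NOT the paper's exact formula.
def compatibility_adjusted_reward(rewards, base_model_losses, penalty=1.0):
    """rewards: reward-model scores for one teacher's responses.
    base_model_losses: the base model's per-token cross-entropy on the same
    responses (lower loss = responses more "compatible" with the base model)."""
    mean_reward = sum(rewards) / len(rewards)
    mean_loss = sum(base_model_losses) / len(base_model_losses)
    # Penalize teachers whose responses the base model finds unlikely.
    return mean_reward / (mean_loss ** penalty)

# Rank candidate response generators by the adjusted score, not raw reward.
teachers = {
    "teacher_a": ([0.71, 0.64], [2.1, 2.3]),  # hypothetical numbers
    "teacher_b": ([0.78, 0.75], [3.9, 4.2]),
}
ranked = sorted(teachers, key=lambda t: -compatibility_adjusted_reward(*teachers[t]))
print(ranked)  # ['teacher_a', 'teacher_b']
```

The design intent is visible in the toy numbers: teacher_b has the higher raw reward but ranks lower once its responses are penalized for being unlikely under the base model.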
Related papers
- High-Rank Structured Modulation for Parameter-Efficient Fine-Tuning [57.85676271833619]
Low-rank Adaptation (LoRA) uses a low-rank update method to simulate full-parameter fine-tuning. We present SMoA, a high-rank Structured MOdulation Adapter that uses fewer trainable parameters while maintaining a higher rank.
arXiv Detail & Related papers (2026-01-12T13:06:17Z)
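For context on the baseline this adapter targets, here is a minimal sketch of the standard LoRA update (not SMoA itself; the dimensions, rank, and scaling factor are illustrative):

```python
# Standard LoRA: a frozen weight W receives a trainable low-rank delta B @ A,
# with rank r << min(d, k), so trainable parameters drop from d*k to r*(d + k).
import numpy as np

d, k, r = 512, 512, 8                # layer dims and LoRA rank (illustrative)
W = np.random.randn(d, k) * 0.02     # frozen pretrained weight
A = np.random.randn(r, k) * 0.01     # trainable, shape (r, k)
B = np.zeros((d, r))                 # trainable, shape (d, r); zero init => delta starts at 0

def lora_forward(x, scale=2.0):
    # Effective weight is W + scale * (B @ A); only A and B are trained.
    return x @ (W + scale * (B @ A)).T

x = np.random.randn(1, k)
print(lora_forward(x).shape)  # (1, 512)
```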
- WST: Weak-to-Strong Knowledge Transfer via Reinforcement Learning [4.795603034052733]
We introduce Weak-to-Strong Transfer (WST), an automatic prompt engineering framework where a small "Teacher" model generates instructions that enhance the performance of a much larger "Student" model. Using reinforcement learning, the Teacher's instructions are iteratively improved based on the Student's outcomes. The results demonstrate that small models can reliably scaffold larger ones, unlocking latent capabilities while avoiding the misleading prompts that stronger teachers may introduce.
arXiv Detail & Related papers (2025-08-22T18:33:06Z)
- GRAM: A Generative Foundation Reward Model for Reward Generalization [48.63394690265176]
We develop a generative reward model that is first trained via large-scale unsupervised learning and then fine-tuned via supervised learning. This model generalizes well across several tasks, including response ranking, reinforcement learning from human feedback, and task adaptation with fine-tuning.
arXiv Detail & Related papers (2025-06-17T04:34:27Z)
- HAD: Hybrid Architecture Distillation Outperforms Teacher in Genomic Sequence Modeling [52.58723853697152]
We propose a Hybrid Architecture Distillation (HAD) approach for DNA sequence modeling. We employ NTv2-500M as the teacher model and devise a grouping masking strategy. Compared to models with a similar number of parameters, our model achieves excellent performance.
arXiv Detail & Related papers (2025-05-27T07:57:35Z)
- Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models [27.142703756752997]
We introduce MathIF, a benchmark for evaluating instruction-following in mathematical reasoning tasks. Our empirical analysis reveals a consistent tension between scaling up reasoning capacity and maintaining controllability. We show that even simple interventions can partially recover obedience, though at the cost of reasoning performance.
arXiv Detail & Related papers (2025-05-20T18:18:01Z)
- SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [88.29990536278167]
We introduce SPaR, a self-play framework integrating tree-search self-refinement to yield valid and comparable preference pairs.
Our experiments show that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses GPT-4-Turbo on the IFEval benchmark without losing general capabilities.
arXiv Detail & Related papers (2024-12-16T09:47:43Z)
- Smaller Language Models Are Better Instruction Evolvers [10.587052565101844]
Small language models (SLMs) can synthesize more effective instructions than large language models (LLMs).
We propose Instruction Complexity-Aware IFD (IC-IFD) to evaluate the effectiveness of instruction data more accurately.
arXiv Detail & Related papers (2024-12-15T16:07:48Z)
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods across various model architectures and sizes, while reducing training time by up to a factor of four.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
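As background, the sketch below shows the vanilla distillation objective that OKD-style methods build on: matching the student's next-token distribution to the teacher's via a temperature-scaled KL term. OKD's online teacher modules are not modeled here, and the shapes, temperature, and vocabulary size are illustrative:

```python
# Vanilla KD objective: the student minimizes KL(teacher || student) over
# next-token distributions, with temperature scaling. OKD additionally trains
# small online modules inside the teacher jointly with the student (not shown).
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(teacher_logits, student_logits, T=2.0):
    """Temperature-scaled KL divergence, averaged over sequence positions."""
    p = softmax(teacher_logits, T)  # teacher distribution (fixed target)
    q = softmax(student_logits, T)  # student distribution (being trained)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)) * T * T)

teacher = np.random.randn(4, 32000)  # 4 positions, hypothetical vocab size
student = np.random.randn(4, 32000)
print(kd_loss(teacher, student))
```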
- Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models [63.36637269634553]
We present a novel method of further improving performance by requiring models to compare multiple reasoning chains.
We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, language models.
arXiv Detail & Related papers (2024-07-03T15:01:18Z)
- The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction [22.659005954676598]
We show that it is possible to significantly improve the performance of Large Language Models by selectively removing higher-order components of their weight matrices.
This simple intervention, which we call LAyer-SElective Rank reduction (LASER), can be done on a model after training has completed.
We present extensive experiments demonstrating the generality of this finding across language models and datasets.
arXiv Detail & Related papers (2023-12-21T03:51:08Z)
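The core LASER operation is a post-hoc low-rank truncation of a chosen weight matrix. Here is a minimal sketch, with the layer choice and retained-rank fraction picked arbitrarily (the paper selects both by search):

```python
# LASER-style intervention: after training, replace one weight matrix with a
# truncated-SVD reconstruction that keeps only its top singular components.
import numpy as np

def laser_reduce(W, keep_fraction=0.01):
    """Return the best rank-k approximation of W, k = keep_fraction of full rank."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    k = max(1, int(len(S) * keep_fraction))
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

W = np.random.randn(256, 1024)           # stand-in for e.g. an MLP projection
W_reduced = laser_reduce(W, keep_fraction=0.05)
print(np.linalg.matrix_rank(W_reduced))  # ~5% of the original rank
```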
- Tuna: Instruction Tuning using Feedback from Large Language Models [74.04950416204551]
We propose finetuning an instruction-tuned large language model using our novel probabilistic ranking and contextual ranking approaches.
Probabilistic ranking enables the instruction-tuned model to inherit the relative rankings of high-quality and low-quality responses from the teacher LLM.
Learning with contextual ranking, in turn, allows the model to refine its own response distribution using the contextual understanding ability of stronger LLMs.
arXiv Detail & Related papers (2023-10-20T09:55:06Z)
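A ranking objective of this kind can be sketched as a pairwise hinge on sequence log-probabilities: the tuned model should assign higher likelihood to the response the teacher ranks higher. This is an illustrative stand-in, not Tuna's exact loss; the margin and log-prob values are invented:

```python
# Pairwise ranking sketch: penalize the model when a teacher-preferred
# response is not assigned higher log-probability than a dispreferred one.
def pairwise_ranking_loss(logp_preferred, logp_dispreferred, margin=1.0):
    # Hinge on the log-probability gap between the better and worse response.
    return max(0.0, margin - (logp_preferred - logp_dispreferred))

# Hypothetical per-response log-probs under the tuned model:
print(pairwise_ranking_loss(-12.3, -10.8))  # > 0: the model misranks the pair
print(pairwise_ranking_loss(-9.1, -14.5))   # 0.0: the ranking is already satisfied
```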
- Evaluating the Robustness to Instructions of Large Language Models [6.947956990248856]
Fine-tuning Large Language Models (LLMs) can boost their zero-shot capabilities on novel tasks.
We evaluate six models, including Alpaca, Vicuna, WizardLM, and traditional task-oriented models (Flan-T5-XL/XXL, T0++).
We find that FLAN-T5 models of different scales are less robust to RE instructions than to QA instructions.
arXiv Detail & Related papers (2023-08-28T04:57:07Z)
- Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models [125.91897197446379]
We find that MoE models benefit more from instruction tuning than dense models.
Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks.
arXiv Detail & Related papers (2023-05-24T04:22:26Z)
- LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions [28.937552799649808]
Large language models (LLMs) with instruction fine-tuning demonstrate superior generative capabilities.
We develop a large set of 2.58M instructions based on both existing and newly generated instructions.
We fine-tune a diverse herd of models, collectively referred to as LaMini-LM, which includes models from both the encoder-decoder and decoder-only families.
arXiv Detail & Related papers (2023-04-27T17:58:49Z)
- Large Language Models Are Reasoning Teachers [9.290757451344673]
Fine-tune-CoT is a method that generates reasoning samples from very large teacher models to fine-tune smaller models.
We find that Fine-tune-CoT enables substantial reasoning capability in small models, far outperforming prompt-based baselines and even the teacher model in many tasks.
arXiv Detail & Related papers (2022-12-20T08:24:45Z)
- METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals [151.3601429216877]
We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model.
We propose a recipe, namely "Model generated dEnoising TRaining Objective" (METRO).
The resultant models, METRO-LM, consisting of up to 5.4 billion parameters, achieve new state-of-the-art on the GLUE, SuperGLUE, and SQuAD benchmarks.
arXiv Detail & Related papers (2022-04-13T21:39:15Z)
- When Ensembling Smaller Models is More Efficient than Single Large Models [52.38997176317532]
We show that ensembles can outperform single models, achieving higher accuracy while requiring fewer total FLOPs to compute.
This is an interesting observation: exploiting output diversity through ensembling can often be more efficient than training larger models.
arXiv Detail & Related papers (2020-05-01T18:56:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.