Improving In-context Learning via Bidirectional Alignment
- URL: http://arxiv.org/abs/2312.17055v2
- Date: Mon, 24 Jun 2024 08:34:18 GMT
- Title: Improving In-context Learning via Bidirectional Alignment
- Authors: Chengwei Qin, Wenhan Xia, Fangkai Jiao, Chen Chen, Yuchen Hu, Bosheng Ding, Shafiq Joty
- Abstract summary: Large language models (LLMs) have shown impressive few-shot generalization on many tasks via in-context learning (ICL).
We propose Bidirectional Alignment (BiAlign) to fully leverage the models' preferences for ICL examples to improve the ICL abilities of student models.
Specifically, we introduce the alignment of input preferences between student and teacher models by incorporating a novel ranking loss.
- Score: 41.214003703218914
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have shown impressive few-shot generalization on many tasks via in-context learning (ICL). Despite their success in showing such emergent abilities, the scale and complexity of larger models also lead to unprecedentedly high computational demands and deployment challenges. In reaction, researchers explore transferring the powerful capabilities of larger models to more efficient and compact models by typically aligning the output of smaller (student) models with that of larger (teacher) models. Existing methods either train student models on the generated outputs of teacher models or imitate their token-level probability distributions. However, these distillation methods pay little to no attention to the input, which also plays a crucial role in ICL. Based on the finding that the performance of ICL is highly sensitive to the selection of demonstration examples, we propose Bidirectional Alignment (BiAlign) to fully leverage the models' preferences for ICL examples to improve the ICL abilities of student models. Specifically, we introduce the alignment of input preferences between student and teacher models by incorporating a novel ranking loss, in addition to aligning the token-level output distribution. With extensive experiments and analysis, we demonstrate that BiAlign can consistently outperform existing baselines on a variety of tasks involving language understanding, reasoning, and coding.
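The two alignment terms described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact form of the ranking loss, so the pairwise hinge formulation, the `margin` and `weight` parameters, and the idea of scoring each demonstration subset with a single number per model (for example, the log-likelihood of the gold answer) are all assumptions made here for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) for one token position (output alignment)."""
    lp_t = log_softmax(teacher_logits)
    lp_s = log_softmax(student_logits)
    return sum(math.exp(t) * (t - s) for t, s in zip(lp_t, lp_s))


def pairwise_ranking_loss(teacher_scores, student_scores, margin=0.0):
    """Assumed form of the input-preference term: a hinge penalty for every
    pair of demonstration subsets that the student ranks in a different
    order than the teacher. Scores are hypothetical per-subset preferences."""
    loss, pairs = 0.0, 0
    for i, t_i in enumerate(teacher_scores):
        for j, t_j in enumerate(teacher_scores):
            if t_i > t_j:  # teacher prefers subset i over subset j
                gap = student_scores[i] - student_scores[j]
                loss += max(0.0, margin - gap)
                pairs += 1
    return loss / pairs if pairs else 0.0


def combined_loss(teacher_logits, student_logits,
                  teacher_scores, student_scores, weight=1.0):
    """Mean token-level KL plus a weighted ranking term (weight is assumed)."""
    kl = sum(token_kl(t, s) for t, s in zip(teacher_logits, student_logits))
    kl /= len(teacher_logits)
    return kl + weight * pairwise_ranking_loss(teacher_scores, student_scores)
```

In this sketch the ranking term is zero whenever the student orders the demonstration subsets the same way the teacher does, and it grows with the size of each disagreement, so it only pushes the student toward the teacher's preferences over inputs and leaves the output distribution to the KL term.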
Related papers
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- CollectiveSFT: Scaling Large Language Models for Chinese Medical Benchmark with Collective Instructions in Healthcare [12.218718086529462]
This study focuses on the Comprehensive Medical Benchmark in Chinese (CMB).
We successfully trained a smaller base model to achieve scores comparable to larger models.
By integrating a wide range of instructional content, our approach addresses potential issues such as data quality inconsistencies.
arXiv Detail & Related papers (2024-07-29T05:00:48Z)
- LLAVADI: What Matters For Multimodal Large Language Models Distillation [77.73964744238519]
In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch.
Our studies involve training strategies, model choices, and distillation algorithms in the knowledge distillation process.
By evaluating different benchmarks and proper strategy, even a 2.7B small-scale model can perform on par with larger models with 7B or 13B parameters.
arXiv Detail & Related papers (2024-07-28T06:10:47Z)
- Why Larger Language Models Do In-context Learning Differently? [12.554356517949785]
Large language models (LLMs) have emerged as a powerful tool for AI, with the key ability of in-context learning (ICL).
One recent mysterious observation is that models of different scales may have different ICL behaviors.
arXiv Detail & Related papers (2024-05-30T01:11:35Z)
- Small Models are Valuable Plug-ins for Large Language Models [65.29370906766997]
Large language models (LLMs) such as GPT-3 and GPT-4 are powerful but their weights are often publicly unavailable.
We propose Super In-Context Learning (SuperICL) which allows black-box LLMs to work with locally fine-tuned smaller models.
arXiv Detail & Related papers (2023-05-15T17:59:01Z)
- Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost.
Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute performance when scaling vision-language models.
arXiv Detail & Related papers (2023-03-13T16:00:31Z)
- Prompt-Augmented Linear Probing: Scaling beyond the Limit of Few-shot In-Context Learners [25.262774179224945]
This paper proposes prompt-augmented linear probing (PALP), a hybrid of linear probing and in-context learning (ICL).
PALP significantly enhances the input representations, closing the gap between ICL in the data-hungry scenario and fine-tuning in the data-abundant scenario with little training overhead.
arXiv Detail & Related papers (2022-12-21T09:37:05Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.