SeCoKD: Aligning Large Language Models for In-Context Learning with Fewer Shots
- URL: http://arxiv.org/abs/2406.14208v2
- Date: Thu, 26 Sep 2024 08:12:59 GMT
- Title: SeCoKD: Aligning Large Language Models for In-Context Learning with Fewer Shots
- Authors: Weixing Wang, Haojin Yang, Christoph Meinel
- Abstract summary: We present SeCoKD, a self-Knowledge Distillation (KD) training framework that aligns the student model with a heavily prompted variation.
We experiment with SeCoKD across three Large Language Models (LLMs) and six benchmarks focusing mainly on reasoning tasks.
Results show that our method outperforms the base model and Supervised Fine-tuning (SFT).
SeCoKD introduces few negative artifacts when evaluated on new tasks, making it more robust than Supervised Fine-tuning.
- Score: 9.048091324917515
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Previous studies have shown that demonstrations can significantly help Large Language Models (LLMs) perform better on given tasks. However, this so-called In-Context Learning (ICL) ability is very sensitive to the presented context, and often dozens of demonstrations are needed. In this work, we investigate whether we can reduce the number of shots while still maintaining competitive performance. We present SeCoKD, a self-Knowledge Distillation (KD) training framework that aligns the student model with a heavily prompted variation, thereby increasing the utilization of a single demonstration. We experiment with SeCoKD across three LLMs and six benchmarks focusing mainly on reasoning tasks. Results show that our method outperforms the base model and Supervised Fine-tuning (SFT), especially in zero-shot and one-shot settings, by 30% and 10%, respectively. Moreover, SeCoKD introduces few negative artifacts when evaluated on new tasks, making it more robust than Supervised Fine-tuning.
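The abstract does not state the training objective. As a rough illustration only, one common way to align a student with a teacher is to minimize the per-token KL divergence between their output distributions. The sketch below assumes the teacher is the same model given a heavily prompted input and the student sees a single demonstration. The function names and toy logits are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_logits, student_logits):
    """Mean per-token KL(teacher || student) over a sequence.

    Assumption: this KL objective is an illustrative choice, not the
    loss confirmed by the SeCoKD abstract.
    """
    losses = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(losses) / len(losses)


# Toy logits over a 3-token vocabulary for 2 output positions (hypothetical).
teacher = [[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]]  # heavily prompted model
student = [[1.0, 1.0, 0.0], [0.0, 1.0, 0.5]]   # same model, one demonstration

loss = distillation_loss(teacher, student)
print(f"distillation loss: {loss:.4f}")
```

In a real setup the logits would come from two forward passes of the same LLM with different prompts, and the loss would be backpropagated through the student pass only.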
Related papers
- Efficient Many-Shot In-Context Learning with Dynamic Block-Sparse Attention [45.20728476185864]
Many-shot in-context learning has recently shown promise as an alternative to finetuning.
This shifts the computational burden from training-time to inference-time.
We present Dynamic Block-Sparse Attention, a training-free framework for retrieval-based many-shot in-context learning.
(2025-03-11T17:30:58Z)
- Active Data Curation Effectively Distills Large-Scale Multimodal Models [66.23057263509027]
Knowledge distillation (KD) is the de facto standard for compressing large-scale models into smaller ones.
In this work we explore an alternative, yet simple approach -- active data curation as effective distillation for contrastive multimodal pretraining.
Our simple online batch selection method, ACID, outperforms strong KD baselines across various model-, data- and compute-configurations.
(2024-11-27T18:50:15Z)
- The Surprising Effectiveness of Test-Time Training for Abstract Reasoning [64.36534512742736]
We investigate the effectiveness of test-time training (TTT) as a mechanism for improving models' reasoning capabilities.
TTT significantly improves performance on ARC tasks, achieving up to 6x improvement in accuracy compared to base fine-tuned models.
Our findings suggest that explicit symbolic search is not the only path to improved abstract reasoning in neural language models.
(2024-11-11T18:59:45Z)
- Focused Large Language Models are Stable Many-Shot Learners [18.783939647966776]
In-Context Learning (ICL) enables large language models (LLMs) to achieve rapid task adaptation by learning from demonstrations.
We propose a training-free method FocusICL, which conducts triviality filtering to avoid attention being diverted by unimportant contents.
We show that FocusICL achieves an average performance improvement of 5.2% over vanilla ICL and scales well with many-shot demonstrations.
(2024-08-26T02:53:24Z)
- Many-Shot In-Context Learning [58.395589302800566]
Large language models (LLMs) excel at few-shot in-context learning (ICL).
We observe significant performance gains across a wide variety of generative and discriminative tasks.
Unlike few-shot learning, many-shot learning is effective at overriding pretraining biases.
(2024-04-17T02:49:26Z)
- Revisiting Knowledge Distillation for Autoregressive Language Models [88.80146574509195]
We propose a simple yet effective adaptive teaching approach (ATKD) to improve knowledge distillation (KD).
The core of ATKD is to reduce rote learning and make teaching more diverse and flexible.
Experiments on 8 LM tasks show that, with the help of ATKD, various baseline KD methods can achieve consistent and significant performance gains.
(2024-02-19T07:01:10Z)
- Revisiting Demonstration Selection Strategies in In-Context Learning [66.11652803887284]
Large language models (LLMs) have shown an impressive ability to perform a wide range of tasks using in-context learning (ICL).
In this work, we first revisit the factors contributing to this variance from both data and model aspects, and find that the choice of demonstration is both data- and model-dependent.
We propose a data- and model-dependent demonstration selection method, TopK + ConE, based on the assumption that the performance of a demonstration positively correlates with its contribution to the model's understanding of the test samples.
(2024-01-22T16:25:27Z)
- In-context Learning with Retrieved Demonstrations for Language Models: A Survey [23.24271704145876]
Few-shot in-context learners (ICL) are adept at adapting to new tasks with just a few demonstrations in the input context.
Instead of using a fixed set of demonstrations, one recent development is to retrieve demonstrations tailored to each input query.
We discuss and compare different design choices for retrieval models, retrieval training procedures, and inference algorithms.
(2024-01-21T23:34:42Z)
- Dynamic Demonstrations Controller for In-Context Learning [51.3439660534631]
In-Context Learning (ICL) is a new paradigm for natural language processing (NLP), where a large language model observes a small number of demonstrations and a test instance as its input.
Previous studies have revealed that ICL is sensitive to the selection and the ordering of demonstrations.
We propose a Dynamic Demonstrations Controller (D$^2$Controller), which can improve ICL performance by adjusting the number of demonstrations.
(2023-09-30T14:04:22Z)
- Scaling In-Context Demonstrations with Structured Attention [75.41845145597875]
We propose a better architectural design for in-context learning.
Structured Attention for In-Context Learning (SAICL) replaces full attention with a structured attention mechanism.
We show that SAICL achieves comparable or better performance than full attention while obtaining up to 3.4x inference speed-up.
(2023-07-05T23:26:01Z)
- Dr.ICL: Demonstration-Retrieved In-context Learning [29.142262267850704]
In-context learning (ICL), teaching a large language model to perform a task with few-shot demonstrations, has emerged as a strong paradigm for using LLMs.
Recent research suggests that retrieving semantically similar demonstrations to the input from a pool of available demonstrations results in better performance.
This work expands the applicability of retrieval-based ICL approaches by demonstrating that even simple word-overlap similarity measures such as BM25 outperform randomly selected demonstrations.
(2023-05-23T14:55:25Z)
- DisCo: Effective Knowledge Distillation For Contrastive Learning of Sentence Embeddings [36.37939188680754]
We propose an enhanced knowledge distillation framework termed Distill-Contrast (DisCo).
DisCo transfers the capability of a large sentence embedding model to a small student model on large unlabelled data.
We also propose Contrastive Knowledge Distillation (CKD) to enhance the consistencies among teacher model training, KD, and student model finetuning.
(2021-12-10T16:11:23Z)
- MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that under reasonable conditions MixKD gives rise to a smaller gap between the error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
(2020-11-01T18:47:51Z)