MEND: Meta dEmonstratioN Distillation for Efficient and Effective In-Context Learning
- URL: http://arxiv.org/abs/2403.06914v2
- Date: Tue, 12 Mar 2024 15:52:14 GMT
- Title: MEND: Meta dEmonstratioN Distillation for Efficient and Effective In-Context Learning
- Authors: Yichuan Li, Xiyao Ma, Sixing Lu, Kyumin Lee, Xiaohu Liu, Chenlei Guo
- Abstract summary: Large language models (LLMs) make predictions for a given test input together with a few input-output pairs (demonstrations).
Existing solutions attempt to distill lengthy demonstrations into compact vectors.
We present Meta dEmonstratioN Distillation (MEND), where a language model learns to distill any lengthy demonstrations into vectors without retraining for a new downstream task.
- Score: 9.271196993624944
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have demonstrated impressive in-context learning
(ICL) capabilities, where an LLM makes predictions for a given test input
together with a few input-output pairs (demonstrations). Nevertheless, the
inclusion of demonstrations leads to a quadratic increase in the computational
overhead of the self-attention mechanism. Existing solutions attempt to distill
lengthy demonstrations into compact vectors. However, they often require
task-specific retraining or compromise the LLM's in-context learning performance.
To mitigate these challenges, we present Meta dEmonstratioN Distillation
(MEND), where a language model learns to distill any lengthy demonstrations
into vectors without retraining for a new downstream task. We exploit
knowledge distillation to enhance alignment between MEND and the LLM, achieving
both efficiency and effectiveness simultaneously. MEND is endowed with the
meta-knowledge of distilling demonstrations through a two-stage training
process, which includes meta-distillation pretraining and fine-tuning.
Comprehensive evaluations across seven diverse ICL task partitions using
decoder-only (GPT-2) and encoder-decoder (T5) architectures attest to MEND's
prowess. It not only matches but often outperforms vanilla ICL as well as other
state-of-the-art distillation models, while significantly reducing the
computational demands. This innovation promises enhanced scalability and
efficiency for the practical deployment of large language models.
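To make the idea concrete, the sketch below shows one way a demonstration distiller of this kind could be wired up: a small trainable module cross-attends a handful of learnable query vectors over the demonstration embeddings, and the resulting compressed vectors are prepended to the frozen LLM's input so the test input no longer attends over the full demonstration text. The `DemoDistiller` class, its dimensions, and the transformer-decoder compressor are illustrative assumptions, not MEND's exact architecture; the knowledge-distillation losses and two-stage (meta-distillation pretraining, then fine-tuning) schedule described in the abstract are omitted here.

```python
import torch
import torch.nn as nn

class DemoDistiller(nn.Module):
    """Illustrative demonstration distiller (not MEND's exact architecture).

    Compresses a long demonstration token sequence into `num_soft_tokens`
    vectors that can be prepended to a frozen LLM's input embeddings.
    """

    def __init__(self, hidden_dim=768, num_soft_tokens=16, num_layers=2):
        super().__init__()
        # Learnable query vectors that "absorb" the demonstration content.
        self.soft_queries = nn.Parameter(torch.randn(num_soft_tokens, hidden_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
        self.compressor = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, demo_embeds):
        # demo_embeds: (batch, demo_len, hidden_dim) embeddings of the demonstrations.
        batch = demo_embeds.size(0)
        queries = self.soft_queries.unsqueeze(0).expand(batch, -1, -1)
        # Cross-attend the learnable queries over the demonstration embeddings.
        return self.compressor(tgt=queries, memory=demo_embeds)  # (batch, num_soft_tokens, hidden_dim)


# Usage sketch: prepend the distilled vectors to the test input's embeddings
# and feed the much shorter sequence to the frozen LLM.
distiller = DemoDistiller()
demo_embeds = torch.randn(1, 512, 768)   # long demonstrations
test_embeds = torch.randn(1, 32, 768)    # short test input
compressed = distiller(demo_embeds)      # (1, 16, 768) instead of (1, 512, 768)
llm_input = torch.cat([compressed, test_embeds], dim=1)  # 48 positions instead of 544
```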
Related papers
- CoMMIT: Coordinated Instruction Tuning for Multimodal Large Language Models [68.64605538559312]
In this paper, we analyze the MLLM instruction tuning from both theoretical and empirical perspectives.
Inspired by our findings, we propose a measurement to quantitatively evaluate the learning balance.
In addition, we introduce an auxiliary loss regularization method to promote updates to the generation distribution of MLLMs.
arXiv Detail & Related papers (2024-07-29T23:18:55Z)
- Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization [0.6445087473595953]
Large language models (LLMs) demonstrate outstanding performance in various tasks in machine learning.
However, deploying LLM inference poses challenges due to its high compute and memory requirements.
We present Tender, an algorithm-hardware co-design solution that enables efficient deployment of LLM inference at low precision.
arXiv Detail & Related papers (2024-06-16T09:51:55Z)
- Rectifying Demonstration Shortcut in In-Context Learning [15.08431909212102]
Large language models (LLMs) can solve various tasks with only a few demonstrations by utilizing their in-context learning (ICL) abilities.
However, LLMs often rely on their pre-trained semantic priors about the demonstrations rather than on the input-label relationships when making ICL predictions.
arXiv Detail & Related papers (2024-03-14T15:30:14Z)
- Take One Step at a Time to Know Incremental Utility of Demonstration: An Analysis on Reranking for Few-Shot In-Context Learning [23.932500424117244]
In-Context Learning (ICL) is an emergent capability of large language models (LLMs).
Previous studies have shown that using LLMs' outputs as labels is effective in training models to select demonstrations.
This paper presents an analysis of different utility functions, focusing on the LLM's output probability given the ground-truth output (see the sketch below).
arXiv Detail & Related papers (2023-11-16T07:03:54Z)
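As a rough illustration of a utility function based on the LLM's output probability of the ground-truth output, the snippet below scores a candidate demonstration by how much it raises the log-probability of the gold answer. The use of GPT-2 and the difference-of-log-probs formula are assumptions made for illustration, not the paper's definition.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# gpt2 is used here only as a small stand-in causal LM.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def answer_logprob(prompt: str, answer: str) -> float:
    """Sum of log-probabilities the model assigns to `answer` following `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probs at position i predict the token at position i + 1.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    answer_positions = range(prompt_ids.size(1) - 1, input_ids.size(1) - 1)
    token_lp = [log_probs[0, pos, input_ids[0, pos + 1]] for pos in answer_positions]
    return float(sum(token_lp))

def incremental_utility(demo: str, query: str, gold: str) -> float:
    """How much does prepending `demo` raise the log-prob of the gold output?"""
    return answer_logprob(demo + "\n" + query, gold) - answer_logprob(query, gold)
```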
- Dynamic Demonstrations Controller for In-Context Learning [51.3439660534631]
In-Context Learning (ICL) is a new paradigm for natural language processing (NLP), where a large language model observes a small number of demonstrations and a test instance as its input.
Previous studies have revealed that ICL is sensitive to the selection and the ordering of demonstrations.
We propose a Dynamic Demonstrations Controller (D²Controller), which improves ICL performance by adjusting the number of demonstrations (see the sketch below).
arXiv Detail & Related papers (2023-09-30T14:04:22Z)
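A naive baseline that shows why the number of demonstrations matters is to sweep k on a small development set and keep the best-scoring value. The sketch below does exactly that; it is not the D²Controller algorithm, and the `run_icl` evaluation callback is a hypothetical wrapper around an LLM call.

```python
def choose_num_demonstrations(run_icl, demo_pool, dev_set, candidate_ks=(1, 2, 4, 8, 16)):
    """Pick the demonstration count k that maximizes dev-set accuracy.

    run_icl(demos, example) -> predicted label  (assumed to wrap an LLM call)
    demo_pool: list of (input, label) pairs to draw demonstrations from
    dev_set:   list of (input, gold_label) pairs used for selection
    """
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        demos = demo_pool[:k]  # a fixed prefix; selection/ordering strategies would go here
        correct = sum(run_icl(demos, x) == y for x, y in dev_set)
        acc = correct / len(dev_set)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc
```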
- Scaling In-Context Demonstrations with Structured Attention [75.41845145597875]
We propose a better architectural design for in-context learning.
Structured Attention for In-Context Learning (SAICL) replaces full attention with a structured attention mechanism (see the mask sketch below).
We show that SAICL achieves comparable or better performance than full attention while obtaining up to 3.4x inference speed-up.
arXiv Detail & Related papers (2023-07-05T23:26:01Z)
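One common way to structure attention over demonstrations is a block mask in which each demonstration attends only within itself while the test tokens attend to everything, removing the quadratic interaction between demonstrations. The mask builder below is a hedged sketch of that idea, not necessarily SAICL's exact design.

```python
import torch

def structured_icl_mask(demo_lengths, test_length):
    """Boolean attention mask (True = may attend) for [demo_1, ..., demo_n, test].

    Each demonstration block attends only within itself, so demonstrations can be
    encoded independently of one another; the test tokens attend to all positions.
    """
    total = sum(demo_lengths) + test_length
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for length in demo_lengths:
        mask[start:start + length, start:start + length] = True  # within-demo attention
        start += length
    mask[start:, :] = True  # test tokens attend to demonstrations and to themselves
    return mask

# Example: three demonstrations of 5 tokens each plus an 8-token test input.
mask = structured_icl_mask([5, 5, 5], 8)  # shape (23, 23)
```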
- Dr.ICL: Demonstration-Retrieved In-context Learning [29.142262267850704]
In-context learning (ICL), teaching a large language model to perform a task with few-shot demonstrations, has emerged as a strong paradigm for using LLMs.
Recent research suggests that retrieving semantically similar demonstrations to the input from a pool of available demonstrations results in better performance.
This work expands the applicability of retrieval-based ICL approaches by demonstrating that even simple word-overlap similarity measures such as BM25 outperform randomly selected demonstrations (see the retrieval sketch below).
arXiv Detail & Related papers (2023-05-23T14:55:25Z)
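To ground the BM25 point, the snippet below retrieves the k most lexically similar demonstrations for a test input using the rank_bm25 package; the toy demonstration pool and whitespace tokenization are assumptions made for illustration.

```python
from rank_bm25 import BM25Okapi

# Hypothetical demonstration pool of (input, label) pairs.
demo_pool = [
    ("the movie was a delight from start to finish", "positive"),
    ("a tedious, overlong film with no payoff", "negative"),
    ("the acting was wooden but the score was lovely", "mixed"),
]

# Index demonstration inputs with BM25 over whitespace tokens.
corpus_tokens = [text.split() for text, _ in demo_pool]
bm25 = BM25Okapi(corpus_tokens)

def retrieve_demonstrations(query: str, k: int = 2):
    """Return the k demonstrations with the highest BM25 word-overlap score."""
    scores = bm25.get_scores(query.split())
    top = sorted(range(len(demo_pool)), key=lambda i: scores[i], reverse=True)[:k]
    return [demo_pool[i] for i in top]

# The retrieved pairs would be formatted as demonstrations ahead of the test input.
demos = retrieve_demonstrations("the film was a total delight")
```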
- Iterative Forward Tuning Boosts In-Context Learning in Language Models [88.25013390669845]
In this study, we introduce a novel two-stage framework to boost in-context learning in large language models (LLMs).
Specifically, our framework delineates the ICL process into two distinct stages: Deep-Thinking and test stages.
The Deep-Thinking stage incorporates a unique attention mechanism, i.e., iterative enhanced attention, which enables multiple rounds of information accumulation.
arXiv Detail & Related papers (2023-05-22T13:18:17Z)
- Pre-training Language Model as a Multi-perspective Course Learner [103.17674402415582]
This study proposes a multi-perspective course learning (MCL) method for sample-efficient pre-training.
In this study, three self-supervision courses are designed to alleviate inherent flaws of "tug-of-war" dynamics.
Our method significantly improves ELECTRA's average performance by 2.8% and 3.2% absolute points on the GLUE and SQuAD 2.0 benchmarks, respectively.
arXiv Detail & Related papers (2023-05-06T09:02:10Z)
- MC-BERT: Efficient Language Pre-Training via a Meta Controller [96.68140474547602]
Large-scale pre-training is computationally expensive.
ELECTRA, an early attempt to accelerate pre-training, trains a discriminative model that predicts whether each input token was replaced by a generator (see the sketch below).
We propose a novel meta-learning framework, MC-BERT, to achieve better efficiency and effectiveness.
arXiv Detail & Related papers (2020-06-10T09:22:19Z)
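For context on the ELECTRA-style objective mentioned above, the sketch below shows replaced-token detection: a small generator fills in masked tokens and a discriminator labels every position as original or replaced. The masking rate, loss weighting, argmax sampling, and the assumed generator/discriminator interfaces are illustrative simplifications, and this is the plain ELECTRA objective rather than MC-BERT's meta-controller variant.

```python
import torch
import torch.nn.functional as F

def replaced_token_detection_step(generator, discriminator, input_ids, mask_prob=0.15, mask_id=0):
    """One ELECTRA-style training step (illustrative, not MC-BERT's meta controller).

    generator(ids) -> logits over the vocabulary, shape (batch, seq, vocab)
    discriminator(ids) -> per-token "was this token replaced?" logits, shape (batch, seq)
    """
    # 1. Mask a random subset of positions and let the generator fill them in.
    mask = torch.rand(input_ids.shape) < mask_prob
    masked = input_ids.masked_fill(mask, mask_id)
    gen_logits = generator(masked)
    gen_loss = F.cross_entropy(gen_logits[mask], input_ids[mask])

    # 2. Build the corrupted sequence from the generator's predictions.
    #    (ELECTRA samples from the generator; argmax is used here for simplicity.)
    with torch.no_grad():
        predicted = gen_logits.argmax(dim=-1)
    corrupted = torch.where(mask, predicted, input_ids)

    # 3. The discriminator predicts, for every token, whether it was replaced.
    is_replaced = (corrupted != input_ids).float()
    disc_logits = discriminator(corrupted)
    disc_loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)

    # ELECTRA weights the discriminator loss much higher than the generator loss.
    return gen_loss + 50.0 * disc_loss
```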
This list is automatically generated from the titles and abstracts of the papers in this site.