Two-stage LLM Fine-tuning with Less Specialization and More
Generalization
- URL: http://arxiv.org/abs/2211.00635v3
- Date: Tue, 12 Mar 2024 22:05:53 GMT
- Title: Two-stage LLM Fine-tuning with Less Specialization and More
Generalization
- Authors: Yihan Wang, Si Si, Daliang Li, Michal Lukasik, Felix Yu, Cho-Jui
Hsieh, Inderjit S Dhillon, Sanjiv Kumar
- Abstract summary: We propose Prompt Tuning with MOdel Tuning (ProMoT) to reduce format specialization and improve generalization.
ProMoT offloads task-specific format learning into additional and removable parameters by first doing prompt tuning and then fine-tuning the model itself with this soft prompt.
ProMoT can even enhance generalization on in-context learning tasks that are semantically related to the fine-tuned task.
- Score: 93.12197594813378
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pretrained large language models (LLMs) are general purpose problem solvers
applicable to a diverse set of tasks with prompts. They can be further improved
towards a specific task by fine-tuning on a specialized dataset. However,
fine-tuning usually makes the model narrowly specialized on this dataset with
reduced general in-context learning performance, which is undesirable whenever
the fine-tuned model needs to handle additional tasks where no fine-tuning data
is available. In this work, we first demonstrate that fine-tuning on a single
task indeed decreases LLMs' general in-context learning performance. We
discover one important cause of such forgetting, format specialization, where
the model overfits to the format of the fine-tuned task. We further show that
format specialization happens at the very beginning of fine-tuning. To solve
this problem, we propose Prompt Tuning with MOdel Tuning (ProMoT), a simple yet
effective two-stage fine-tuning framework that reduces format specialization
and improves generalization. ProMoT offloads task-specific format learning into
additional and removable parameters by first doing prompt tuning and then
fine-tuning the model itself with this soft prompt attached. With experiments
on several fine-tuning tasks and 8 in-context evaluation tasks, we show that
ProMoT achieves comparable performance on fine-tuned tasks to standard
fine-tuning, but with much less loss of in-context learning performance across
a broad range of out-of-domain evaluation tasks. More importantly, ProMoT can
even enhance generalization on in-context learning tasks that are semantically
related to the fine-tuned task, e.g. ProMoT on En-Fr translation significantly
improves performance on other language pairs, and ProMoT on NLI improves
performance on summarization. Experiments also show that ProMoT can improve the
generalization performance of multi-task training.
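The two-stage procedure is simple to express in code. Below is a minimal PyTorch-style sketch of the idea, not the authors' implementation: `model`, `soft_prompt` (e.g. a `torch.randn(prompt_len, d_model)` tensor), and `batches` are placeholders, the learning rates and step counts are arbitrary, and an encoder-decoder interface returning a `.loss` (T5-style `model(inputs_embeds=..., labels=...)`) is assumed. Whether the soft prompt stays frozen in the second stage is also an assumption about the exact recipe.

```python
import torch

def prompt_tune(model, soft_prompt, batches, steps=1000, lr=1e-3):
    """Stage 1: learn only the soft prompt; the backbone stays frozen."""
    for p in model.parameters():
        p.requires_grad_(False)
    soft_prompt.requires_grad_(True)
    opt = torch.optim.Adam([soft_prompt], lr=lr)
    for _, (input_embeds, labels) in zip(range(steps), batches):
        prompt = soft_prompt.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
        loss = model(inputs_embeds=torch.cat([prompt, input_embeds], dim=1),
                     labels=labels).loss
        opt.zero_grad(); loss.backward(); opt.step()
    return soft_prompt.detach()

def model_tune_with_prompt(model, soft_prompt, batches, steps=1000, lr=1e-5):
    """Stage 2: fine-tune the backbone with the learned prompt attached.
    The prompt is kept fixed here (an assumption) so that it, rather than the
    backbone, continues to carry the task format; it can be dropped afterwards."""
    for p in model.parameters():
        p.requires_grad_(True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _, (input_embeds, labels) in zip(range(steps), batches):
        prompt = soft_prompt.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
        loss = model(inputs_embeds=torch.cat([prompt, input_embeds], dim=1),
                     labels=labels).loss
        opt.zero_grad(); loss.backward(); opt.step()
    return model
```

The design intent, per the abstract, is that the removable prompt absorbs the task format, so discarding it after stage 2 leaves a backbone that is less format-specialized.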
Related papers
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
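As a rough illustration of the self-synthesis loop described in the SELF-GUIDE entry above (the full multi-stage pipeline and its quality filtering are omitted), the sketch below has the student model write its own input-output pairs and then fine-tunes on them; `generate_fn` and `finetune_fn` are hypothetical caller-supplied helpers, not the paper's API.

```python
def self_synthesize_and_finetune(generate_fn, finetune_fn, task_instruction,
                                 seed_examples, n_pairs=1000):
    """Hedged sketch of self-synthetic fine-tuning: the student LLM writes its
    own (input, output) pairs for a task, then is fine-tuned on them.
    generate_fn(prompt) -> str and finetune_fn(pairs) are caller-supplied."""
    pairs = []
    while len(pairs) < n_pairs:
        # 1. Prompt the model for a fresh task input, conditioned on a few seeds.
        seed_block = "\n".join(f"Input: {x}" for x, _ in seed_examples)
        new_input = generate_fn(f"{task_instruction}\n{seed_block}\nInput:")
        # 2. Ask the same model to answer the input it just produced.
        output = generate_fn(f"{task_instruction}\nInput: {new_input}\nOutput:")
        pairs.append((new_input, output))
    # 3. Fine-tune the student on its self-generated data.
    return finetune_fn(pairs)
```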
- Fully Fine-tuned CLIP Models are Efficient Few-Shot Learners [8.707819647492467]
We explore capturing task-specific information via meticulous refinement of entire Vision-Language Models (VLMs).
To mitigate these issues, we propose a framework named CLIP-CITE that designs a discriminative visual-text task.
arXiv Detail & Related papers (2024-07-04T15:22:54Z)
- TAIA: Large Language Models are Out-of-Distribution Data Learners [30.57872423927015]
We propose an effective inference-time intervention method: Training All parameters but Inferring with only Attention (TAIA).
TAIA achieves superior improvements compared to both the fully fine-tuned model and the base model in most scenarios.
The high tolerance of TAIA to data mismatches makes it resistant to jailbreaking tuning and enhances specialized tasks using general data.
arXiv Detail & Related papers (2024-05-30T15:57:19Z)
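Taking the TAIA name literally, "training all parameters but inferring with only attention" suggests keeping the fine-tuned self-attention weights at inference while reverting everything else (e.g. feed-forward blocks) to the base checkpoint. The sketch below illustrates that reading only; the name-matching heuristic and the merge function are illustrative assumptions, not the paper's code.

```python
def is_attention_param(name: str) -> bool:
    """Heuristic: parameter names typical of self-attention modules
    (depends entirely on the model's naming convention)."""
    return any(k in name for k in ("attn", "attention",
                                   "q_proj", "k_proj", "v_proj", "o_proj"))

def merge_for_inference(base_state: dict, tuned_state: dict) -> dict:
    """Keep fine-tuned attention weights, revert all other weights to base."""
    return {name: (tuned_state[name] if is_attention_param(name) else base_w)
            for name, base_w in base_state.items()}
```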
- Unveiling the Generalization Power of Fine-Tuned Large Language Models [81.70754292058258]
We investigate whether fine-tuning affects the generalization ability intrinsic to Large Language Models (LLMs).
Our main findings reveal that models fine-tuned on generation and classification tasks exhibit dissimilar behaviors in generalizing to different domains and tasks.
We observe that integrating the in-context learning strategy during fine-tuning on generation tasks can enhance the model's generalization ability.
arXiv Detail & Related papers (2024-03-14T08:18:59Z)
- LEVI: Generalizable Fine-tuning via Layer-wise Ensemble of Different Views [28.081794908107604]
Fine-tuning is used to leverage the power of pre-trained foundation models in new downstream tasks.
Recent studies have observed challenges in the generalization of fine-tuned models to unseen distributions.
We propose a novel generalizable fine-tuning method LEVI, where the pre-trained model is adaptively ensembled layer-wise with a small task-specific model.
arXiv Detail & Related papers (2024-02-07T08:16:40Z)
- Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models [125.91897197446379]
We find that MoE models benefit more from instruction tuning than dense models.
Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks.
arXiv Detail & Related papers (2023-05-24T04:22:26Z)
- Preventing Catastrophic Forgetting in Continual Learning of New Natural Language Tasks [17.879087904904935]
Multi-Task Learning (MTL) is widely accepted in Natural Language Processing as a standard technique for learning multiple related tasks in one model.
As systems usually evolve over time, adding a new task to an existing MTL model usually requires retraining the model from scratch on all the tasks.
In this paper, we approach the problem of incrementally expanding MTL models' capability to solve new tasks over time by distilling the knowledge of an already trained model on n tasks into a new one for solving n+1 tasks.
arXiv Detail & Related papers (2023-02-22T00:18:25Z)
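The distillation step described in the entry above can be pictured with a short hedged sketch: the new (n+1)-task model is trained with a supervised loss on the new task plus a KL term that keeps it close to the frozen n-task teacher. The data used for distillation, the temperature, and the loss weighting are illustrative assumptions, not the paper's recipe.

```python
import torch
import torch.nn.functional as F

def distillation_step(new_model, old_model, old_task_batch, new_task_batch,
                      optimizer, temperature=2.0, alpha=0.5):
    """One step expanding an n-task model to n+1 tasks: cross-entropy on the
    new task plus distillation toward the frozen old model on old-task inputs."""
    old_inputs, _ = old_task_batch
    new_inputs, new_labels = new_task_batch

    # Supervised cross-entropy on the new task.
    ce = F.cross_entropy(new_model(new_inputs), new_labels)

    # Soft-label distillation from the frozen old model on old-task inputs.
    with torch.no_grad():
        teacher = F.softmax(old_model(old_inputs) / temperature, dim=-1)
    student = F.log_softmax(new_model(old_inputs) / temperature, dim=-1)
    kd = F.kl_div(student, teacher, reduction="batchmean") * temperature ** 2

    loss = alpha * ce + (1 - alpha) * kd
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```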
- Task Adaptive Parameter Sharing for Multi-Task Learning [114.80350786535952]
Task Adaptive Parameter Sharing (TAPS) is a method for tuning a base model to a new task by adaptively modifying a small, task-specific subset of layers.
Compared to other methods, TAPS retains high accuracy on downstream tasks while introducing few task-specific parameters.
We evaluate our method on a suite of fine-tuning tasks and architectures (ResNet, DenseNet, ViT) and show that it achieves state-of-the-art performance while being simple to implement.
arXiv Detail & Related papers (2022-03-30T23:16:07Z)
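One way to picture "adaptively modifying a small, task-specific subset of layers" is a per-layer gate that chooses between the frozen shared weights and a task-specific copy, with a sparsity penalty keeping most gates off. The module below is a hedged sketch of that idea in PyTorch, not the TAPS implementation; the soft sigmoid gate and its initialization are assumptions.

```python
import torch
import torch.nn as nn

class GatedTaskLinear(nn.Module):
    """Sketch: frozen shared layer plus a task-specific copy, mixed by a
    learnable gate; a penalty on the gate favors keeping the layer shared."""

    def __init__(self, pretrained: nn.Linear):
        super().__init__()
        self.shared = pretrained
        for p in self.shared.parameters():
            p.requires_grad_(False)                  # shared weights stay frozen
        self.task_specific = nn.Linear(pretrained.in_features,
                                       pretrained.out_features,
                                       bias=pretrained.bias is not None)
        self.task_specific.load_state_dict(pretrained.state_dict())
        self.gate_logit = nn.Parameter(torch.tensor(-2.0))  # start mostly shared

    def forward(self, x):
        g = torch.sigmoid(self.gate_logit)
        return (1 - g) * self.shared(x) + g * self.task_specific(x)

    def sparsity_penalty(self):
        return torch.sigmoid(self.gate_logit)        # encourages few tuned layers
```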
- Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language Transfer Learning [59.38343286807997]
We propose Model-Agnostic Multitask Fine-tuning (MAMF) for vision-language models on unseen tasks.
Compared with model-agnostic meta-learning (MAML), MAMF discards the bi-level optimization and uses only first-order gradients.
We show that MAMF consistently outperforms the classical fine-tuning method for few-shot transfer learning on five benchmark datasets.
arXiv Detail & Related papers (2022-03-09T17:26:53Z)
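Reading "discards the bi-level optimization and uses only first-order gradients" literally, the sketch below shows a plain multitask fine-tuning loop: sample a task, take an ordinary gradient step on its batch, repeat, with no inner/outer MAML loop. The uniform task sampling and the optimizer are illustrative assumptions, not the MAMF recipe.

```python
import random
import torch

def multitask_finetune(model, task_loaders, loss_fn, steps=1000, lr=1e-4):
    """First-order multitask fine-tuning sketch: one ordinary gradient step
    per sampled task, no second-order gradients or bi-level optimization."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    iters = {name: iter(loader) for name, loader in task_loaders.items()}
    for _ in range(steps):
        name = random.choice(list(task_loaders))     # sample a task uniformly
        try:
            inputs, targets = next(iters[name])
        except StopIteration:                        # restart an exhausted loader
            iters[name] = iter(task_loaders[name])
            inputs, targets = next(iters[name])
        loss = loss_fn(model(inputs), targets)
        opt.zero_grad(); loss.backward(); opt.step()
    return model
```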
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.