Sharpness-Aware Minimization Improves Language Model Generalization
- URL: http://arxiv.org/abs/2110.08529v1
- Date: Sat, 16 Oct 2021 09:44:06 GMT
- Title: Sharpness-Aware Minimization Improves Language Model Generalization
- Authors: Dara Bahri and Hossein Mobahi and Yi Tay
- Abstract summary: We show that Sharpness-Aware Minimization (SAM) can substantially improve the generalization of language models without much computational overhead.
We show that SAM is able to boost performance on SuperGLUE, GLUE, Web Questions, Natural Questions, Trivia QA, and TyDiQA, with particularly large gains when training data for these tasks is limited.
- Score: 46.83888240127077
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The allure of superhuman-level capabilities has led to considerable interest
in language models like GPT-3 and T5, wherein the research has, by and large,
revolved around new model architectures, training tasks, and loss objectives,
along with substantial engineering efforts to scale up model capacity and
dataset size. Comparatively little work has been done to improve the
generalization of these models through better optimization. In this work, we
show that Sharpness-Aware Minimization (SAM), a recently proposed optimization
procedure that encourages convergence to flatter minima, can substantially
improve the generalization of language models without much computational
overhead. We show that SAM is able to boost performance on SuperGLUE, GLUE, Web
Questions, Natural Questions, Trivia QA, and TyDiQA, with particularly large
gains when training data for these tasks is limited.
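For context, the sketch below illustrates the two-step SAM update the abstract refers to: first perturb the weights toward higher loss within an L2 ball of radius rho, then apply an ordinary gradient step using the gradient measured at that perturbed point. The toy quadratic loss, learning rate, and rho value are illustrative assumptions, not the paper's T5 fine-tuning setup.
```python
# Minimal NumPy sketch of a SAM update (assumed toy objective, not the
# paper's language-model training loss).
import numpy as np

def loss(w):
    # Hypothetical stand-in objective; a language model would plug in its
    # training loss here.
    return 0.5 * np.sum(w ** 2)

def grad(w):
    # Analytic gradient of the toy loss above.
    return w

def sam_step(w, lr=0.1, rho=0.05, eps=1e-12):
    """One SAM update: ascend to a nearby higher-loss point, then descend
    using the gradient measured there."""
    g = grad(w)
    # Step 1: perturbation toward higher loss, scaled to radius rho.
    e_w = rho * g / (np.linalg.norm(g) + eps)
    # Step 2: gradient at the perturbed weights, applied to the originals.
    g_sharp = grad(w + e_w)
    return w - lr * g_sharp

w = np.array([1.0, -2.0, 3.0])
for _ in range(10):
    w = sam_step(w)
print(loss(w), w)
```
Each update requires two gradient evaluations, which is the main source of SAM's extra cost relative to a plain gradient step.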
Related papers
- Super Tiny Language Models [3.8353434814956517]
This paper introduces a series of research efforts focused on Super Tiny Language Models (STLMs).
We explore innovative techniques such as byte-level tokenization with a pooling mechanism, weight tying, and efficient training strategies.
Our ultimate goal is to make high-performance language models more accessible and practical for a wide range of applications.
arXiv Detail & Related papers (2024-05-23T04:12:49Z)
- The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction [22.659005954676598]
We show that it is possible to significantly improve the performance of Large Language Models by selectively removing higher-order components of their weight matrices.
This simple intervention, which we call LAyer-SElective Rank reduction (LASER), can be done on a model after training has completed.
We present extensive experiments demonstrating the generality of this finding across language models and datasets.
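As a rough illustration (not the paper's exact procedure), the sketch below applies this kind of post-training rank reduction to a single NumPy weight matrix; the matrix shape and retained-rank fraction are assumptions made for the example, whereas LASER selects specific layers and ranks empirically.
```python
# Assumed illustration: truncated-SVD rank reduction of one weight matrix.
import numpy as np

def reduce_rank(weight, keep_fraction=0.05):
    """Keep only the largest singular components of a weight matrix."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    k = max(1, int(len(s) * keep_fraction))
    # Reconstruct from the k largest singular values/vectors only.
    return (u[:, :k] * s[:k]) @ vt[:k, :]

rng = np.random.default_rng(0)
w = rng.normal(size=(768, 768))          # stand-in for one MLP weight matrix
w_reduced = reduce_rank(w)
print(np.linalg.matrix_rank(w_reduced))  # rank drops to ~38
```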
arXiv Detail & Related papers (2023-12-21T03:51:08Z)
- Retrieval-based Knowledge Transfer: An Effective Approach for Extreme Large Language Model Compression [64.07696663255155]
Large-scale pre-trained language models (LLMs) have demonstrated exceptional performance in various natural language processing (NLP) tasks.
However, the massive size of these models poses huge challenges for their deployment in real-world applications.
We introduce a novel compression paradigm called Retrieval-based Knowledge Transfer (RetriKT) which effectively transfers the knowledge of LLMs to extremely small-scale models.
arXiv Detail & Related papers (2023-10-24T07:58:20Z)
- Teaching Language Models to Self-Improve through Interactive Demonstrations [83.9421355808174]
The self-improving ability of large language models has been shown to be absent in, and difficult to learn for, smaller models.
We introduce TriPosT, a training algorithm that endows smaller models with such self-improvement ability.
We show that our approach can improve a LLaMA-7b model's performance on math and reasoning tasks by up to 7.13%.
arXiv Detail & Related papers (2023-10-20T14:11:04Z)
- Small Models are Valuable Plug-ins for Large Language Models [65.29370906766997]
Large language models (LLMs) such as GPT-3 and GPT-4 are powerful, but their weights are often not publicly available.
We propose Super In-Context Learning (SuperICL), which allows black-box LLMs to work with locally fine-tuned smaller models.
arXiv Detail & Related papers (2023-05-15T17:59:01Z)
- Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost.
Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute cost and performance when scaling vision-language models.
arXiv Detail & Related papers (2023-03-13T16:00:31Z)
- Go-tuning: Improving Zero-shot Learning Abilities of Smaller Language Models [23.818751895205132]
Go-tuning is a geometry-guided self-supervised learning method.
Go-tuning enables T5-small (80M) to achieve zero-shot results competitive with large language models such as T5-XL (3B).
arXiv Detail & Related papers (2022-12-20T17:36:49Z)
- Top-KAST: Top-K Always Sparse Training [50.05611544535801]
We propose Top-KAST, a method that preserves constant sparsity throughout training.
We show that it performs comparably to or better than previous works when training models on the established ImageNet benchmark.
In addition to our ImageNet results, we also demonstrate our approach in the domain of language modeling.
arXiv Detail & Related papers (2021-06-07T11:13:05Z)