MediSwift: Efficient Sparse Pre-trained Biomedical Language Models
- URL: http://arxiv.org/abs/2403.00952v1
- Date: Fri, 1 Mar 2024 20:03:44 GMT
- Title: MediSwift: Efficient Sparse Pre-trained Biomedical Language Models
- Authors: Vithursan Thangarasa, Mahmoud Salem, Shreyas Saxena, Kevin Leong, Joel
Hestness, Sean Lie
- Abstract summary: MediSwift is a suite of biomedical LMs that leverage sparse pre-training on domain-specific biomedical text data.
By inducing up to 75% weight sparsity during the pre-training phase, MediSwift achieves a 2-2.5x reduction in training FLOPs.
Our results show that sparse pre-training, along with dense fine-tuning and soft prompting, offers an effective method for creating high-performing, computationally efficient models in specialized domains.
- Score: 2.327390371420762
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are typically trained on general source data for
various domains, but a recent surge in domain-specific LLMs has shown their
potential to outperform general-purpose models in domain-specific tasks (e.g.,
biomedicine). Although domain-specific pre-training enhances efficiency and
leads to smaller models, the computational costs of training these LLMs remain
high, posing budgeting challenges. We introduce MediSwift, a suite of
biomedical LMs that leverage sparse pre-training on domain-specific biomedical
text data. By inducing up to 75% weight sparsity during the pre-training phase,
MediSwift achieves a 2-2.5x reduction in training FLOPs. Notably, all sparse
pre-training was performed on the Cerebras CS-2 system, which is specifically
designed to realize the acceleration benefits from unstructured weight
sparsity, thereby significantly enhancing the efficiency of the MediSwift
models. Through subsequent dense fine-tuning and strategic soft prompting,
MediSwift models outperform existing LLMs up to 7B parameters on biomedical
tasks, setting new benchmarks w.r.t efficiency-accuracy on tasks such as
PubMedQA. Our results show that sparse pre-training, along with dense
fine-tuning and soft prompting, offers an effective method for creating
high-performing, computationally efficient models in specialized domains.
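The paper itself ships no reference code; the sketch below is only a minimal illustration of the recipe the abstract describes: impose a fixed unstructured weight-sparsity mask during pre-training, then drop the mask and fine-tune densely. The magnitude-based mask choice, module selection, and helper names are assumptions made for illustration, not MediSwift's actual implementation.

```python
import torch
import torch.nn as nn

def apply_static_sparsity(model: nn.Module, sparsity: float = 0.75) -> dict:
    """Zero out a fixed fraction of every nn.Linear weight matrix and return
    the binary masks so the zeros can be re-imposed after each update."""
    masks = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            w = module.weight.data
            k = max(1, int(sparsity * w.numel()))
            # One simple mask choice: prune the k smallest-magnitude weights
            # (random masks are another common option for sparse pre-training).
            threshold = w.abs().flatten().kthvalue(k).values
            mask = (w.abs() > threshold).float()
            w.mul_(mask)
            masks[name] = mask
    return masks

def reapply_masks(model: nn.Module, masks: dict) -> None:
    """Call after every optimizer step during the sparse pre-training phase."""
    for name, module in model.named_modules():
        if name in masks:
            module.weight.data.mul_(masks[name])

# Dense fine-tuning: simply stop calling reapply_masks(), so all weights,
# including previously pruned positions, are free to update again.
```

Note that on GPUs a mask like this only simulates sparsity; the FLOP savings quoted in the abstract depend on hardware such as the Cerebras CS-2 that can exploit unstructured sparsity directly.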
Related papers
- The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws [51.608402959163925]
We present the first systematic exploration of optimal sparse pre-training configurations for large language models.
We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss.
We propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training.
arXiv Detail & Related papers (2025-01-21T20:23:22Z)
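To make the idea above concrete, a Chinchilla-style loss form with the fixed parameter count replaced by the average parameter count over the run might be written as follows; the notation is illustrative, not the paper's exact fit.

```latex
% Illustrative only: the fixed N is replaced by the average non-zero
% parameter count over the pre-training run.
L(\bar{N}, D) = E + \frac{A}{\bar{N}^{\alpha}} + \frac{B}{D^{\beta}},
\qquad
\bar{N} = \frac{1}{T} \int_{0}^{T} N(t)\, dt
```

Here N(t) is the number of non-zero parameters at training step t (it shrinks while pruning runs between the 25% and 75% compute marks) and D is the number of training tokens.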
- Adaptive Pruning for Large Language Models with Structural Importance Awareness [66.2690963378878]
Large language models (LLMs) have significantly improved language understanding and generation capabilities.
LLMs are difficult to deploy on resource-constrained edge devices due to their high computational and storage resource demands.
We propose structurally-aware adaptive pruning (SAAP) to significantly reduce the computational and memory costs while maintaining model performance.
arXiv Detail & Related papers (2024-12-19T18:08:04Z)
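SAAP's exact importance metric is not reproduced here; the sketch below only illustrates the general structured-pruning pattern such methods build on: score whole output channels, keep the top fraction, and shrink the layer so the savings show up on ordinary dense hardware. The L2-norm score and function names are assumptions.

```python
import torch
import torch.nn as nn

def prune_linear_rows(layer: nn.Linear, keep_ratio: float = 0.7) -> nn.Linear:
    """Structured pruning sketch: rank output neurons (weight rows) by their
    L2 norm and keep only the top fraction, returning a smaller dense layer.
    Real methods such as SAAP use more sophisticated, activation-aware scores."""
    with torch.no_grad():
        scores = layer.weight.norm(p=2, dim=1)            # one score per output neuron
        n_keep = max(1, int(keep_ratio * layer.out_features))
        keep = torch.topk(scores, n_keep).indices.sort().values
        pruned = nn.Linear(layer.in_features, n_keep, bias=layer.bias is not None)
        pruned.weight.copy_(layer.weight[keep])
        if layer.bias is not None:
            pruned.bias.copy_(layer.bias[keep])
    return pruned

# Downstream layers that consume this output must be sliced to the surviving
# channels as well; handling that consistently is what makes structured
# pruning work end to end.
```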
- A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs [74.35290684163718]
A primary challenge in large language model (LLM) development is their onerous pre-training cost.
This paper explores a promising paradigm to improve LLM pre-training efficiency and quality by leveraging a small language model (SLM).
arXiv Detail & Related papers (2024-10-24T14:31:52Z)
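The paper's full recipe is richer than this, but the core transfer step in SLM-assisted pre-training can be pictured as a soft-label loss: the large model is pulled toward the small model's predictions in addition to the usual next-token objective. The weighting and temperature below are placeholders.

```python
import torch
import torch.nn.functional as F

def soft_label_loss(large_logits: torch.Tensor,
                    small_logits: torch.Tensor,
                    targets: torch.Tensor,
                    alpha: float = 0.5,
                    temperature: float = 2.0) -> torch.Tensor:
    """Mix the usual next-token cross-entropy with a KL term that pulls the
    large model toward the small model's temperature-softened predictions.
    Shapes: logits are (batch*seq, vocab), targets are (batch*seq,)."""
    ce = F.cross_entropy(large_logits, targets)
    kl = F.kl_div(
        F.log_softmax(large_logits / temperature, dim=-1),
        F.softmax(small_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return (1 - alpha) * ce + alpha * kl
```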
- Scaling Laws for Predicting Downstream Performance in LLMs [75.28559015477137]
This work focuses on the pre-training loss as a more-efficient metric for performance estimation.
We extend the power law analytical function to predict domain-specific pre-training loss based on FLOPs across data sources.
We employ a two-layer neural network to model the non-linear relationship between multiple domain-specific losses and downstream performance.
arXiv Detail & Related papers (2024-10-11T04:57:48Z)
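A minimal sketch of the second stage described above: a two-layer network that maps per-domain pre-training losses to a predicted downstream score. The hidden width and training setup are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class DownstreamPredictor(nn.Module):
    """Two-layer network mapping a vector of domain-specific pre-training
    losses (one entry per data source) to a predicted downstream metric."""
    def __init__(self, num_domains: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_domains, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, domain_losses: torch.Tensor) -> torch.Tensor:
        return self.net(domain_losses)

# Fit on (domain losses, benchmark score) pairs gathered from smaller runs,
# then query it at FLOP budgets that have not been trained yet.
```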
- The Impact of LoRA Adapters for LLMs on Clinical NLP Classification Under Data Limitations [4.72457683445805]
Fine-tuning Large Language Models (LLMs) for clinical Natural Language Processing (NLP) poses significant challenges due to the domain gap and limited data availability.
This study investigates the effectiveness of various adapter techniques, comparable to Low-Rank Adaptation (LoRA).
We fine-tuned biomedical pre-trained models, including CamemBERT-bio, AliBERT, and DrBERT, alongside two Transformer-based models.
arXiv Detail & Related papers (2024-07-27T16:48:03Z)
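For readers unfamiliar with adapters, a minimal LoRA wrapper looks like the following; the rank, scaling, and choice of wrapped module are placeholders rather than the study's settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update:
    y = W x + (alpha / r) * B A x, where A and B are the only new parameters."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # original weights stay frozen
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)   # the low-rank update starts at zero
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))
```

In practice, libraries such as Hugging Face's peft package provide this wrapping for whole models.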
- Efficient Continual Pre-training by Mitigating the Stability Gap [68.49269649759005]
We study the behavior of Large Language Models (LLMs) during continual pre-training.
We propose three effective strategies to enhance LLM performance within a fixed compute budget.
Our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget.
arXiv Detail & Related papers (2024-06-21T02:28:37Z)
- Developing Healthcare Language Model Embedding Spaces [0.20971479389679337]
Pre-trained Large Language Models (LLMs) often struggle on out-of-domain datasets like healthcare focused text.
Three methods are assessed: traditional masked language modeling, Deep Contrastive Learning for Unsupervised Textual Representations (DeCLUTR) and a novel pre-training objective utilizing metadata categories from the healthcare settings.
Contrastively trained models outperform other approaches on the classification tasks, delivering strong performance from limited labeled data and with fewer model parameter updates required.
arXiv Detail & Related papers (2024-03-28T19:31:32Z)
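DeCLUTR treats spans drawn from the same document as positive pairs; the loss below is a generic in-batch contrastive (InfoNCE-style) objective of the kind such training relies on, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(anchor_emb: torch.Tensor,
                              positive_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """anchor_emb[i] and positive_emb[i] come from the same document
    (e.g. two spans); every other row in the batch acts as a negative."""
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    logits = a @ p.t() / temperature               # (batch, batch) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)
```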
- DB-LLM: Accurate Dual-Binarization for Efficient LLMs [83.70686728471547]
Large language models (LLMs) have significantly advanced the field of natural language processing.
Existing ultra-low-bit quantization always causes severe accuracy drops.
We propose a novel Dual-Binarization method for LLMs, namely DB-LLM.
arXiv Detail & Related papers (2024-02-19T09:04:30Z)
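DB-LLM's full method involves more machinery than this, but the basic idea of representing a weight matrix with two scaled sign matrices can be sketched as a greedy residual fit; the fitting rule below is a common generic choice, not the paper's.

```python
import numpy as np

def dual_binarize(w: np.ndarray):
    """Greedy two-term binary decomposition: w ~ a1*b1 + a2*b2 with
    b1, b2 in {-1, +1}. The second term binarizes the residual of the first."""
    b1 = np.sign(w); b1[b1 == 0] = 1
    a1 = np.abs(w).mean()
    r = w - a1 * b1
    b2 = np.sign(r); b2[b2 == 0] = 1
    a2 = np.abs(r).mean()
    return a1, b1, a2, b2

# Example: the two-term reconstruction error is lower than a single
# binarization, which is the motivation for dual-binary representations.
w = np.random.randn(4, 4).astype(np.float32)
a1, b1, a2, b2 = dual_binarize(w)
print(np.abs(w - (a1 * b1 + a2 * b2)).mean())
```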
- Improving Small Language Models on PubMedQA via Generative Data Augmentation [4.96649519549027]
Large Language Models (LLMs) have made remarkable advancements in the field of natural language processing.
Small Language Models (SLMs) are known for their efficiency, but they often struggle with limited capacity and training data.
We introduce a novel method aimed at improving SLMs in the medical domain using LLM-based generative data augmentation.
arXiv Detail & Related papers (2023-05-12T23:49:23Z)
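A hedged sketch of the augmentation loop the summary describes: prompt a large model to produce extra PubMedQA-style question/answer pairs from unlabeled abstracts, then mix them into the small model's fine-tuning data. The prompt wording and the `generate` callable are placeholders for whatever LLM interface is used.

```python
import json
from typing import Callable, List

PROMPT_TEMPLATE = (
    "You are given a biomedical abstract. Write one yes/no/maybe research "
    "question it answers, then the answer.\n\nAbstract: {abstract}\n"
    "Return JSON with keys 'question' and 'answer'."
)

def augment_pubmedqa(abstracts: List[str], generate: Callable[[str], str]) -> List[dict]:
    """Build synthetic QA pairs from unlabeled abstracts using an LLM.
    `generate` is a placeholder for any text-generation API."""
    synthetic = []
    for abstract in abstracts:
        raw = generate(PROMPT_TEMPLATE.format(abstract=abstract))
        try:
            item = json.loads(raw)
            synthetic.append({"context": abstract,
                              "question": item["question"],
                              "answer": item["answer"]})
        except (json.JSONDecodeError, KeyError):
            continue  # skip malformed generations
    return synthetic

# The synthetic pairs are then mixed with the original labeled data when
# fine-tuning the small model.
```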
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.