Efficient Continual Pre-training for Building Domain Specific Large
Language Models
- URL: http://arxiv.org/abs/2311.08545v1
- Date: Tue, 14 Nov 2023 21:19:14 GMT
- Authors: Yong Xie, Karan Aggarwal, Aitzaz Ahmad
- Abstract summary: Large language models (LLMs) have demonstrated remarkable open-domain capabilities.
Traditionally, LLMs tailored for a domain are trained from scratch to excel at handling domain-specific tasks.
We introduce FinPythia-6.9B, developed through domain-adaptive continual pre-training on the financial domain.
- Score: 8.799785664150255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have demonstrated remarkable open-domain
capabilities. Traditionally, LLMs tailored for a domain are trained from
scratch to excel at handling domain-specific tasks. In this work, we explore an
alternative strategy of continual pre-training as a means to develop
domain-specific LLMs. We introduce FinPythia-6.9B, developed through
domain-adaptive continual pre-training on the financial domain. Continual
pre-trained FinPythia showcases consistent improvements on financial tasks over
the original foundational model. We further explore simple but effective data
selection strategies for continual pre-training. Our data selection strategies
outperform vanilla continual pre-training with just 10% of the corpus size and
cost, without any degradation on open-domain standard tasks. Our work thus
offers a cost-effective alternative to training domain-specific LLMs from
scratch.
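To make the data selection idea concrete, below is a minimal sketch of task-aware subset selection for continual pre-training. The scoring rule (token-overlap similarity to a few seed task examples) and the selection budget are illustrative assumptions, not the authors' exact criteria; the selected subset would then feed an ordinary causal-language-modeling continual pre-training run.

```python
# Minimal sketch of task-aware data selection for continual pre-training.
# The token-overlap scoring and the 10% budget are illustrative assumptions,
# not the method described in the paper.

from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase whitespace tokenization; a stand-in for a real tokenizer."""
    return set(text.lower().split())


def similarity(doc: str, seed_sets: List[Set[str]]) -> float:
    """Average Jaccard overlap between a document and the seed task examples."""
    doc_tokens = tokenize(doc)
    if not doc_tokens:
        return 0.0
    scores = [len(doc_tokens & s) / len(doc_tokens | s) for s in seed_sets]
    return sum(scores) / len(scores)


def select_subset(corpus: List[str], seed_examples: List[str],
                  budget: float = 0.10) -> List[str]:
    """Keep the highest-scoring `budget` fraction of the domain corpus."""
    seed_sets = [tokenize(s) for s in seed_examples]
    ranked = sorted(corpus, key=lambda d: similarity(d, seed_sets), reverse=True)
    k = max(1, int(len(ranked) * budget))
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical seed task examples and domain corpus for illustration only.
    seed_examples = ["What drove the quarterly revenue growth?",
                     "Summarize the risk factors in the 10-K filing."]
    corpus = ["The central bank raised interest rates by 25 basis points.",
              "A recipe for sourdough bread with a long fermentation.",
              "Quarterly revenue grew 12% on strong trading income.",
              "Risk factors include credit exposure and market volatility."]
    for doc in select_subset(corpus, seed_examples, budget=0.5):
        print(doc)
```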
Related papers
- Demystifying Domain-adaptive Post-training for Financial LLMs [79.581577578952]
FINDAP is a systematic and fine-grained investigation into domain-adaptive post-training of large language models (LLMs).
Our approach consists of four key components: FinCap, FinRec, FinTrain and FinEval.
The resulting model, Llama-Fin, achieves state-of-the-art performance across a wide range of financial tasks.
arXiv Detail & Related papers (2025-01-09T04:26:15Z)
- Exploring Language Model Generalization in Low-Resource Extractive QA [57.14068405860034]
We investigate Extractive Question Answering (EQA) with Large Language Models (LLMs) under domain drift.
We devise a series of experiments to explain the performance gap empirically.
arXiv Detail & Related papers (2024-09-27T05:06:43Z)
- BLADE: Enhancing Black-box Large Language Models with Small Domain-Specific Models [56.89958793648104]
Large Language Models (LLMs) are versatile and capable of addressing a diverse range of tasks.
Previous approaches either conduct continuous pre-training with domain-specific data or employ retrieval augmentation to support general LLMs.
We present a novel framework named BLADE, which enhances Black-box LArge language models with small Domain-spEcific models.
arXiv Detail & Related papers (2024-03-27T08:57:21Z)
- Investigating Continual Pretraining in Large Language Models: Insights and Implications [9.660013084324817]
Continual learning in large language models (LLMs) is an evolving domain that focuses on developing efficient and sustainable training strategies.
We introduce a new benchmark designed to measure the adaptability of LLMs to changing pretraining data landscapes.
Our findings uncover several key insights: (i) continual pretraining consistently improves the <1.5B models studied in this work and is also superior to domain adaptation, (ii) larger models always achieve better perplexity than smaller ones when continually pretrained on the same corpus, (iii) smaller models are particularly sensitive to continual pretraining, showing the most significant rates of both learning and forgetting.
arXiv Detail & Related papers (2024-02-27T10:47:24Z)
- EcomGPT-CT: Continual Pre-training of E-commerce Large Language Models with Semi-structured Data [67.8302955948861]
Large Language Models (LLMs) pre-trained on massive corpora have exhibited remarkable performance on various NLP tasks.
Applying these models to specific domains still poses significant challenges, such as lack of domain knowledge.
We focus on domain-specific continual pre-training of LLMs using E-commerce domain as an exemplar.
arXiv Detail & Related papers (2023-12-25T11:31:47Z)
- KALA: Knowledge-Augmented Language Model Adaptation [65.92457495576141]
We propose a novel domain adaptation framework for pre-trained language models (PLMs).
Knowledge-Augmented Language model Adaptation (KALA) modulates the intermediate hidden representations of PLMs with domain knowledge.
Results show that, despite being computationally efficient, our KALA largely outperforms adaptive pre-training.
arXiv Detail & Related papers (2022-04-22T08:11:59Z)
- Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains [45.07506437436464]
We present a general approach to developing small, fast and effective pre-trained models for specific domains.
This is achieved by adapting the off-the-shelf general pre-trained models and performing task-agnostic knowledge distillation in target domains.
arXiv Detail & Related papers (2021-06-25T07:37:05Z)
- Don't Stop Pretraining: Adapt Language Models to Domains and Tasks [81.99843216550306]
We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks.
A second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains.
Adapting to the task's unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining.
arXiv Detail & Related papers (2020-04-23T04:21:19Z)