Efficient Continual Pre-training for Building Domain Specific Large
Language Models
- URL: http://arxiv.org/abs/2311.08545v1
- Date: Tue, 14 Nov 2023 21:19:14 GMT
- Title: Efficient Continual Pre-training for Building Domain Specific Large
Language Models
- Authors: Yong Xie, Karan Aggarwal, Aitzaz Ahmad
- Abstract summary: Large language models (LLMs) have demonstrated remarkable open-domain capabilities.
Traditionally, LLMs tailored for a domain are trained from scratch to excel at handling domain-specific tasks.
We introduce FinPythia-6.9B, developed through domain-adaptive continual pre-training on the financial domain.
- Score: 8.799785664150255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have demonstrated remarkable open-domain
capabilities. Traditionally, LLMs tailored for a domain are trained from
scratch to excel at handling domain-specific tasks. In this work, we explore an
alternative strategy of continual pre-training as a means to develop
domain-specific LLMs. We introduce FinPythia-6.9B, developed through
domain-adaptive continual pre-training on the financial domain. The continually
pre-trained FinPythia showcases consistent improvements on financial tasks over
the original foundational model. We further explore simple but effective data
selection strategies for continual pre-training. Our data selection strategies
outperform vanilla continual pre-training with just 10% of the corpus size and
cost, without any degradation on open-domain standard tasks. Our work thus
offers a cost-effective alternative to building domain-specific LLMs from
scratch.
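As a concrete illustration of the continual pre-training setup described above, the sketch below continues causal language-model training of the open Pythia-6.9B checkpoint on a domain corpus with Hugging Face Transformers. It is a minimal example, not the paper's recipe: the corpus path, context length, and hyperparameters are assumptions, and the paper's data selection step (training on roughly 10% of the corpus) would filter the dataset before this loop; the abstract does not specify the selection criterion.
```python
# Minimal sketch of domain-adaptive continual pre-training (illustrative only).
# Assumptions: a JSONL corpus with a "text" field and a single-process Trainer
# setup; in practice a 6.9B model needs multi-GPU sharding (e.g. DeepSpeed/FSDP).
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "EleutherAI/pythia-6.9b"   # open foundation model to continue training
CORPUS = "financial_corpus.jsonl"       # hypothetical domain corpus, one document per record

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
if tokenizer.pad_token is None:         # Pythia's tokenizer has no pad token by default
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

def tokenize(batch):
    # Standard causal-LM preprocessing: truncate documents to the context window.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

dataset = load_dataset("json", data_files=CORPUS, split="train")
dataset = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="finpythia-continual",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,             # small LR helps limit forgetting of open-domain skills
        num_train_epochs=1,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=dataset,
    # mlm=False yields next-token (causal) language-modeling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
A data selection strategy would simply replace `dataset` with the selected subset before constructing the Trainer; the training loop itself is unchanged.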
Related papers
- LLM Post-Training: A Deep Dive into Reasoning Large Language Models [131.10969986056]
Large Language Models (LLMs) have transformed the natural language processing landscape and brought to life diverse applications.
Post-training methods enable LLMs to refine their knowledge, improve reasoning, enhance factual accuracy, and align more effectively with user intents and ethical considerations.
arXiv Detail & Related papers (2025-02-28T18:59:54Z)
- Demystifying Domain-adaptive Post-training for Financial LLMs [79.581577578952]
FINDAP is a systematic and fine-grained investigation into domain-adaptive post-training of large language models (LLMs).
Our approach consists of four key components: FinCap, FinRec, FinTrain and FinEval.
The resulting model, Llama-Fin, achieves state-of-the-art performance across a wide range of financial tasks.
arXiv Detail & Related papers (2025-01-09T04:26:15Z)
- DoPAMine: Domain-specific Pre-training Adaptation from seed-guided data Mining [2.1534028009401713]
Large Language Models (LLMs) have shown the ability to generalize effectively across numerous industry domains.
However, LLMs exhibit limitations when applied to specialized or low-resource industry domains.
In this work, we propose an automated and scalable framework, DoPAMine: Domain-specific Pre-training Adaptation from seed-guided data Mining.
arXiv Detail & Related papers (2024-09-30T22:15:58Z)
- Exploring Language Model Generalization in Low-Resource Extractive QA [57.14068405860034]
We investigate Extractive Question Answering (EQA) with Large Language Models (LLMs) under domain drift.
We devise a series of experiments to empirically explain the performance gap.
arXiv Detail & Related papers (2024-09-27T05:06:43Z)
- BLADE: Enhancing Black-box Large Language Models with Small Domain-Specific Models [56.89958793648104]
Large Language Models (LLMs) are versatile and capable of addressing a diverse range of tasks.
Previous approaches either conduct continuous pre-training with domain-specific data or employ retrieval augmentation to support general LLMs.
We present a novel framework named BLADE, which enhances Black-box LArge language models with small Domain-spEcific models.
arXiv Detail & Related papers (2024-03-27T08:57:21Z)
- Investigating Continual Pretraining in Large Language Models: Insights and Implications [9.591223887442704]
This paper studies the evolving domain of Continual Learning in large language models (LLMs).
Our primary emphasis is on continual domain-adaptive pretraining, a process designed to equip LLMs with the ability to integrate new information from various domains.
We examine the impact of model size on learning efficacy and forgetting, as well as how the progression and similarity of emerging domains affect the knowledge transfer within these models.
arXiv Detail & Related papers (2024-02-27T10:47:24Z)
- EcomGPT-CT: Continual Pre-training of E-commerce Large Language Models with Semi-structured Data [67.8302955948861]
Large Language Models (LLMs) pre-trained on massive corpora have exhibited remarkable performance on various NLP tasks.
Applying these models to specific domains still poses significant challenges, such as a lack of domain knowledge.
We focus on domain-specific continual pre-training of LLMs, using the E-commerce domain as an exemplar.
arXiv Detail & Related papers (2023-12-25T11:31:47Z)
- KALA: Knowledge-Augmented Language Model Adaptation [65.92457495576141]
We propose a novel domain adaptation framework for pre-trained language models (PLMs).
Knowledge-Augmented Language model Adaptation (KALA) modulates the intermediate hidden representations of PLMs with domain knowledge.
Results show that, despite being computationally efficient, our KALA largely outperforms adaptive pre-training.
arXiv Detail & Related papers (2022-04-22T08:11:59Z)
- Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains [45.07506437436464]
We present a general approach to developing small, fast and effective pre-trained models for specific domains.
This is achieved by adapting the off-the-shelf general pre-trained models and performing task-agnostic knowledge distillation in target domains.
arXiv Detail & Related papers (2021-06-25T07:37:05Z)
- Source-Free Open Compound Domain Adaptation in Semantic Segmentation [99.82890571842603]
In SF-OCDA, only the source pre-trained model and the target data are available for learning the target model.
We propose Cross-Patch Style Swap (CPSS) to diversify samples with various patch styles at the feature level.
Our method produces state-of-the-art results on the C-Driving dataset.
arXiv Detail & Related papers (2021-06-07T08:38:41Z)
- Don't Stop Pretraining: Adapt Language Models to Domains and Tasks [81.99843216550306]
We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks.
A second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains.
Adapting to the task's unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining.
arXiv Detail & Related papers (2020-04-23T04:21:19Z)