BudgetLongformer: Can we Cheaply Pretrain a SotA Legal Language Model From Scratch?
- URL: http://arxiv.org/abs/2211.17135v1
- Date: Wed, 30 Nov 2022 16:09:20 GMT
- Title: BudgetLongformer: Can we Cheaply Pretrain a SotA Legal Language Model From Scratch?
- Authors: Joel Niklaus, Daniele Giofré
- Abstract summary: We train Longformer models with the efficient RTD task on legal data to showcase that pretraining efficient LMs is possible using much less compute.
We find that both the small and base models outperform their baselines on the in-domain BillSum and out-of-domain PubMed tasks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Pretrained transformer models have achieved state-of-the-art results in many
tasks and benchmarks recently. Many state-of-the-art Language Models (LMs),
however, do not scale well above the threshold of 512 input tokens. In
specialized domains though (such as legal, scientific or biomedical), models
often need to process very long text (sometimes well above 10000 tokens). Even
though many efficient transformers have been proposed (such as Longformer,
BigBird or FNet), so far, only very few such efficient models are available for
specialized domains. Additionally, since the pretraining process is extremely
costly in general - but even more so as the sequence length increases - it is
often only in reach of large research labs. One way of making pretraining
cheaper is the Replaced Token Detection (RTD) task, which provides more signal
during training because the loss is computed over all tokens rather than only
the masked ones. In this work,
we train Longformer models with the efficient RTD task on legal data to
showcase that pretraining efficient LMs is possible using much less compute. We
evaluate the trained models on challenging summarization tasks that require the
model to summarize long texts, to show to what extent they can achieve good
downstream performance. We find that both the small and base
models outperform their baselines on the in-domain BillSum and out-of-domain
PubMed tasks in their respective parameter range. We publish our code and
models for research purposes.
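To make the RTD objective concrete, the sketch below (a minimal illustration, not the paper's code) contrasts it with masked language modeling: MLM only receives a loss from the roughly 15% of positions that were masked, while an ELECTRA-style discriminator is trained on every position. It assumes PyTorch; all tensors, shapes, and the mask rate are placeholders.

```python
# Minimal sketch contrasting the MLM and RTD losses (illustrative, not the
# paper's code); assumes PyTorch, with random tensors standing in for model
# outputs and data.
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 2, 8, 100

# MLM: the loss is computed only over the ~15% of positions that were masked.
mlm_logits = torch.randn(batch, seq_len, vocab)      # generator predictions
labels = torch.randint(0, vocab, (batch, seq_len))   # original token ids
masked = torch.rand(batch, seq_len) < 0.15           # which positions were masked
masked[:, 0] = True                                  # ensure at least one mask per example
mlm_targets = labels.masked_fill(~masked, -100)      # -100 is ignored by cross_entropy
mlm_loss = F.cross_entropy(mlm_logits.view(-1, vocab), mlm_targets.view(-1))

# RTD (ELECTRA-style): a discriminator predicts, for every position, whether
# the token was replaced by the generator, so all tokens contribute signal.
rtd_logits = torch.randn(batch, seq_len)             # one logit per token
is_replaced = masked.float()                         # 1.0 where a token was substituted
rtd_loss = F.binary_cross_entropy_with_logits(rtd_logits, is_replaced)

print(f"MLM loss: {mlm_loss.item():.3f}  RTD loss: {rtd_loss.item():.3f}")
```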
Related papers
- Language models scale reliably with over-training and on downstream tasks [121.69867718185125]
Scaling laws are useful guides for derisking expensive training runs.
However, there remain gaps between current studies and how language models are trained.
Moreover, scaling laws mostly predict loss, but models are usually compared on downstream task performance.
arXiv Detail & Related papers (2024-03-13T13:54:00Z)
- Continual Pre-Training of Large Language Models: How to (re)warm your model? [21.8468835868142]
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to restart the process over again once new data becomes available.
We study the warmup phase of models pretrained on the Pile (upstream data, 300B tokens) as we continue to pre-train on SlimPajama (downstream data, 297B tokens).
Our results show that while re-warming models first increases the loss on upstream and downstream data, in the longer run it improves the downstream performance, outperforming models trained from scratch, even for a large downstream dataset.
arXiv Detail & Related papers (2023-08-08T03:18:18Z)
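As a rough illustration of what re-warming means in practice, the sketch below restarts a linear warmup followed by cosine decay when pre-training continues on a new corpus. It assumes PyTorch; the peak learning rate, step counts, and minimum ratio are placeholders, not the paper's exact schedule.

```python
# Minimal sketch of re-warming the learning rate when continuing pre-training
# on new data (assumed schedule shape and hyperparameters, not the paper's).
import math
import torch

model = torch.nn.Linear(16, 16)                        # placeholder model
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)   # lr is the peak learning rate

warmup_steps, total_steps, min_ratio = 1_000, 100_000, 0.1

def rewarm_then_cosine(step):
    # Linear re-warmup from 0 to the peak lr, then cosine decay to min_ratio * peak.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_ratio + 0.5 * (1 - min_ratio) * (1 + math.cos(math.pi * progress))

sched = torch.optim.lr_scheduler.LambdaLR(opt, rewarm_then_cosine)
# In the continued pre-training loop: loss.backward(); opt.step(); sched.step()
```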
- "Medium" LMs of Code in the Era of LLMs: Lessons From StackOverflow [5.036273913335737]
We train two models: SOBertBase, with 109M parameters, and SOBertLarge, with 762M parameters, at a budget of just $187 and $800, respectively.
Results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source general-purpose models.
arXiv Detail & Related papers (2023-06-05T21:38:30Z)
arXiv Detail & Related papers (2023-06-05T21:38:30Z) - nanoLM: an Affordable LLM Pre-training Benchmark via Accurate Loss Prediction across Scales [65.01417261415833]
We present an approach to predict the pre-training loss based on our observations that Maximal Update Parametrization (μP) enables accurate fitting of scaling laws.
With around 14% of the one-time pre-training cost, we can accurately forecast the loss for models up to 52B.
Our goal with nanoLM is to empower researchers with limited resources to reach meaningful conclusions on large models.
arXiv Detail & Related papers (2023-04-14T00:45:01Z)
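The underlying idea of forecasting a large model's loss from cheap small-scale runs can be illustrated with a standard power-law fit. The parameter counts and losses below are invented, and this is only a generic scaling-law fit, not nanoLM's actual procedure or its μP machinery.

```python
# Minimal sketch: fit L(N) = a * N^(-alpha) + c to small-model runs and
# extrapolate to a larger model (hypothetical numbers, not nanoLM data).
import numpy as np
from scipy.optimize import curve_fit

n_params = np.array([1e7, 3e7, 1e8, 3e8, 1e9])        # parameter counts of cheap runs
losses   = np.array([4.10, 3.72, 3.41, 3.18, 2.99])   # their final pre-training losses

def power_law(n, a, alpha, c):
    return a * n ** (-alpha) + c

popt, _ = curve_fit(power_law, n_params, losses, p0=[10.0, 0.1, 2.0], maxfev=10_000)
print("forecast loss at 52B parameters:", power_law(52e9, *popt))
```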
- $\Delta$-Patching: A Framework for Rapid Adaptation of Pre-trained Convolutional Networks without Base Performance Loss [71.46601663956521]
Models pre-trained on large-scale datasets are often fine-tuned to support newer tasks and datasets that arrive over time.
We propose $\Delta$-Patching for fine-tuning neural network models in an efficient manner, without the need to store model copies.
Our experiments show that $\Delta$-Networks outperform earlier model patching work while only requiring a fraction of parameters to be trained.
arXiv Detail & Related papers (2023-03-26T16:39:44Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, augmenting Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
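A minimal sketch of the recipe described above: freeze the language model and train only a linear projection of the perceptual features plus a single prepended token. The toy encoder, dimensions, and dummy inputs are assumptions for illustration, not the eP-ALM architecture; it assumes PyTorch.

```python
# Minimal sketch: frozen LM, one trainable projection, one trainable token
# (illustrative stand-in, not the eP-ALM implementation).
import torch
import torch.nn as nn

d_vis, d_model = 768, 1024

# Toy stand-in for a pretrained language model; every parameter is frozen.
lm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
for p in lm.parameters():
    p.requires_grad = False

proj = nn.Linear(d_vis, d_model)                        # trainable visual projection
soft_token = nn.Parameter(torch.zeros(1, 1, d_model))   # one trainable prepended token

vis_feats = torch.randn(4, 16, d_vis)     # dummy perceptual (e.g. image) features
text_embs = torch.randn(4, 32, d_model)   # dummy frozen token embeddings

inputs = torch.cat([soft_token.expand(4, -1, -1), proj(vis_feats), text_embs], dim=1)
out = lm(inputs)   # gradients only reach proj and soft_token
```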
- BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling [0.0]
Common Crawl might contain enough noise to make pre-training on it sub-optimal.
We present a novel data-centric technique which enables the pre-training of language models in roughly half the amount of steps.
Our work is proof of the versatility of Transformers, and paves the way for small teams to train their models on a limited budget.
arXiv Detail & Related papers (2022-07-14T10:48:42Z)
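As an illustration of perplexity sampling, the sketch below keeps each document with a probability that peaks around the median log-perplexity, down-weighting both very low-perplexity boilerplate and very high-perplexity noise. The weighting function and numbers are assumptions, not BERTIN's exact implementation.

```python
# Minimal sketch of perplexity-based subsampling with a Gaussian weighting
# around the median log-perplexity (illustrative, not BERTIN's exact sampler).
import numpy as np

rng = np.random.default_rng(0)
perplexities = rng.lognormal(mean=4.0, sigma=0.8, size=10_000)  # stand-in per-document ppl

log_ppl = np.log(perplexities)
center, width = np.median(log_ppl), np.std(log_ppl)
keep_prob = np.exp(-0.5 * ((log_ppl - center) / width) ** 2)    # favour mid-perplexity docs

keep = rng.random(len(perplexities)) < keep_prob
print(f"kept {keep.mean():.0%} of documents")
```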
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of almost half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
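The flavour of reusing a smaller model can be shown with a Net2Net-style, function-preserving width expansion, a simpler relative of the initialization that bert2BERT builds on. The toy two-layer example below is an assumption for illustration, not the bert2BERT algorithm; it assumes PyTorch.

```python
# Minimal sketch of function-preserving width expansion (Net2Net-style toy
# example, not the bert2BERT method): duplicate hidden units and halve the
# outgoing weights so the widened network computes the same function.
import torch

def widen_pair(w1, b1, w2):
    w1_new = torch.cat([w1, w1], dim=0)        # (2*hidden, in): duplicate units
    b1_new = torch.cat([b1, b1], dim=0)        # (2*hidden,)
    w2_new = torch.cat([w2, w2], dim=1) / 2.0  # (out, 2*hidden): halve outgoing weights
    return w1_new, b1_new, w2_new

hidden, inp, out = 4, 3, 5
w1, b1, w2 = torch.randn(hidden, inp), torch.randn(hidden), torch.randn(out, hidden)
x = torch.randn(2, inp)

y_small = torch.relu(x @ w1.T + b1) @ w2.T
w1n, b1n, w2n = widen_pair(w1, b1, w2)
y_big = torch.relu(x @ w1n.T + b1n) @ w2n.T
assert torch.allclose(y_small, y_big, atol=1e-5)   # the widened model matches exactly
```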
- ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators [108.3381301768299]
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens.
We propose a more sample-efficient pre-training task called replaced token detection.
arXiv Detail & Related papers (2020-03-23T21:17:42Z)