"Medium" LMs of Code in the Era of LLMs: Lessons From StackOverflow
- URL: http://arxiv.org/abs/2306.03268v2
- Date: Wed, 24 Jan 2024 07:53:30 GMT
- Title: "Medium" LMs of Code in the Era of LLMs: Lessons From StackOverflow
- Authors: Manisha Mukherjee, Vincent J. Hellendoorn
- Abstract summary: We train two models: SOBertBase, with 109M parameters, and SOBertLarge with 762M parameters, at a budget of just $187 and $800, respectively.
Results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source general-purpose models.
- Score: 5.036273913335737
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large pre-trained neural language models have brought immense progress to
both NLP and software engineering. Models in OpenAI's GPT series now dwarf
Google's BERT and Meta's RoBERTa, which previously set new benchmarks on a wide
range of NLP applications. These models are trained on massive corpora of
heterogeneous data from web crawls, which enables them to learn general
language patterns and semantic relationships. However, the largest models are
both expensive to train and deploy and are often closed-source, so we lack
access to their data and design decisions. We argue that this trend towards
large, general-purpose models should be complemented with single-purpose, more
modestly sized pre-trained models. In this work, we take StackOverflow (SO) as
a domain example in which large volumes of rich aligned code and text data are
available. We adopt standard practices for pre-training large language models,
including using a very large context size (2,048 tokens), batch size (0.5M
tokens) and training set (27B tokens), coupled with a powerful toolkit
(Megatron-LM), to train two models: SOBertBase, with 109M parameters, and
SOBertLarge with 762M parameters, at a budget of just $187 and $800, respectively.
We compare the performance of our models with both the previous SOTA model
trained exclusively on SO data, as well as general-purpose BERT models and OpenAI's
ChatGPT on four SO-specific downstream tasks - question quality prediction,
closed question prediction, named entity recognition and obsoletion prediction
(a new task we introduce). Not only do our models consistently outperform all
baselines, but the smaller model is often sufficient for strong results. Both
models are released to the public. These results demonstrate that pre-training
both extensively and properly on in-domain data can yield a powerful and
affordable alternative to leveraging closed-source general-purpose models.
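The headline numbers in the abstract fit together with simple arithmetic: a 0.5M-token batch corresponds to roughly 256 sequences of 2,048 tokens, and a single pass over the 27B-token training set then takes on the order of 50,000 optimizer steps. The sketch below reproduces that arithmetic and estimates the parameter count of a BERT-style encoder; the vocabulary size, hidden size, layer count, and 256-sequence batch are assumptions chosen to land near the reported figures, not the authors' published configuration.

```python
# Back-of-the-envelope sketch for the pre-training setup described in the
# abstract. Only the 2,048-token context, the ~0.5M-token batch, and the
# 27B-token corpus come from the paper; all architectural dimensions below
# are ASSUMPTIONS chosen to land near the reported 109M parameter count.

SEQ_LEN = 2048            # context size (from the abstract)
CORPUS_TOKENS = 27e9      # training set size in tokens (from the abstract)
SEQS_PER_BATCH = 256      # assumed; 256 * 2048 = 524,288 tokens ~= the 0.5M-token batch

def bert_style_params(vocab_size, hidden, layers, max_positions, ffn_mult=4):
    """Rough parameter count for a BERT-style encoder (embeddings + blocks)."""
    embeddings = (vocab_size + max_positions + 2) * hidden + 2 * hidden  # word/position/type + LayerNorm
    attention = 4 * (hidden * hidden + hidden)          # Q, K, V and output projections with biases
    ffn = 2 * ffn_mult * hidden * hidden + ffn_mult * hidden + hidden  # up/down projections with biases
    layer_norms = 2 * 2 * hidden                        # two LayerNorms per transformer block
    return embeddings + layers * (attention + ffn + layer_norms)

# Hypothetical BERT-base-like dimensions with the 2,048-token context.
base_params = bert_style_params(vocab_size=30_000, hidden=768,
                                layers=12, max_positions=SEQ_LEN)
print(f"~{base_params / 1e6:.0f}M parameters (reported: 109M for SOBertBase)")

tokens_per_batch = SEQS_PER_BATCH * SEQ_LEN
steps_per_pass = CORPUS_TOKENS / tokens_per_batch
print(f"{tokens_per_batch:,} tokens per batch, "
      f"~{steps_per_pass:,.0f} steps per pass over the corpus")
```

Plugging a larger hidden size and depth into the same formula lands in the same ballpark as the reported 762M for SOBertLarge, whose exact dimensions the abstract does not state.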
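Since both checkpoints are released, the four downstream tasks reduce to standard encoder fine-tuning. The sketch below shows how question quality prediction could be framed as sequence classification with the Hugging Face transformers API; the checkpoint path, the three-way label set, and the example question are placeholders for illustration, not the authors' released artifacts or evaluation protocol.

```python
# Hedged sketch: framing question quality prediction as sequence
# classification on top of a released SOBert-style encoder. The checkpoint
# path, label count, and example input are PLACEHOLDERS, not the authors'
# actual release or experimental setup.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "path/to/SOBertBase"  # replace with the released checkpoint
NUM_LABELS = 3                     # assumed label set, e.g. high / medium / low quality

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT, num_labels=NUM_LABELS)  # classification head is freshly initialized

# StackOverflow questions mix prose and code, which is the aligned
# text + code signal the in-domain pre-training is meant to capture.
question = (
    "Why does my list comprehension raise a NameError?\n"
    "squares = [x * x for x in rang(10)]"
)
batch = tokenizer(question, truncation=True, max_length=2048,
                  return_tensors="pt")

with torch.no_grad():
    logits = model(**batch).logits   # shape: (1, NUM_LABELS)
print("predicted quality class:", logits.argmax(dim=-1).item())
```

In practice the randomly initialized classification head would first be fine-tuned on labeled SO questions (for example with the transformers Trainer) before the prediction above is meaningful; the same framing carries over to closed question prediction, while a token classification head would cover named entity recognition.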
Related papers
- nanoLM: an Affordable LLM Pre-training Benchmark via Accurate Loss Prediction across Scales [65.01417261415833]
We present an approach to predict the pre-training loss based on our observations that Maximal Update Parametrization (muP) enables accurate fitting of scaling laws.
With around 14% of the one-time pre-training cost, we can accurately forecast the loss for models of up to 52B parameters.
Our goal with nanoLM is to empower researchers with limited resources to reach meaningful conclusions on large models.
arXiv Detail & Related papers (2023-04-14T00:45:01Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models and to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey [66.18478838828231]
Multi-modal pre-trained big models have drawn more and more attention in recent years.
This paper introduces the background of multi-modal pre-training by reviewing conventional deep pre-training work in natural language processing, computer vision, and speech.
Then, we introduce the task definition, key challenges, and advantages of multi-modal pre-training models (MM-PTMs), and discuss the MM-PTMs with a focus on data, objectives, network, and knowledge enhanced pre-training.
arXiv Detail & Related papers (2023-02-20T15:34:03Z) - Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
However, the data used to fine-tune these models is often unavailable, which creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space.
arXiv Detail & Related papers (2022-12-19T20:46:43Z) - Synthetic Model Combination: An Instance-wise Approach to Unsupervised
Ensemble Learning [92.89846887298852]
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data,
given access only to a set of expert models and their predictions alongside some limited information about the dataset used to train them.
arXiv Detail & Related papers (2022-10-11T10:20:31Z) - Revealing Secrets From Pre-trained Models [2.0249686991196123]
Transfer-learning has been widely adopted in many emerging deep learning algorithms.
We show that pre-trained models and the models fine-tuned from them have highly similar weight values.
We propose a new model extraction attack that reveals the model architecture and the pre-trained model used by the black-box victim model.
arXiv Detail & Related papers (2022-07-19T20:19:03Z) - bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of almost half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z) - It's the Best Only When It Fits You Most: Finding Related Models for
Serving Based on Dynamic Locality Sensitive Hashing [1.581913948762905]
Preparation of training data is often a bottleneck in the lifecycle of deploying a deep learning model for production or research.
This paper proposes an end-to-end process for finding related models for serving, based on the similarity between the target dataset and the training datasets of the available models.
arXiv Detail & Related papers (2020-10-13T22:52:13Z) - A Comparison of LSTM and BERT for Small Corpus [0.0]
Recent advancements in the NLP field showed that transfer learning helps with achieving state-of-the-art results for new tasks by tuning pre-trained models instead of starting from scratch.
In this paper we focus on a real-life scenario that scientists in academia and industry face frequently: given a small dataset, can we use a large pre-trained model like BERT and get better results than simple models?
Our experimental results show that bidirectional LSTM models can achieve significantly better results than a BERT model on a small dataset, and that these simple models train in much less time than fine-tuning the pre-trained counterparts.
arXiv Detail & Related papers (2020-09-11T14:01:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.