A Comparison of LSTM and BERT for Small Corpus
- URL: http://arxiv.org/abs/2009.05451v1
- Date: Fri, 11 Sep 2020 14:01:14 GMT
- Title: A Comparison of LSTM and BERT for Small Corpus
- Authors: Aysu Ezen-Can
- Abstract summary: Recent advances in NLP have shown that transfer learning helps achieve state-of-the-art results on new tasks by fine-tuning pre-trained models instead of training from scratch.
In this paper we focus on a real-life scenario that scientists in academia and industry face frequently: given a small dataset, can we use a large pre-trained model like BERT and get better results than simple models?
Our experimental results show that bidirectional LSTM models can achieve significantly better results than a BERT model on a small dataset, and that these simple models train in much less time than fine-tuning the pre-trained counterparts.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in NLP have shown that transfer learning helps achieve
state-of-the-art results on new tasks by fine-tuning pre-trained models instead of
training from scratch. Transformers have driven significant improvements, setting new
state-of-the-art results for many NLP tasks including, but not limited to, text
classification, text generation, and sequence labeling. Most of these success stories
were based on large datasets. In this paper we focus on a real-life scenario that
scientists in academia and industry face frequently: given a small dataset, can we use
a large pre-trained model like BERT and get better results than simple models? To
answer this question, we use a small intent-classification dataset collected for
building chatbots and compare the performance of a simple bidirectional LSTM model
with that of a pre-trained BERT model. Our experimental results show that
bidirectional LSTM models can achieve significantly better results than a BERT model
on a small dataset, and that these simple models train in much less time than
fine-tuning the pre-trained counterparts. We conclude that the performance of a model
depends on the task and the data; these factors should therefore be weighed before
making a model choice, rather than defaulting to the most popular model.
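To make the comparison concrete, below is a minimal sketch pairing a small bidirectional LSTM trained from scratch against a fine-tuned bert-base-uncased classifier on a toy intent dataset. The utterances, label set, layer sizes, and training settings are illustrative assumptions, not the chatbot corpus or hyperparameters used in the paper.

```python
# Minimal sketch of the comparison described in the abstract: a small
# bidirectional LSTM trained from scratch vs. fine-tuning a pre-trained BERT
# classifier. All data and hyperparameters below are illustrative placeholders.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Toy intent-classification data (placeholder, not the paper's chatbot corpus).
texts = [
    "book a flight to boston",
    "what is the weather tomorrow",
    "play some jazz music",
    "cancel my hotel reservation",
]
labels = np.array([0, 1, 2, 0])  # hypothetical intents: 0=travel, 1=weather, 2=music
num_classes = 3

# --- Simple bidirectional LSTM trained from scratch ---
vectorizer = layers.TextVectorization(max_tokens=1000, output_sequence_length=16)
vectorizer.adapt(texts)
x_train = vectorizer(texts)  # integer token ids, shape (num_examples, 16)

bilstm = models.Sequential([
    layers.Embedding(input_dim=1000, output_dim=64, mask_zero=True),
    layers.Bidirectional(layers.LSTM(64)),           # bidirectional recurrent encoder
    layers.Dense(num_classes, activation="softmax"),
])
bilstm.compile(optimizer="adam",
               loss="sparse_categorical_crossentropy",
               metrics=["accuracy"])
bilstm.fit(x_train, labels, epochs=5, verbose=0)

# --- Fine-tuning a pre-trained BERT classifier (Hugging Face Transformers) ---
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = TFAutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=num_classes)

encodings = dict(tokenizer(texts, padding=True, truncation=True, return_tensors="np"))
# Transformers TF models compute their task loss internally, so no loss is passed here.
bert.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5))
bert.fit(encodings, labels, epochs=3, verbose=0)
```

On a corpus of this size, the LSTM has on the order of tens of thousands of parameters and trains in seconds, whereas fine-tuning BERT updates roughly 110 million parameters; this difference in training cost is part of the trade-off the abstract describes.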
Related papers
- Data Diet: Can Trimming PET/CT Datasets Enhance Lesion Segmentation? [70.38903555729081]
We describe our approach to competing in the autoPET3 datacentric track.
We find that in the autoPETIII dataset, a model that is trained on the entire dataset exhibits undesirable characteristics.
We counteract this by removing the easiest samples from the training dataset as measured by the model loss before retraining from scratch.
arXiv Detail & Related papers (2024-09-20T14:47:58Z)
- Enabling Small Models for Zero-Shot Classification through Model Label Learning [50.68074833512999]
We introduce a novel paradigm, Model Label Learning (MLL), which bridges the gap between models and their functionalities.
Experiments on seven real-world datasets validate the effectiveness and efficiency of MLL.
arXiv Detail & Related papers (2024-08-21T09:08:26Z)
- A synthetic data approach for domain generalization of NLI models [13.840374911669167]
Natural Language Inference (NLI) remains an important benchmark task for LLMs.
We show that synthetic high-quality datasets can adapt NLI models for zero-shot use in downstream applications.
We show that models trained on this data have the best generalization to completely new downstream test settings.
arXiv Detail & Related papers (2024-02-19T18:55:16Z)
- Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
This creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space (a naive weight-averaging sketch appears after this list).
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
- Revealing Secrets From Pre-trained Models [2.0249686991196123]
Transfer learning has been widely adopted in many emerging deep learning algorithms.
We show that pre-trained models and their fine-tuned counterparts have highly similar weight values.
We propose a new model extraction attack that reveals the model architecture and the pre-trained model used by the black-box victim model.
arXiv Detail & Related papers (2022-07-19T20:19:03Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals [151.3601429216877]
We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model.
We propose a recipe, namely "Model generated dEnoising TRaining Objective" (METRO).
The resultant models, METRO-LM, consisting of up to 5.4 billion parameters, achieve new state-of-the-art results on the GLUE, SuperGLUE, and SQuAD benchmarks.
arXiv Detail & Related papers (2022-04-13T21:39:15Z)
- Short-answer scoring with ensembles of pretrained language models [0.0]
We fine-tune a collection of popular small, base, and large pretrained transformer-based language models.
We train one feature-based model on the dataset with the aim of testing ensembles of these models.
We observe that the larger models generally perform slightly better; however, they still fall short of state-of-the-art results on their own.
arXiv Detail & Related papers (2022-02-23T15:12:20Z)
- A Comparative Study of Transformer-Based Language Models on Extractive Question Answering [0.5079811885340514]
We train various pre-trained language models and fine-tune them on multiple question answering datasets.
Using the F1-score as our metric, we find that the RoBERTa and BART pre-trained models perform the best across all datasets.
arXiv Detail & Related papers (2021-10-07T02:23:19Z)
- Multi-stage Pre-training over Simplified Multimodal Pre-training Models [35.644196343835674]
We propose a new Multi-stage Pre-training (MSP) method, which uses information at different granularities from word, phrase to sentence in both texts and images to pre-train the model in stages.
We also design several pre-training tasks suited to the information granularity of each stage, in order to efficiently capture diverse knowledge from a limited corpus.
Experimental results show that our method achieves comparable performance to the original LXMERT model on all downstream tasks, and even outperforms the original model on the image-text retrieval task.
arXiv Detail & Related papers (2021-07-22T03:35:27Z)
- Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
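As a side note on the "Dataless Knowledge Fusion by Merging Weights of Language Models" entry above: merging models "in their parameter space" can be illustrated, in its simplest form, by averaging the weights of identically structured checkpoints. The sketch below shows only that naive uniform-averaging baseline on hypothetical toy models; the paper itself proposes a more sophisticated merging scheme.

```python
# Naive parameter-space merging: uniform averaging of identically shaped
# checkpoints. This illustrates the general idea only; it is NOT the merging
# method proposed in the paper, and the models below are hypothetical toys.
import torch
import torch.nn as nn

def average_state_dicts(model_list):
    """Return a state dict whose tensors are the element-wise mean of the models' parameters."""
    merged = {}
    for key in model_list[0].state_dict():
        merged[key] = torch.stack(
            [m.state_dict()[key].float() for m in model_list]).mean(dim=0)
    return merged

def make_classifier():
    # In practice, the "fine-tuned" checkpoints being merged share this architecture.
    return nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 3))

model_a, model_b = make_classifier(), make_classifier()
fused = make_classifier()
fused.load_state_dict(average_state_dicts([model_a, model_b]))
```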