Update Frequently, Update Fast: Retraining Semantic Parsing Systems in a
Fraction of Time
- URL: http://arxiv.org/abs/2010.07865v2
- Date: Mon, 22 Mar 2021 16:33:55 GMT
- Title: Update Frequently, Update Fast: Retraining Semantic Parsing Systems in a
Fraction of Time
- Authors: Vladislav Lialin, Rahul Goel, Andrey Simanovsky, Anna Rumshisky,
Rushin Shah
- Abstract summary: We show that it is possible to match the performance of a model trained from scratch in less than 10% of the time via fine-tuning.
We demonstrate the effectiveness of our method on multiple splits of the Facebook TOP and SNIPS datasets.
- Score: 11.035461657669096
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Currently used semantic parsing systems deployed in voice assistants can
require weeks to train. Datasets for these models often receive small and
frequent updates, data patches. Each patch requires training a new model. To
reduce training time, one can fine-tune the previously trained model on each
patch, but naive fine-tuning exhibits catastrophic forgetting - degradation of
the model performance on the data not represented in the data patch. In this
work, we propose a simple method that alleviates catastrophic forgetting and
show that it is possible to match the performance of a model trained from
scratch in less than 10% of the time via fine-tuning. The key to achieving this
is supersampling and EWC regularization. We demonstrate the effectiveness of
our method on multiple splits of the Facebook TOP and SNIPS datasets.
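The abstract names two ingredients: supersampling (mixing the small data patch with old training data and oversampling the patch) and EWC regularization (penalizing drift from the previously trained weights, weighted by a Fisher-information estimate). As a rough illustration of how those pieces could fit together, the PyTorch sketch below shows one possible fine-tuning loop. It is based only on the abstract, not on the authors' code; the oversampling factor, the EWC weight, and the simple diagonal-Fisher estimate are illustrative assumptions.

import torch
from torch import nn
from torch.utils.data import ConcatDataset, DataLoader

def estimate_fisher_diag(model, old_loader, loss_fn):
    # Rough diagonal Fisher estimate: average squared gradients on old data.
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    for x, y in old_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(old_loader), 1) for n, f in fisher.items()}

def ewc_penalty(model, old_params, fisher, lam):
    # EWC: quadratic penalty on drift from the old weights, scaled by Fisher.
    penalty = torch.zeros(())
    for n, p in model.named_parameters():
        penalty = penalty + (fisher[n] * (p - old_params[n]) ** 2).sum()
    return 0.5 * lam * penalty

def finetune_on_patch(model, old_data, patch_data, patch_oversample=8,
                      ewc_lambda=100.0, epochs=3, lr=1e-4):
    loss_fn = nn.CrossEntropyLoss()
    old_params = {n: p.detach().clone() for n, p in model.named_parameters()}
    fisher = estimate_fisher_diag(model, DataLoader(old_data, batch_size=32), loss_fn)

    # "Supersampling": repeat the small patch several times so it is not
    # drowned out by the old training data it is mixed with.
    mixed = ConcatDataset([old_data] + [patch_data] * patch_oversample)
    loader = DataLoader(mixed, batch_size=32, shuffle=True)

    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y) + ewc_penalty(model, old_params, fisher, ewc_lambda)
            loss.backward()
            optimizer.step()
    return model

The oversampling factor and the EWC weight trade off fitting the patch against preserving performance on the rest of the data; concrete values would have to be tuned per dataset (the abstract reports results on multiple splits of TOP and SNIPS).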
Related papers
- Patch-Level Training for Large Language Models [69.67438563485887]
This paper introduces patch-level training for Large Language Models (LLMs).
During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch.
Following this, the model continues token-level training on the remaining training data to align with the inference mode.
arXiv Detail & Related papers (2024-07-17T15:48:39Z)
- Scalable Extraction of Training Data from (Production) Language Models [93.7746567808049]
This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset.
We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT.
arXiv Detail & Related papers (2023-11-28T18:47:03Z)
- Continual Pre-Training of Large Language Models: How to (re)warm your model? [21.8468835868142]
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available.
We study the warmup phase of models pre-trained on the Pile (upstream data, 300B tokens) as we continue to pre-train on SlimPajama (downstream data, 297B tokens).
Our results show that while re-warming models first increases the loss on upstream and downstream data, in the longer run it improves the downstream performance, outperforming models trained from scratch, even for a large downstream dataset.
arXiv Detail & Related papers (2023-08-08T03:18:18Z)
- Catastrophic Forgetting in the Context of Model Updates [0.360953887026184]
Deep neural networks can cost many thousands of dollars to train.
When new data comes in the pipeline, you can either train a new model from scratch on all existing data or fine-tune the existing model on the new data only.
The former is costly and slow. The latter is cheap and fast, but catastrophic forgetting generally causes the new model to 'forget' how to classify older data well.
arXiv Detail & Related papers (2023-06-16T21:21:41Z)
- $\Delta$-Patching: A Framework for Rapid Adaptation of Pre-trained Convolutional Networks without Base Performance Loss [71.46601663956521]
Models pre-trained on large-scale datasets are often fine-tuned to support newer tasks and datasets that arrive over time.
We propose $\Delta$-Patching for fine-tuning neural network models in an efficient manner, without the need to store model copies.
Our experiments show that $\Delta$-Networks outperform earlier model patching work while only requiring a fraction of parameters to be trained.
arXiv Detail & Related papers (2023-03-26T16:39:44Z)
- FlexiViT: One Model for All Patch Sizes [100.52574011880571]
Vision Transformers convert images to sequences by slicing them into patches.
The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost.
We show that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes.
arXiv Detail & Related papers (2022-12-15T18:18:38Z)
- RealPatch: A Statistical Matching Framework for Model Patching with Real Samples [6.245453620070586]
RealPatch is a framework for simpler, faster, and more data-efficient data augmentation based on statistical matching.
We show that RealPatch can successfully eliminate dataset leakage while reducing model leakage and maintaining high utility.
arXiv Detail & Related papers (2022-08-03T16:22:30Z)
- Forward Compatible Training for Representation Learning [53.300192863727226]
Backward compatible training (BCT) modifies training of the new model to make its representations compatible with those of the old model.
However, BCT can significantly hinder the performance of the new model.
In this work, we propose a new learning paradigm for representation learning: forward compatible training (FCT).
arXiv Detail & Related papers (2021-12-06T06:18:54Z)
- Training Recommender Systems at Scale: Communication-Efficient Model and Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least $100\times$ and $20\times$ during DP and MP, respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
arXiv Detail & Related papers (2020-10-18T01:44:42Z)
- A Practical Incremental Method to Train Deep CTR Models [37.54660958085938]
We introduce a practical incremental method to train deep CTR models, which consists of three decoupled modules.
Our method can achieve comparable performance to the conventional batch mode training with much better training efficiency.
arXiv Detail & Related papers (2020-09-04T12:35:42Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.