Optimizing Deeper Transformers on Small Datasets: An Application on
Text-to-SQL Semantic Parsing
- URL: http://arxiv.org/abs/2012.15355v1
- Date: Wed, 30 Dec 2020 22:53:49 GMT
- Title: Optimizing Deeper Transformers on Small Datasets: An Application on
Text-to-SQL Semantic Parsing
- Authors: Peng Xu, Wei Yang, Wenjie Zi, Keyi Tang, Chengyang Huang, Jackie Chi
Kit Cheung, Yanshuai Cao
- Abstract summary: We show that the benefits of very deep transformers carry over to hard structural prediction tasks, even on small datasets.
In particular, we successfully train 48 transformer layers for a semantic parsing task.
- Score: 23.280034406077757
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Due to the common belief that training deep transformers from scratch
requires large datasets, people usually only use shallow and simple additional
layers on top of pre-trained models during fine-tuning on small datasets. We
provide evidence that this does not always need to be the case: with proper
initialization and training techniques, the benefits of very deep transformers
are shown to carry over to hard structural prediction tasks, even using small
datasets. In particular, we successfully train 48 layers of transformers for a
semantic parsing task. These comprise 24 fine-tuned transformer layers from
pre-trained RoBERTa and 24 relation-aware transformer layers trained from
scratch. With fewer training steps and no task-specific pre-training, we obtain
state-of-the-art performance on the challenging cross-domain Text-to-SQL
semantic parsing benchmark Spider. We achieve this by deriving a novel
Data-dependent Transformer Fixed-update initialization scheme (DT-Fixup),
inspired by the prior T-Fixup work. Further error analysis demonstrates that increasing
the depth of the transformer model can help improve generalization on the cases
requiring reasoning and structural understanding.
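The abstract does not spell out the DT-Fixup update rule, so the sketch below only illustrates the general idea it builds on: a T-Fixup-style rescaling of the freshly initialized layers that also accounts for the magnitude of the actual input representations. The helper name `dt_fixup_init`, the specific scale factor, and the choice of which weights to rescale are illustrative assumptions, not the authors' exact formulation.

```python
# Illustrative sketch only: a T-Fixup-style, data-dependent rescaling of the
# transformer layers trained from scratch on top of a pre-trained encoder.
# The scale factor below is an assumption for demonstration, not the paper's formula.
import torch
import torch.nn as nn

def dt_fixup_init(new_layers: nn.ModuleList, sample_inputs: torch.Tensor) -> None:
    """Rescale residual-branch weights of freshly initialized transformer layers.

    new_layers:    blocks trained from scratch (e.g. 24 relation-aware layers)
    sample_inputs: encoder outputs for a batch of training data, shape (batch, seq, dim)
    """
    num_new_layers = len(new_layers)
    # Data-dependent term: magnitude of the representations fed into the new stack.
    max_input_norm = sample_inputs.norm(dim=-1).max().item()
    # Depth- and data-dependent scale (illustrative, in the spirit of T-Fixup).
    scale = (2 * num_new_layers) ** -0.5 / max(max_input_norm, 1e-6)
    for layer in new_layers:
        for name, param in layer.named_parameters():
            # Only shrink the weight matrices that write into the residual stream.
            if param.dim() > 1 and any(k in name for k in ("value", "out_proj", "linear")):
                with torch.no_grad():
                    param.mul_(scale)
```

In the paper's setting, `new_layers` would correspond to the 24 relation-aware transformer layers stacked on top of the 24 fine-tuned RoBERTa layers, with `sample_inputs` computed from a batch of training data before training starts.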
Related papers
- Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms such as low-rank computation perform well for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches.
This is the first time such a result has been demonstrated.
arXiv Detail & Related papers (2023-05-26T00:43:02Z) - Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., they learn models by gradient descent in their forward pass.
arXiv Detail & Related papers (2022-12-15T09:21:21Z) - On the Effect of Pre-training for Transformer in Different Modality on
Offline Reinforcement Learning [0.0]
We investigate how pre-training on data of different modalities, such as language and vision, affects the fine-tuning of Transformer-based models on MuJoCo offline reinforcement learning tasks.
arXiv Detail & Related papers (2022-11-17T13:34:08Z) - Discriminative and Generative Transformer-based Models For Situation
Entity Classification [8.029049649310211]
We re-examine the situation entity (SE) classification task with varying amounts of available training data.
We exploit a Transformer-based variational autoencoder to encode sentences into a lower dimensional latent space.
arXiv Detail & Related papers (2021-09-15T17:07:07Z) - DoT: An efficient Double Transformer for NLP tasks with tables [3.0079490585515343]
DoT is a double transformer model that decomposes the problem into two sub-tasks.
We show that, at the cost of a small drop in accuracy, DoT improves training and inference time by at least 50%.
arXiv Detail & Related papers (2021-06-01T13:33:53Z) - AutoTrans: Automating Transformer Design via Reinforced Architecture
Search [52.48985245743108]
This paper empirically explores how to set layer normalization, whether to scale, the number of layers, the number of heads, the activation function, etc., so that one can obtain a transformer architecture that better suits the task at hand.
Experiments on CoNLL03, Multi-30k, IWSLT14 and WMT-14 show that the searched transformer model can outperform standard transformers.
arXiv Detail & Related papers (2020-09-04T08:46:22Z) - Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z) - T-VSE: Transformer-Based Visual Semantic Embedding [5.317624228510748]
We show that transformer-based cross-modal embeddings outperform word average and RNN-based embeddings by a large margin, when trained on a large dataset of e-commerce product image-title pairs.
arXiv Detail & Related papers (2020-05-17T23:40:33Z) - On Layer Normalization in the Transformer Architecture [112.40350994368741]
We first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters.
We show in experiments that Pre-LN Transformers without the warm-up stage can reach results comparable to the baselines.
arXiv Detail & Related papers (2020-02-12T00:33:03Z)
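The last entry above concerns where layer normalization sits relative to the residual connection. A minimal sketch of a Pre-LN block (normalize before each sublayer, add the residual afterwards) follows; the module sizes and the use of `torch.nn.MultiheadAttention` are illustrative assumptions, not taken from that paper.

```python
# Minimal sketch of a Pre-LN transformer block: layer norm is applied before
# each sublayer and the residual is added afterwards. Hyperparameters and
# module choices here are illustrative assumptions.
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8, ff_dim: int = 2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-LN: normalize first, run the sublayer, then add the residual.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Same pattern for the feed-forward sublayer.
        x = x + self.ff(self.norm2(x))
        return x
```

A Post-LN block would instead apply the layer norm after each residual addition, which is the configuration the paper associates with needing a learning rate warm-up stage.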