The Devil is in the Detail: Simple Tricks Improve Systematic
Generalization of Transformers
- URL: http://arxiv.org/abs/2108.12284v1
- Date: Thu, 26 Aug 2021 17:26:56 GMT
- Title: The Devil is in the Detail: Simple Tricks Improve Systematic
Generalization of Transformers
- Authors: Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber
- Abstract summary: We report improvements on five popular datasets: SCAN, CFQ, PCFG, COGS, and Mathematics.
Our models improve accuracy from 50% to 85% on the PCFG productivity split, and from 35% to 81% on COGS.
This calls for proper generalization validation sets for developing neural networks that generalize systematically.
- Score: 8.424405898986118
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, many datasets have been proposed to test the systematic
generalization ability of neural networks. The companion baseline Transformers,
typically trained with default hyper-parameters from standard tasks, are shown
to fail dramatically. Here we demonstrate that by revisiting model
configurations as basic as scaling of embeddings, early stopping, relative
positional embedding, and Universal Transformer variants, we can drastically
improve the performance of Transformers on systematic generalization. We report
improvements on five popular datasets: SCAN, CFQ, PCFG, COGS, and Mathematics
dataset. Our models improve accuracy from 50% to 85% on the PCFG productivity
split, and from 35% to 81% on COGS. On SCAN, relative positional embedding
largely mitigates the EOS decision problem (Newman et al., 2020), yielding 100%
accuracy on the length split with a cutoff at 26. Importantly, performance
differences between these models are typically invisible on the IID data split.
This calls for proper generalization validation sets for developing neural
networks that generalize systematically. We publicly release the code to
reproduce our results.
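As an illustration of two of the configuration changes the abstract refers to, the sketch below shows token-embedding scaling by sqrt(d_model) and a learned relative positional bias added to the attention logits, written in PyTorch. The class names, hyper-parameters, and the per-offset bias parameterization are illustrative assumptions rather than the authors' exact implementation; the code released with the paper is the reference.

```python
# Minimal sketch of two "tricks" named in the abstract, under assumed
# hyper-parameters and a simple learned per-offset relative bias.
import math
import torch
import torch.nn as nn


class ScaledEmbedding(nn.Module):
    """Token embedding multiplied by sqrt(d_model), as in the original Transformer."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.scale = math.sqrt(d_model)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.embed(tokens) * self.scale


class RelativeBiasSelfAttention(nn.Module):
    """Single-head self-attention with one learned bias per relative offset."""

    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # One learnable scalar for each relative distance in [-(max_len-1), max_len-1].
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_len - 1))
        self.max_len = max_len

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, d = x.size(1), x.size(2)
        logits = self.q(x) @ self.k(x).transpose(1, 2) / math.sqrt(d)
        # Relative offsets j - i, shifted to index into rel_bias.
        pos = torch.arange(n, device=x.device)
        offsets = pos[None, :] - pos[:, None] + self.max_len - 1
        logits = logits + self.rel_bias[offsets]
        return torch.softmax(logits, dim=-1) @ self.v(x)


# Usage: embed a toy batch of 2 sequences of length 10 and run one attention layer.
emb = ScaledEmbedding(vocab_size=100, d_model=64)
attn = RelativeBiasSelfAttention(d_model=64)
out = attn(emb(torch.randint(0, 100, (2, 10))))  # shape: (2, 10, 64)
```

The remaining tricks from the abstract (early stopping against a generalization validation split rather than the IID split, and Universal Transformer-style weight sharing) are training and architecture choices that sit outside this snippet.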
Related papers
- Robust representations of oil wells' intervals via sparse attention mechanism [2.604557228169423]
We introduce a class of efficient Transformers named Regularized Transformers (Reguformers).
Our experiments focus on oil and gas data, namely well logs.
To evaluate our models on such problems, we work with an industry-scale open dataset consisting of well logs from more than 20 wells.
arXiv Detail & Related papers (2022-12-29T09:56:33Z)
- Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost [53.746169882193456]
Recent works have proposed various sparse attention modules to overcome the quadratic cost of self-attention.
We propose a model that addresses these problems by endowing each attention head with a mixed-membership Stochastic Block Model.
Our model outperforms previous efficient variants as well as the original Transformer with full attention.
arXiv Detail & Related papers (2022-10-27T15:30:52Z)
- The Unreasonable Effectiveness of Fully-Connected Layers for Low-Data Regimes [3.7189423451031356]
We propose a framework to improve generalization from small amounts of data.
We augment modern CNNs with fully-connected layers and show the massive impact this architectural change has in low-data regimes.
arXiv Detail & Related papers (2022-10-11T17:55:10Z)
- Evaluating natural language processing models with generalization metrics that do not need access to any training or testing data [66.11139091362078]
We provide the first model selection results on large pretrained Transformers from Huggingface using generalization metrics.
Despite their niche status, we find that metrics derived from the heavy-tail (HT) perspective are particularly useful in NLP tasks.
arXiv Detail & Related papers (2022-02-06T20:07:35Z)
- FQ-ViT: Fully Quantized Vision Transformer without Retraining [13.82845665713633]
We present a systematic method to reduce the performance degradation and inference complexity of Quantized Transformers.
We are the first to achieve comparable accuracy degradation (1%) on fully quantized Vision Transformers.
arXiv Detail & Related papers (2021-11-27T06:20:53Z)
- Self-Supervised Pre-Training for Transformer-Based Person Re-Identification [54.55281692768765]
Transformer-based supervised pre-training achieves great performance in person re-identification (ReID).
Due to the domain gap between ImageNet and ReID datasets, it usually requires a larger pre-training dataset to boost performance.
This work aims to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure.
arXiv Detail & Related papers (2021-11-23T18:59:08Z)
- The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization [8.424405898986118]
We propose two modifications to the Transformer architecture, copy gate and geometric attention.
Our novel Neural Data Router (NDR) achieves 100% length generalization accuracy on the classic compositional table lookup task.
NDR's attention and gating patterns tend to be interpretable as an intuitive form of neural routing.
arXiv Detail & Related papers (2021-10-14T21:24:27Z)
- Inducing Transformer's Compositional Generalization Ability via Auxiliary Sequence Prediction Tasks [86.10875837475783]
Systematic compositionality is an essential mechanism in human language, allowing the recombination of known parts to create novel expressions.
Existing neural models have been shown to lack this basic ability in learning symbolic structures.
We propose two auxiliary sequence prediction tasks that track the progress of function and argument semantics.
arXiv Detail & Related papers (2021-09-30T16:41:19Z)
- nnFormer: Interleaved Transformer for Volumetric Segmentation [50.10441845967601]
We introduce nnFormer, a powerful segmentation model with an interleaved architecture based on an empirical combination of self-attention and convolution.
nnFormer achieves tremendous improvements over previous transformer-based methods on two commonly used datasets, Synapse and ACDC.
arXiv Detail & Related papers (2021-09-07T17:08:24Z)
- Global Filter Networks for Image Classification [90.81352483076323]
We present a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity.
Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
arXiv Detail & Related papers (2021-07-01T17:58:16Z)