The Devil is in the Detail: Simple Tricks Improve Systematic
Generalization of Transformers
- URL: http://arxiv.org/abs/2108.12284v1
- Date: Thu, 26 Aug 2021 17:26:56 GMT
- Title: The Devil is in the Detail: Simple Tricks Improve Systematic
Generalization of Transformers
- Authors: Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber
- Abstract summary: We report improvements on five popular datasets: SCAN, CFQ, PCFG, COGS, and Mathematics.
Our models improve accuracy from 50% to 85% on the PCFG productivity split, and from 35% to 81% on COGS.
This calls for proper generalization validation sets for developing neural networks that generalize systematically.
- Score: 8.424405898986118
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, many datasets have been proposed to test the systematic
generalization ability of neural networks. The companion baseline Transformers,
typically trained with default hyper-parameters from standard tasks, are shown
to fail dramatically. Here we demonstrate that by revisiting model
configurations as basic as scaling of embeddings, early stopping, relative
positional embedding, and Universal Transformer variants, we can drastically
improve the performance of Transformers on systematic generalization. We report
improvements on five popular datasets: SCAN, CFQ, PCFG, COGS, and Mathematics
dataset. Our models improve accuracy from 50% to 85% on the PCFG productivity
split, and from 35% to 81% on COGS. On SCAN, relative positional embedding
largely mitigates the EOS decision problem (Newman et al., 2020), yielding 100%
accuracy on the length split with a cutoff at 26. Importantly, performance
differences between these models are typically invisible on the IID data split.
This calls for proper generalization validation sets for developing neural
networks that generalize systematically. We publicly release the code to
reproduce our results.
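As an illustration of two of the configuration changes the abstract refers to, the sketch below shows token-embedding scaling by sqrt(d_model) and a learned relative positional bias added to the attention logits, written in PyTorch. The class names, hyper-parameters, and the per-offset bias parameterization are illustrative assumptions rather than the authors' exact implementation; the code released with the paper is the reference.

```python
# Minimal sketch of two "tricks" named in the abstract, under assumed
# hyper-parameters and a simple learned per-offset relative bias.
import math
import torch
import torch.nn as nn


class ScaledEmbedding(nn.Module):
    """Token embedding multiplied by sqrt(d_model), as in the original Transformer."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.scale = math.sqrt(d_model)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.embed(tokens) * self.scale


class RelativeBiasSelfAttention(nn.Module):
    """Single-head self-attention with one learned bias per relative offset."""

    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # One learnable scalar for each relative distance in [-(max_len-1), max_len-1].
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_len - 1))
        self.max_len = max_len

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, d = x.size(1), x.size(2)
        logits = self.q(x) @ self.k(x).transpose(1, 2) / math.sqrt(d)
        # Relative offsets j - i, shifted to index into rel_bias.
        pos = torch.arange(n, device=x.device)
        offsets = pos[None, :] - pos[:, None] + self.max_len - 1
        logits = logits + self.rel_bias[offsets]
        return torch.softmax(logits, dim=-1) @ self.v(x)


# Usage: embed a toy batch of 2 sequences of length 10 and run one attention layer.
emb = ScaledEmbedding(vocab_size=100, d_model=64)
attn = RelativeBiasSelfAttention(d_model=64)
out = attn(emb(torch.randint(0, 100, (2, 10))))  # shape: (2, 10, 64)
```

The remaining tricks from the abstract (early stopping against a generalization validation split rather than the IID split, and Universal Transformer-style weight sharing) are training and architecture choices that sit outside this snippet.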
Related papers
- Robust representations of oil wells' intervals via sparse attention mechanism [2.604557228169423]
We introduce a class of efficient Transformers named Regularized Transformers (Reguformers).
Our experiments focus on oil and gas data, namely well logs.
To evaluate our models on such problems, we work with an industry-scale open dataset consisting of well logs from more than 20 wells.
arXiv Detail & Related papers (2022-12-29T09:56:33Z)
- Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost [53.746169882193456]
Recent works have proposed various sparse attention modules to overcome the quadratic cost of self-attention.
We propose a model that addresses these problems by endowing each attention head with a mixed-membership Stochastic Block Model.
Our model outperforms previous efficient variants as well as the original Transformer with full attention.
arXiv Detail & Related papers (2022-10-27T15:30:52Z)
- The Unreasonable Effectiveness of Fully-Connected Layers for Low-Data Regimes [3.7189423451031356]
We propose a framework to improve generalization from small amounts of data.
We augment modern CNNs with fully-connected layers and show the massive impact this architectural change has in low-data regimes.
arXiv Detail & Related papers (2022-10-11T17:55:10Z)
- Evaluating natural language processing models with generalization metrics that do not need access to any training or testing data [66.11139091362078]
We provide the first model selection results on large pretrained Transformers from Huggingface using generalization metrics.
Despite their niche status, we find that metrics derived from the heavy-tail (HT) perspective are particularly useful in NLP tasks.
arXiv Detail & Related papers (2022-02-06T20:07:35Z)
- FQ-ViT: Fully Quantized Vision Transformer without Retraining [13.82845665713633]
We present a systematic method to reduce the performance degradation and inference complexity of Quantized Transformers.
We are the first to achieve comparable accuracy degradation (1%) on fully quantized Vision Transformers.
arXiv Detail & Related papers (2021-11-27T06:20:53Z)
- Self-Supervised Pre-Training for Transformer-Based Person Re-Identification [54.55281692768765]
Transformer-based supervised pre-training achieves great performance in person re-identification (ReID).
Due to the domain gap between ImageNet and ReID datasets, it usually requires a larger pre-training dataset to boost performance.
This work aims to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure.
arXiv Detail & Related papers (2021-11-23T18:59:08Z)
- The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization [8.424405898986118]
We propose two modifications to the Transformer architecture, copy gate and geometric attention.
Our novel Neural Data Router (NDR) achieves 100% length generalization accuracy on the classic compositional table lookup task.
NDR's attention and gating patterns tend to be interpretable as an intuitive form of neural routing.
arXiv Detail & Related papers (2021-10-14T21:24:27Z)
- Inducing Transformer's Compositional Generalization Ability via Auxiliary Sequence Prediction Tasks [86.10875837475783]
Systematic compositionality is an essential mechanism in human language, allowing the recombination of known parts to create novel expressions.
Existing neural models have been shown to lack this basic ability in learning symbolic structures.
We propose two auxiliary sequence prediction tasks that track the progress of function and argument semantics.
arXiv Detail & Related papers (2021-09-30T16:41:19Z)
- nnFormer: Interleaved Transformer for Volumetric Segmentation [50.10441845967601]
We introduce nnFormer, a powerful segmentation model with an interleaved architecture based on an empirical combination of self-attention and convolution.
nnFormer achieves tremendous improvements over previous transformer-based methods on two commonly used datasets, Synapse and ACDC.
arXiv Detail & Related papers (2021-09-07T17:08:24Z)
- Global Filter Networks for Image Classification [90.81352483076323]
We present a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity.
Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
arXiv Detail & Related papers (2021-07-01T17:58:16Z)