Do Transformers use variable binding?
- URL: http://arxiv.org/abs/2203.00162v1
- Date: Sat, 19 Feb 2022 09:56:38 GMT
- Title: Do Transformers use variable binding?
- Authors: Tommi Gröndahl and N. Asokan
- Abstract summary: Increasing the explainability of deep neural networks (DNNs) requires evaluating whether they implement symbolic computation.
One central symbolic capacity is variable binding: linking an input value to an abstract variable held in system-internal memory.
We provide the first systematic evaluation of the variable binding capacities of the state-of-the-art Transformer networks BERT and RoBERTa.
- Score: 14.222494511474103
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Increasing the explainability of deep neural networks (DNNs) requires
evaluating whether they implement symbolic computation. One central symbolic
capacity is variable binding: linking an input value to an abstract variable
held in system-internal memory. Prior work on the computational abilities of
DNNs has not resolved the question of whether their internal processes involve
variable binding. We argue that the reason for this is fundamental, inherent in
the way experiments in prior work were designed. We provide the first
systematic evaluation of the variable binding capacities of the
state-of-the-art Transformer networks BERT and RoBERTa. Our experiments are
designed such that the model must generalize a rule across disjoint subsets of
the input vocabulary, and cannot rely on associative pattern matching alone.
The results show a clear discrepancy between classification and
sequence-to-sequence tasks: BERT and RoBERTa can easily learn to copy or
reverse strings even when trained on task-specific vocabularies that are
switched in the test set; but both models completely fail to generalize across
vocabularies in similar sequence classification tasks. These findings indicate
that the effectiveness of Transformers in sequence modelling may lie in their
extensive use of the input itself as an external "memory" rather than
network-internal symbolic operations involving variable binding. Therefore, we
propose a novel direction for future work: augmenting the inputs available to
circumvent the lack of network-internal variable binding.
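As a rough illustration of the evaluation protocol the abstract describes, the sketch below builds a copy/reverse sequence-to-sequence dataset in which the rule stays fixed while the vocabulary is swapped between training and test. The symbol sets, sequence length, and helper names are assumptions for illustration only, not the authors' actual data or code.

```python
# Hypothetical sketch of a disjoint-vocabulary generalization test:
# train on a copy/reverse task over one vocabulary, then evaluate on a
# second, non-overlapping vocabulary. A model that has learned the rule
# (rather than symbol-specific associations) should transfer unchanged.
import random

TRAIN_VOCAB = list("abcdefghij")   # symbols seen only during training
TEST_VOCAB = list("qrstuvwxyz")    # disjoint symbols used only at test time

def make_examples(vocab, task, n=1000, length=6):
    """Generate (input, target) string pairs for a copy or reverse task."""
    examples = []
    for _ in range(n):
        seq = [random.choice(vocab) for _ in range(length)]
        target = seq if task == "copy" else list(reversed(seq))
        examples.append(("".join(seq), "".join(target)))
    return examples

train_pairs = make_examples(TRAIN_VOCAB, task="reverse")
test_pairs = make_examples(TEST_VOCAB, task="reverse")
print(train_pairs[0], test_pairs[0])
```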
Related papers
- Comateformer: Combined Attention Transformer for Semantic Sentence Matching [11.746010399185437]
We propose a novel semantic sentence matching model, a Combined Attention Network based on the Transformer model (Comateformer).
In Comateformer model, we design a novel transformer-based quasi-attention mechanism with compositional properties.
Our proposed approach builds on the intuition of similarity and dissimilarity (negative affinity) when calculating dual affinity scores.
arXiv Detail & Related papers (2024-12-10T06:18:07Z)
- Algorithmic Capabilities of Random Transformers [49.73113518329544]
We investigate what functions can be learned by randomly initialized transformers in which only the embedding layers are optimized (see the sketch after this list).
We find that these random transformers can perform a wide range of meaningful algorithmic tasks.
Our results indicate that some algorithmic capabilities are present in transformers even before these models are trained.
arXiv Detail & Related papers (2024-10-06T06:04:23Z)
- In-Context Learning for MIMO Equalization Using Transformer-Based Sequence Models [44.161789477821536]
Large pre-trained sequence models have the capacity to carry out in-context learning (ICL).
In ICL, a decision on a new input is made by directly mapping the input, together with a few examples from the given task, to the output.
We demonstrate via numerical results that transformer-based ICL has a threshold behavior.
arXiv Detail & Related papers (2023-11-10T15:09:04Z)
- Causal Interpretation of Self-Attention in Pre-Trained Transformers [4.419843514606336]
We propose a causal interpretation of self-attention in the Transformer neural network architecture.
We use self-attention as a mechanism that estimates a structural equation model for a given input sequence of symbols.
We demonstrate this method by providing causal explanations for the outcomes of Transformers in two tasks: sentiment classification (NLP) and recommendation.
arXiv Detail & Related papers (2023-10-31T09:27:12Z)
- When can transformers reason with abstract symbols? [25.63285482210457]
We prove that for any relational reasoning task in a large family of tasks, transformers learn the abstract relations and generalize to the test set.
This is in contrast to classical fully-connected networks, which we prove fail to learn to reason.
arXiv Detail & Related papers (2023-10-15T06:45:38Z)
- All Roads Lead to Rome? Exploring the Invariance of Transformers' Representations [69.3461199976959]
We propose a model based on invertible neural networks, BERT-INN, to learn the Bijection Hypothesis.
We show the advantage of BERT-INN both theoretically and through extensive experiments.
arXiv Detail & Related papers (2023-05-23T22:30:43Z)
- Self-Supervised Learning for Group Equivariant Neural Networks [75.62232699377877]
Group equivariant neural networks are models whose structure is constrained to commute with transformations of the input.
We propose two concepts for self-supervised tasks: equivariant pretext labels and invariant contrastive loss.
Experiments on standard image recognition benchmarks demonstrate that the equivariant neural networks exploit the proposed self-supervised tasks.
arXiv Detail & Related papers (2023-03-08T08:11:26Z)
- Inducing Transformer's Compositional Generalization Ability via Auxiliary Sequence Prediction Tasks [86.10875837475783]
Systematic compositionality is an essential mechanism in human language, allowing the recombination of known parts to create novel expressions.
Existing neural models have been shown to lack this basic ability in learning symbolic structures.
We propose two auxiliary sequence prediction tasks that track the progress of function and argument semantics.
arXiv Detail & Related papers (2021-09-30T16:41:19Z)
- Category-Learning with Context-Augmented Autoencoder [63.05016513788047]
Finding an interpretable non-redundant representation of real-world data is one of the key problems in Machine Learning.
We propose a novel method of using data augmentations when training autoencoders.
We train a Variational Autoencoder in such a way that the transformation outcome is predictable by an auxiliary network.
arXiv Detail & Related papers (2020-10-10T14:04:44Z)
- $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)
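The sketch referenced in the "Algorithmic Capabilities of Random Transformers" entry above: a minimal PyTorch illustration of optimizing only the embedding layers while the transformer body stays at its random initialization. The layer sizes and the choice to also train an output readout are assumptions for illustration, not the authors' exact configuration or code.

```python
# Minimal sketch: keep a randomly initialized transformer encoder frozen and
# optimize only the embedding (and, as an assumption here, a readout layer).
import torch
import torch.nn as nn

vocab_size, d_model = 100, 64
embedding = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)
readout = nn.Linear(d_model, vocab_size)

# Freeze the randomly initialized transformer body.
for p in encoder.parameters():
    p.requires_grad = False

# Only the embedding and readout parameters are passed to the optimizer.
trainable = list(embedding.parameters()) + list(readout.parameters())
optimizer = torch.optim.Adam(trainable, lr=1e-3)

tokens = torch.randint(0, vocab_size, (8, 16))   # dummy batch of token ids
logits = readout(encoder(embedding(tokens)))     # shape: (8, 16, vocab_size)
```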