I-BERT: Inductive Generalization of Transformer to Arbitrary Context Lengths
- URL: http://arxiv.org/abs/2006.10220v2
- Date: Fri, 19 Jun 2020 20:39:09 GMT
- Title: I-BERT: Inductive Generalization of Transformer to Arbitrary Context Lengths
- Authors: Hyoungwook Nam, Seung Byum Seo, Vikram Sharma Mailthody, Noor Michael, Lan Li
- Abstract summary: Self-attention has emerged as a vital component of state-of-the-art sequence-to-sequence models for natural language processing.
We propose I-BERT, a bi-directional Transformer that replaces positional encodings with a recurrent layer.
- Score: 2.604653544948958
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-attention has emerged as a vital component of state-of-the-art
sequence-to-sequence models for natural language processing in recent years,
brought to the forefront by pre-trained bi-directional Transformer models. Its
effectiveness is partly due to its non-sequential architecture, which promotes
scalability and parallelism but limits the model to inputs of a bounded length.
In particular, such architectures perform poorly on algorithmic tasks, where
the model must learn a procedure which generalizes to input lengths unseen in
training, a capability we refer to as inductive generalization. Identifying the
computational limits of existing self-attention mechanisms, we propose I-BERT,
a bi-directional Transformer that replaces positional encodings with a
recurrent layer. The model inductively generalizes on a variety of algorithmic
tasks where state-of-the-art Transformer models fail to do so. We also test our
method on masked language modeling tasks where training and validation sets are
partitioned to verify inductive generalization. Out of three algorithmic and
two natural language inductive generalization tasks, I-BERT achieves
state-of-the-art results on four tasks.
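To make the architectural idea concrete, here is a minimal sketch (not the authors' implementation; it assumes PyTorch, and the choice of a bidirectional LSTM as the recurrent layer as well as all hyperparameters are illustrative) of a bidirectional Transformer encoder that obtains position information from a recurrent layer rather than a fixed-size positional encoding, so no component of the model is tied to a maximum input length:

```python
# Minimal sketch, assuming PyTorch: position information comes from a bidirectional
# recurrent layer rather than an additive positional-encoding table, so the encoder
# has no component bounded by a maximum length. Sizes and the LSTM choice are illustrative.
import torch
import torch.nn as nn

class RecurrentPositionLayer(nn.Module):
    """Bidirectional LSTM whose outputs stand in for positional encodings."""
    def __init__(self, d_model: int):
        super().__init__()
        # d_model // 2 hidden units per direction keeps the output width at d_model.
        self.rnn = nn.LSTM(d_model, d_model // 2, batch_first=True, bidirectional=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.rnn(x)  # (batch, seq_len, d_model); order is now encoded recurrently
        return out

class IBertLikeEncoder(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 256, n_heads: int = 4, n_layers: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = RecurrentPositionLayer(d_model)  # replaces sinusoidal/learned encodings
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.pos(self.embed(tokens))  # no positional table of fixed maximum length
        return self.encoder(h)            # bidirectional self-attention over the sequence

# Usage: the same weights accept sequences longer than any seen during training,
# which is what the inductive-generalization evaluation above probes.
model = IBertLikeEncoder(vocab_size=100)
hidden = model(torch.randint(0, 100, (2, 64)))  # (2, 64, 256)
```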
Related papers
- Interpreting Affine Recurrence Learning in GPT-style Transformers [54.01174470722201]
In-context learning allows GPT-style transformers to generalize during inference without modifying their weights.
This paper focuses specifically on their ability to learn and predict affine recurrences as an ICL task.
We analyze the model's internal operations using both empirical and theoretical approaches.
arXiv Detail & Related papers (2024-10-22T21:30:01Z)
- Bidirectional Awareness Induction in Autoregressive Seq2Seq Models [47.82947878753809]
Bidirectional Awareness Induction (BAI) is a training method that leverages a subset of elements in the network, the Pivots, to perform bidirectional learning without breaking the autoregressive constraints.
In particular, we observed an increase of up to 2.4 CIDEr in Image-Captioning, 4.96 BLEU in Neural Machine Translation, and 1.16 ROUGE in Text Summarization compared to the respective baselines.
arXiv Detail & Related papers (2024-08-25T23:46:35Z)
- Transformers meet Neural Algorithmic Reasoners [16.5785372289558]
We propose a novel approach that combines the Transformer's language understanding with the robustness of graph neural network (GNN)-based neural algorithmic reasoners (NARs).
We evaluate our resulting TransNAR model on CLRS-Text, the text-based version of the CLRS-30 benchmark, and demonstrate significant gains over Transformer-only models for algorithmic reasoning.
arXiv Detail & Related papers (2024-06-13T16:42:06Z)
- Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL.
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A single transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z)
- Transformers as Algorithms: Generalization and Implicit Model Selection in In-context Learning [23.677503557659705]
In-context learning (ICL) is a type of prompting where a transformer model operates on a sequence of examples and performs inference on-the-fly.
We treat the transformer model as a learning algorithm that can be specialized via training to implement, at inference time, another target algorithm.
We show that transformers can act as an adaptive learning algorithm and perform model selection across different hypothesis classes.
arXiv Detail & Related papers (2023-01-17T18:31:12Z)
- Real-World Compositional Generalization with Disentangled Sequence-to-Sequence Learning [81.24269148865555]
A recently proposed Disentangled sequence-to-sequence model (Dangle) shows promising generalization capability.
We introduce two key modifications to this model which encourage more disentangled representations and improve its compute and memory efficiency.
Specifically, instead of adaptively re-encoding source keys and values at each time step, we disentangle their representations and only re-encode keys periodically.
arXiv Detail & Related papers (2022-12-12T15:40:30Z)
- Structured Reordering for Modeling Latent Alignments in Sequence Transduction [86.94309120789396]
We present an efficient dynamic programming algorithm performing exact marginal inference of separable permutations.
The resulting seq2seq model exhibits better systematic generalization than standard models on synthetic problems and NLP tasks.
arXiv Detail & Related papers (2021-06-06T21:53:54Z)
- Pretrained Transformers as Universal Computation Engines [105.00539596788127]
We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning.
We study finetuning it on a variety of sequence classification tasks spanning numerical computation, vision, and protein fold prediction.
We find that such pretraining enables the resulting Frozen Pretrained Transformer (FPT) to generalize zero-shot to these modalities, matching the performance of a transformer fully trained on these tasks.
arXiv Detail & Related papers (2021-03-09T06:39:56Z)
- Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers [42.93754828584075]
We present a new Transformer architecture, Performer, based on Fast Attention Via Orthogonal Random features (FAVOR).
Our mechanism scales linearly rather than quadratically in the number of tokens in the sequence, is characterized by sub-quadratic space complexity and does not incorporate any sparsity pattern priors.
It provides strong theoretical guarantees: unbiased estimation of the attention matrix and uniform convergence.
arXiv Detail & Related papers (2020-06-05T17:09:16Z)
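As a rough illustration of the FAVOR idea in the last entry, the sketch below is an assumption-laden toy, not the Performer code: it uses NumPy, positive Gaussian random features in the spirit of FAVOR (without the orthogonalization the name refers to), and arbitrary sizes, to show how a random-feature estimate of the softmax kernel lets attention be computed in time linear, rather than quadratic, in sequence length:

```python
# Minimal sketch, assuming NumPy: FAVOR-style positive random features give a
# low-rank estimate of the softmax kernel, so attention can be computed by
# associativity in O(L * m * d) instead of O(L^2 * d). For simplicity the features
# here are i.i.d. Gaussian rather than orthogonalized; sizes are illustrative.
import numpy as np

def positive_random_features(x, w):
    # phi(x)_j = exp(w_j . x - ||x||^2 / 2) / sqrt(m), with rows of w drawn from N(0, I).
    m = w.shape[0]
    return np.exp(x @ w.T - 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)) / np.sqrt(m)

def favor_attention(Q, K, V, n_features=256, seed=0):
    d = Q.shape[-1]
    w = np.random.default_rng(seed).normal(size=(n_features, d))
    # Scaling queries/keys by d**-0.25 makes phi(q) . phi(k) estimate exp(q . k / sqrt(d)).
    q_prime = positive_random_features(Q / d ** 0.25, w)   # (L, m)
    k_prime = positive_random_features(K / d ** 0.25, w)   # (L, m)
    kv = k_prime.T @ V                          # (m, d_v), computed once for all queries
    normalizer = q_prime @ k_prime.sum(axis=0)  # row sums of the implicit attention matrix
    return (q_prime @ kv) / normalizer[:, None]

# Compare against exact softmax attention on a short sequence.
rng = np.random.default_rng(1)
Q, K, V = [rng.normal(size=(32, 16)) for _ in range(3)]
scores = np.exp(Q @ K.T / np.sqrt(16))
exact = (scores / scores.sum(axis=1, keepdims=True)) @ V
approx = favor_attention(Q, K, V)
print(np.max(np.abs(exact - approx)))  # approximation error shrinks as n_features grows
```

The random-feature dot product estimates the softmax kernel without bias, which is the source of the theoretical guarantees mentioned above; the trade-off is estimator variance, which decreases as the number of random features grows.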
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.