I-BERT: Inductive Generalization of Transformer to Arbitrary Context Lengths
- URL: http://arxiv.org/abs/2006.10220v2
- Date: Fri, 19 Jun 2020 20:39:09 GMT
- Title: I-BERT: Inductive Generalization of Transformer to Arbitrary Context Lengths
- Authors: Hyoungwook Nam, Seung Byum Seo, Vikram Sharma Mailthody, Noor Michael, Lan Li
- Abstract summary: Self-attention has emerged as a vital component of state-of-the-art sequence-to-sequence models for natural language processing.
We propose I-BERT, a bi-directional Transformer that replaces positional encodings with a recurrent layer.
- Score: 2.604653544948958
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-attention has emerged as a vital component of state-of-the-art
sequence-to-sequence models for natural language processing in recent years,
brought to the forefront by pre-trained bi-directional Transformer models. Its
effectiveness is partly due to its non-sequential architecture, which promotes
scalability and parallelism but limits the model to inputs of a bounded length.
In particular, such architectures perform poorly on algorithmic tasks, where
the model must learn a procedure which generalizes to input lengths unseen in
training, a capability we refer to as inductive generalization. Identifying the
computational limits of existing self-attention mechanisms, we propose I-BERT,
a bi-directional Transformer that replaces positional encodings with a
recurrent layer. The model inductively generalizes on a variety of algorithmic
tasks where state-of-the-art Transformer models fail to do so. We also test our
method on masked language modeling tasks where training and validation sets are
partitioned to verify inductive generalization. Out of three algorithmic and
two natural language inductive generalization tasks, I-BERT achieves
state-of-the-art results on four tasks.
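To make the architectural idea concrete, here is a minimal sketch (not the authors' implementation; it assumes PyTorch, and the choice of a bidirectional LSTM as the recurrent layer as well as all hyperparameters are illustrative) of a bidirectional Transformer encoder that obtains position information from a recurrent layer rather than a fixed-size positional encoding, so no component of the model is tied to a maximum input length:

```python
# Minimal sketch, assuming PyTorch: position information comes from a bidirectional
# recurrent layer rather than an additive positional-encoding table, so the encoder
# has no component bounded by a maximum length. Sizes and the LSTM choice are illustrative.
import torch
import torch.nn as nn

class RecurrentPositionLayer(nn.Module):
    """Bidirectional LSTM whose outputs stand in for positional encodings."""
    def __init__(self, d_model: int):
        super().__init__()
        # d_model // 2 hidden units per direction keeps the output width at d_model.
        self.rnn = nn.LSTM(d_model, d_model // 2, batch_first=True, bidirectional=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.rnn(x)  # (batch, seq_len, d_model); order is now encoded recurrently
        return out

class IBertLikeEncoder(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 256, n_heads: int = 4, n_layers: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = RecurrentPositionLayer(d_model)  # replaces sinusoidal/learned encodings
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.pos(self.embed(tokens))  # no positional table of fixed maximum length
        return self.encoder(h)            # bidirectional self-attention over the sequence

# Usage: the same weights accept sequences longer than any seen during training,
# which is what the inductive-generalization evaluation above probes.
model = IBertLikeEncoder(vocab_size=100)
hidden = model(torch.randint(0, 100, (2, 64)))  # (2, 64, 256)
```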
Related papers
- Interpreting Affine Recurrence Learning in GPT-style Transformers [54.01174470722201]
In-context learning allows GPT-style transformers to generalize during inference without modifying their weights.
This paper focuses specifically on their ability to learn and predict affine recurrences as an ICL task.
We analyze the model's internal operations using both empirical and theoretical approaches.
arXiv Detail & Related papers (2024-10-22T21:30:01Z)
- Bidirectional Awareness Induction in Autoregressive Seq2Seq Models [47.82947878753809]
Bidirectional Awareness Induction (BAI) is a training method that leverages a subset of elements in the network, the Pivots, to perform bidirectional learning without breaking the autoregressive constraints.
In particular, we observed an increase of up to 2.4 CIDEr in Image-Captioning, 4.96 BLEU in Neural Machine Translation, and 1.16 ROUGE in Text Summarization compared to the respective baselines.
arXiv Detail & Related papers (2024-08-25T23:46:35Z)
- Transformers meet Neural Algorithmic Reasoners [16.5785372289558]
We propose a novel approach that combines the Transformer's language understanding with the robustness of graph neural network (GNN)-based neural algorithmic reasoners (NARs).
We evaluate our resulting TransNAR model on CLRS-Text, the text-based version of the CLRS-30 benchmark, and demonstrate significant gains over Transformer-only models for algorithmic reasoning.
arXiv Detail & Related papers (2024-06-13T16:42:06Z)
- Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL.
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A single transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z)
- Transformers as Algorithms: Generalization and Implicit Model Selection in In-context Learning [23.677503557659705]
In-context learning (ICL) is a type of prompting where a transformer model operates on a sequence of examples and performs inference on-the-fly.
We treat the transformer model as a learning algorithm that can be specialized via training to implement, at inference time, another target algorithm.
We show that transformers can act as an adaptive learning algorithm and perform model selection across different hypothesis classes.
arXiv Detail & Related papers (2023-01-17T18:31:12Z)
- Real-World Compositional Generalization with Disentangled Sequence-to-Sequence Learning [81.24269148865555]
A recently proposed Disentangled sequence-to-sequence model (Dangle) shows promising generalization capability.
We introduce two key modifications to this model which encourage more disentangled representations and improve its compute and memory efficiency.
Specifically, instead of adaptively re-encoding source keys and values at each time step, we disentangle their representations and only re-encode keys periodically.
arXiv Detail & Related papers (2022-12-12T15:40:30Z)
- Structured Reordering for Modeling Latent Alignments in Sequence Transduction [86.94309120789396]
We present an efficient dynamic programming algorithm performing exact marginal inference of separable permutations.
The resulting seq2seq model exhibits better systematic generalization than standard models on synthetic problems and NLP tasks.
arXiv Detail & Related papers (2021-06-06T21:53:54Z)
- Pretrained Transformers as Universal Computation Engines [105.00539596788127]
We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning.
We study finetuning it on a variety of sequence classification tasks spanning numerical computation, vision, and protein fold prediction.
We find that such pretraining enables the resulting Frozen Pretrained Transformer (FPT) to generalize zero-shot to these modalities, matching the performance of a transformer fully trained on these tasks.
arXiv Detail & Related papers (2021-03-09T06:39:56Z)
- Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers [42.93754828584075]
We present a new Transformer architecture, Performer, based on Fast Attention Via Orthogonal Random features (FAVOR).
Our mechanism scales linearly rather than quadratically in the number of tokens in the sequence, is characterized by sub-quadratic space complexity and does not incorporate any sparsity pattern priors.
It provides strong theoretical guarantees: unbiased estimation of the attention matrix and uniform convergence.
arXiv Detail & Related papers (2020-06-05T17:09:16Z)
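As a rough illustration of the FAVOR idea in the last entry, the sketch below is an assumption-laden toy, not the Performer code: it uses NumPy, positive Gaussian random features in the spirit of FAVOR (without the orthogonalization the name refers to), and arbitrary sizes, to show how a random-feature estimate of the softmax kernel lets attention be computed in time linear, rather than quadratic, in sequence length:

```python
# Minimal sketch, assuming NumPy: FAVOR-style positive random features give a
# low-rank estimate of the softmax kernel, so attention can be computed by
# associativity in O(L * m * d) instead of O(L^2 * d). For simplicity the features
# here are i.i.d. Gaussian rather than orthogonalized; sizes are illustrative.
import numpy as np

def positive_random_features(x, w):
    # phi(x)_j = exp(w_j . x - ||x||^2 / 2) / sqrt(m), with rows of w drawn from N(0, I).
    m = w.shape[0]
    return np.exp(x @ w.T - 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)) / np.sqrt(m)

def favor_attention(Q, K, V, n_features=256, seed=0):
    d = Q.shape[-1]
    w = np.random.default_rng(seed).normal(size=(n_features, d))
    # Scaling queries/keys by d**-0.25 makes phi(q) . phi(k) estimate exp(q . k / sqrt(d)).
    q_prime = positive_random_features(Q / d ** 0.25, w)   # (L, m)
    k_prime = positive_random_features(K / d ** 0.25, w)   # (L, m)
    kv = k_prime.T @ V                          # (m, d_v), computed once for all queries
    normalizer = q_prime @ k_prime.sum(axis=0)  # row sums of the implicit attention matrix
    return (q_prime @ kv) / normalizer[:, None]

# Compare against exact softmax attention on a short sequence.
rng = np.random.default_rng(1)
Q, K, V = [rng.normal(size=(32, 16)) for _ in range(3)]
scores = np.exp(Q @ K.T / np.sqrt(16))
exact = (scores / scores.sum(axis=1, keepdims=True)) @ V
approx = favor_attention(Q, K, V)
print(np.max(np.abs(exact - approx)))  # approximation error shrinks as n_features grows
```

The random-feature dot product estimates the softmax kernel without bias, which is the source of the theoretical guarantees mentioned above; the trade-off is estimator variance, which decreases as the number of random features grows.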
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.