Related papers: SIP: Injecting a Structural Inductive Bias into a Seq2Seq Model by Simulation

SIP: Injecting a Structural Inductive Bias into a Seq2Seq Model by Simulation

URL: http://arxiv.org/abs/2310.00796v3
Date: Wed, 10 Jul 2024 17:09:58 GMT
Title: SIP: Injecting a Structural Inductive Bias into a Seq2Seq Model by Simulation
Authors: Matthias Lindemann, Alexander Koller, Ivan Titov,
Abstract summary: We show how a structural inductive bias can be efficiently injected into a seq2seq model by pre-training it to simulate structural transformations on synthetic data. Our experiments show that our method imparts the desired inductive bias, resulting in better few-shot learning for FST-like tasks.
Score: 75.14793516745374
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Strong inductive biases enable learning from little data and help generalization outside of the training distribution. Popular neural architectures such as Transformers lack strong structural inductive biases for seq2seq NLP tasks on their own. Consequently, they struggle with systematic generalization beyond the training distribution, e.g. with extrapolating to longer inputs, even when pre-trained on large amounts of text. We show how a structural inductive bias can be efficiently injected into a seq2seq model by pre-training it to simulate structural transformations on synthetic data. Specifically, we inject an inductive bias towards Finite State Transducers (FSTs) into a Transformer by pre-training it to simulate FSTs given their descriptions. Our experiments show that our method imparts the desired inductive bias, resulting in improved systematic generalization and better few-shot learning for FST-like tasks. Our analysis shows that fine-tuned models accurately capture the state dynamics of the unseen underlying FSTs, suggesting that the simulation process is internalized by the fine-tuned model.

Related papers

Efficient dynamic modal load reconstruction using physics-informed Gaussian processes based on frequency-sparse Fourier basis functions [0.0]
This paper presents an efficient dynamic load reconstruction method using physics-informed Gaussian processes (GP) The GP's covariance matrices are built using the description of the system dynamics, and the model is trained using structural response measurements. The developed model holds potential for applications in structural health monitoring, damage prognosis, and load model validation.
arXiv Detail & Related papers (2025-03-12T14:16:27Z)
Learning and Transferring Sparse Contextual Bigrams with Linear Transformers [47.37256334633102]
We introduce the Sparse Con Bigram model, where the next token's generation depends on a sparse set of earlier positions determined by the last token. We analyze the training dynamics and sample complexity of learning SCB using a one-layer linear transformer with a gradient-based algorithm. We prove that, provided a nontrivial correlation between the downstream and pretraining tasks, finetuning from a pretrained model allows us to bypass the initial sample-intensive stage.
arXiv Detail & Related papers (2024-10-30T20:29:10Z)
State-space models can learn in-context by gradient descent [1.3087858009942543]
We show that state-space models can perform gradient-based learning and use it for in-context learning in much the same way as transformers. Specifically, we prove that a single structured state-space model layer, augmented with multiplicative input and output gating, can reproduce the outputs of an implicit linear model. We also provide novel insights into the relationship between state-space models and linear self-attention, and their ability to learn in-context.
arXiv Detail & Related papers (2024-10-15T15:22:38Z)
On the Inductive Bias of Stacking Towards Improving Reasoning [50.225873619537765]
We propose a variant of gradual stacking called MIDAS that can speed up language model training by up to 40%. MIDAS is not only training-efficient but surprisingly also has an inductive bias towards improving downstream tasks. We conjecture the underlying reason for this inductive bias by exploring the connection of stacking to looped models.
arXiv Detail & Related papers (2024-09-27T17:58:21Z)
Non-asymptotic Convergence of Training Transformers for Next-token Prediction [48.9399496805422]
Transformers have achieved extraordinary success in modern machine learning due to their excellent ability to handle sequential data. This paper provides a fine-grained non-asymptotic analysis of the training dynamics of a one-layer transformer. We show that the trained transformer presents non-token prediction ability with dataset shift.
arXiv Detail & Related papers (2024-09-25T20:22:06Z)
Strengthening Structural Inductive Biases by Pre-training to Perform Syntactic Transformations [75.14793516745374]
We propose to strengthen the structural inductive bias of a Transformer by intermediate pre-training. Our experiments confirm that this helps with few-shot learning of syntactic tasks such as chunking. Our analysis shows that the intermediate pre-training leads to attention heads that keep track of which syntactic transformation needs to be applied to which token.
arXiv Detail & Related papers (2024-07-05T14:29:44Z)
Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms as low-rank computation have impressive performance for learning Transformer-based adaption. We analyze how magnitude-based models affect generalization while improving adaption. We conclude that proper magnitude-based has a slight on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z)
Trained Transformers Learn Linear Models In-Context [39.56636898650966]
Attention-based neural networks as transformers have demonstrated a remarkable ability to exhibit inattention learning (ICL) We show that when transformer training over random instances of linear regression problems, these models' predictions mimic nonlinear of ordinary squares.
arXiv Detail & Related papers (2023-06-16T15:50:03Z)
Leveraging Pre-trained Models for Failure Analysis Triplets Generation [0.0]
We leverage the attention mechanism of pre-trained causal language models such as Transformer model for the downstream task of generating Failure Analysis Triplets (FATs) We observe that Generative Pre-trained Transformer 2 (GPT2) outperformed other transformer model for the failure analysis triplet generation (FATG) task. In particular, we observe that GPT2 (trained on 1.5B parameters) outperforms pre-trained BERT, BART and GPT3 by a large margin on ROUGE.
arXiv Detail & Related papers (2022-10-31T17:21:15Z)
Pretrained Transformers as Universal Computation Engines [105.00539596788127]
We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning. We study finetuning it on a variety of sequence classification tasks spanning numerical computation, vision, and protein fold prediction. We find that such pretraining enables FPT to generalize in zero-shot to these modalities, matching the performance of a transformer fully trained on these tasks.
arXiv Detail & Related papers (2021-03-09T06:39:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.