What they do when in doubt: a study of inductive biases in seq2seq learners
- URL: http://arxiv.org/abs/2006.14953v2
- Date: Mon, 29 Mar 2021 09:43:36 GMT
- Title: What they do when in doubt: a study of inductive biases in seq2seq learners
- Authors: Eugene Kharitonov and Rahma Chaabouni
- Abstract summary: We study how popular seq2seq learners generalize in tasks that have high ambiguity in the training data.
We connect to Solomonoff's theory of induction and propose to use description length as a principled and sensitive measure of inductive biases.
- Score: 22.678902168856624
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sequence-to-sequence (seq2seq) learners are widely used, but we still have
only limited knowledge about what inductive biases shape the way they
generalize. We address that by investigating how popular seq2seq learners
generalize in tasks that have high ambiguity in the training data. We use SCAN
and three new tasks to study learners' preferences for memorization,
arithmetic, hierarchical, and compositional reasoning. Further, we connect to
Solomonoff's theory of induction and propose to use description length as a
principled and sensitive measure of inductive biases.
In our experimental study, we find that LSTM-based learners can learn to
perform counting, addition, and multiplication by a constant from a single
training example. Furthermore, Transformer and LSTM-based learners show a bias
toward the hierarchical induction over the linear one, while CNN-based learners
prefer the opposite. On the SCAN dataset, we find that CNN-based, and, to a
lesser degree, Transformer- and LSTM-based learners have a preference for
compositional generalization over memorization. Finally, across all our
experiments, description length proved to be a sensitive measure of inductive
biases.
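The abstract does not spell out how description length is measured; its framing suggests a prequential (online) coding scheme, in which each example is encoded using a model trained on the examples before it. Below is a minimal Python sketch under that assumption; `make_learner`, `.fit`, and `.prob` are hypothetical stand-ins for any seq2seq learner, not the authors' code.

```python
import math

def prequential_description_length(examples, make_learner):
    """Estimate the description length (in bits) of (input, output) pairs
    under a learner via prequential (online) coding: each example is
    encoded with a model fit on all earlier examples, then added to the
    training set.

    make_learner is a hypothetical factory returning an object with
    .fit(pairs) and .prob(x, y); it stands in for any seq2seq learner.
    """
    total_bits, seen = 0.0, []
    for x, y in examples:
        learner = make_learner()
        if seen:
            learner.fit(seen)
        p = max(learner.prob(x, y), 1e-12)  # guard against log(0)
        total_bits += -math.log2(p)
        seen.append((x, y))
    return total_bits
```

Under such a measure, a learner's preference between two candidate generalizations of an ambiguous training set (say, memorization versus a compositional rule) shows up as a difference in codelength rather than a binary accuracy, which is what would make it a sensitive probe of inductive bias.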
Related papers
- A distributional simplicity bias in the learning dynamics of transformers [50.91742043564049]
We show that transformers, trained on natural language data, also display a simplicity bias.
Specifically, they sequentially learn many-body interactions among input tokens, reaching a saturation point in the prediction error for low-degree interactions.
This approach opens up the possibilities of studying how interactions of different orders in the data affect learning, in natural language processing and beyond.
arXiv Detail & Related papers (2024-10-25T15:39:34Z)
- Dynamics of Supervised and Reinforcement Learning in the Non-Linear Perceptron [3.069335774032178]
We use a dataset-process approach to derive flow equations describing learning.
We characterize the effects of the learning rule (supervised or reinforcement learning, SL/RL) and input-data distribution on the perceptron's learning curve.
This approach points a way toward analyzing learning dynamics for more-complex circuit architectures.
arXiv Detail & Related papers (2024-09-05T17:58:28Z)
- What Languages are Easy to Language-Model? A Perspective from Learning Probabilistic Regular Languages [78.1866280652834]
Large language models (LMs) are distributions over strings.
We investigate the learnability of regular LMs (RLMs) by RNN and Transformer LMs.
We find that the rank of the RLM is a strong and significant predictor of learnability for both RNNs and Transformers.
arXiv Detail & Related papers (2024-06-06T17:34:24Z)
- Neural networks trained with SGD learn distributions of increasing complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics and exploit higher-order statistics only later in training.
We discuss the relation of this distributional simplicity bias (DSB) to other simplicity biases and consider its implications for the principle of universality in learning.
arXiv Detail & Related papers (2022-11-21T15:27:22Z)
- Measures of Information Reflect Memorization Patterns [53.71420125627608]
We show that the diversity in the activation patterns of different neurons is reflective of model generalization and memorization.
Importantly, we discover that information organization points to two distinct forms of memorization, even for neural activations computed on unlabelled in-distribution examples.
arXiv Detail & Related papers (2022-10-17T20:15:24Z)
- Evading the Simplicity Bias: Training a Diverse Set of Models Discovers Solutions with Superior OOD Generalization [93.8373619657239]
Neural networks trained with SGD were recently shown to rely preferentially on linearly-predictive features.
This simplicity bias can explain their lack of robustness out of distribution (OOD).
We demonstrate that the simplicity bias can be mitigated and OOD generalization improved.
arXiv Detail & Related papers (2021-05-12T12:12:24Z)
- LIME: Learning Inductive Bias for Primitives of Mathematical Reasoning [30.610670366488943]
We replace architecture engineering with encoding inductive bias in datasets.
Inspired by Peirce's view that deduction, induction, and abduction form an irreducible set of reasoning primitives, we design three synthetic tasks that are intended to require the model to have these three abilities.
Models trained with LIME significantly outperform vanilla transformers on three very different large mathematical reasoning benchmarks.
arXiv Detail & Related papers (2021-01-15T17:15:24Z)
- Learning from Failure: Training Debiased Classifier from Biased Classifier [76.52804102765931]
We show that neural networks learn to rely on spurious correlation only when it is "easier" to learn than the desired knowledge.
We propose a failure-based debiasing scheme that trains a pair of neural networks simultaneously (see the sketch after this list).
Our method significantly improves the trained network's robustness to various types of biases in both synthetic and real-world datasets.
arXiv Detail & Related papers (2020-07-06T07:20:29Z)
- Universal linguistic inductive biases via meta-learning [36.43388942327124]
It is unclear which inductive biases can explain observed patterns in language acquisition.
We introduce a framework for giving linguistic inductive biases to a neural network model.
We demonstrate this framework with a case study based on syllable structure.
arXiv Detail & Related papers (2020-06-29T19:15:10Z)
- Rethink the Connections among Generalization, Memorization and the Spectral Bias of DNNs [44.5823185453399]
We show that the monotonicity of the learning bias does not always hold.
Under the experimental setup of deep double descent, the high-frequency components of DNNs diminish in the late stage of training.
arXiv Detail & Related papers (2020-04-29T04:24:25Z)
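The failure-based debiasing scheme summarized under "Learning from Failure" above is concrete enough for a short sketch. The following PyTorch training step is an illustration under assumptions, not the authors' implementation: the generalized cross-entropy (GCE) loss for the biased network and the relative-difficulty weighting for the debiased one follow the paper's high-level description, while `q` and the exact weighting details are guesses.

```python
import torch
import torch.nn.functional as F

def lff_step(biased, debiased, opt_b, opt_d, x, y, q=0.7):
    """One step of failure-based debiasing (a sketch, not reference code).
    The biased network is trained with generalized cross entropy (GCE),
    which amplifies reliance on easy-to-learn spurious features; the
    debiased network upweights examples the biased network finds hard."""
    # Biased model: GCE loss (1 - p_y^q) / q emphasizes easy examples.
    p_y = F.softmax(biased(x), dim=1).gather(1, y.unsqueeze(1)).squeeze(1)
    loss_b = ((1.0 - p_y.clamp_min(1e-8) ** q) / q).mean()
    opt_b.zero_grad()
    loss_b.backward()
    opt_b.step()

    # Debiased model: per-example cross entropy weighted by the biased
    # model's relative difficulty, so bias-conflicting examples dominate.
    with torch.no_grad():
        ce_b = F.cross_entropy(biased(x), y, reduction="none")
    ce_d = F.cross_entropy(debiased(x), y, reduction="none")
    w = ce_b / (ce_b + ce_d.detach() + 1e-8)
    loss_d = (w * ce_d).mean()
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()
    return loss_b.item(), loss_d.item()
```

The key design choice is that the two networks see the same batch: the biased one chases whatever is easiest, and its per-example loss then serves as a difficulty signal that reweights the debiased network's objective.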