Transferring Inductive Biases through Knowledge Distillation
- URL: http://arxiv.org/abs/2006.00555v3
- Date: Sun, 4 Oct 2020 19:57:06 GMT
- Title: Transferring Inductive Biases through Knowledge Distillation
- Authors: Samira Abnar and Mostafa Dehghani and Willem Zuidema
- Abstract summary: We explore the power of knowledge distillation for transferring the effect of inductive biases from one model to another.
We study the effect of inductive biases on the solutions the models converge to and investigate how and to what extent the effect of inductive biases is transferred through knowledge distillation.
- Score: 21.219305008067735
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Having the right inductive biases can be crucial in many tasks or scenarios
where data or computing resources are a limiting factor, or where training data
is not perfectly representative of the conditions at test time. However,
defining, designing and efficiently adapting inductive biases is not
necessarily straightforward. In this paper, we explore the power of knowledge
distillation for transferring the effect of inductive biases from one model to
another. We consider families of models with different inductive biases, LSTMs
vs. Transformers and CNNs vs. MLPs, in the context of tasks and scenarios where
having the right inductive biases is critical. We study the effect of inductive
biases on the solutions the models converge to and investigate how and to what
extent the effect of inductive biases is transferred through knowledge
distillation, in terms of not only performance but also different aspects of
converged solutions.
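As a rough illustration of the distillation setup described in the abstract, the sketch below trains a student with one inductive bias on a mixture of ground-truth labels and a teacher's temperature-softened predictions. The temperature, mixing weight, and helper names are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal knowledge-distillation sketch (illustrative assumptions, not the
# paper's exact setup): a student is trained on a mixture of the true labels
# and the teacher's temperature-softened predictions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Hard-label term: ordinary cross-entropy with the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between temperature-softened teacher
    # and student distributions (scaled by T^2 so gradient magnitudes stay
    # comparable across temperatures).
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * soft + (1.0 - alpha) * hard

# Usage sketch: the teacher (e.g. a CNN or LSTM) is frozen; only the student
# (e.g. an MLP or Transformer) receives gradients.
# with torch.no_grad():
#     teacher_logits = teacher(x)
# loss = distillation_loss(student(x), teacher_logits, y)
```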
Related papers
- MIRAGE: Evaluating and Explaining Inductive Reasoning Process in Language Models [19.81485079689837]
We evaluate large language models' capabilities in inductive and deductive stages.
We find that the models tend to consistently conduct correct deduction without correct inductive rules.
In the inductive reasoning process, the model tends to focus on observed facts that are close to the current test example in feature space.
arXiv Detail & Related papers (2024-10-12T14:12:36Z)
- Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis [82.51626700527837]
Chain-of-thought (CoT) is an efficient method that enables the reasoning ability of large language models by augmenting the query with examples containing multiple intermediate steps.
We show that in-context learning without intermediate steps can fail to provide accurate generalization even when CoT does.
arXiv Detail & Related papers (2024-10-03T03:12:51Z)
- Towards Exact Computation of Inductive Bias [8.988109761916379]
We propose a novel method for efficiently computing the inductive bias required for generalization on a task.
We show that higher dimensional tasks require greater inductive bias.
Our proposed inductive bias metric provides an information-theoretic interpretation of the benefits of specific model architectures.
arXiv Detail & Related papers (2024-06-22T21:14:24Z)
- Tripod: Three Complementary Inductive Biases for Disentangled Representation Learning [52.70210390424605]
In this work, we consider endowing a neural network autoencoder with three select inductive biases from the literature.
In practice, however, naively combining existing techniques instantiating these inductive biases fails to yield significant benefits.
We propose adaptations to the three techniques that simplify the learning problem, equip key regularization terms with stabilizing invariances, and quash degenerate incentives.
The resulting model, Tripod, achieves state-of-the-art results on a suite of four image disentanglement benchmarks.
arXiv Detail & Related papers (2024-04-16T04:52:41Z)
- Instilling Inductive Biases with Subnetworks [19.444844580405594]
Subtask Induction instills an inductive bias towards solutions that utilize a particular subtask.
We show that Subtask Induction significantly reduces the amount of training data required for a model to adopt a specific, generalizable solution.
arXiv Detail & Related papers (2023-10-17T00:12:19Z)
- SIP: Injecting a Structural Inductive Bias into a Seq2Seq Model by Simulation [75.14793516745374]
We show how a structural inductive bias can be efficiently injected into a seq2seq model by pre-training it to simulate structural transformations on synthetic data.
Our experiments show that our method imparts the desired inductive bias, resulting in better few-shot learning for FST-like tasks.
arXiv Detail & Related papers (2023-10-01T21:19:12Z)
- Distilling Inductive Bias: Knowledge Distillation Beyond Model Compression [6.508088032296086]
Vision Transformers (ViTs) offer the tantalizing prospect of unified information processing across visual and textual domains.
We introduce an innovative ensemble-based distillation approach that distills inductive biases from complementary lightweight teacher models.
Our proposed framework also involves precomputing and storing logits, the unnormalized predictions of the models, in advance; a minimal sketch of this caching idea appears after this list.
arXiv Detail & Related papers (2023-09-30T13:21:29Z)
- Equivariance and Invariance Inductive Bias for Learning from Insufficient Data [65.42329520528223]
We show why insufficient data makes a model more easily biased towards the limited training environments, which usually differ from those encountered at test time.
We propose a class-wise invariant risk minimization (IRM) that efficiently tackles the challenge of missing environmental annotation in conventional IRM.
arXiv Detail & Related papers (2022-07-25T15:26:19Z)
- Agree to Disagree: Diversity through Disagreement for Better Transferability [54.308327969778155]
We propose D-BAT (Diversity-By-disAgreement Training), which enforces agreement among the models on the training data but disagreement on out-of-distribution data.
We show how D-BAT naturally emerges from the notion of generalized discrepancy.
arXiv Detail & Related papers (2022-02-09T12:03:02Z)
- Evading the Simplicity Bias: Training a Diverse Set of Models Discovers Solutions with Superior OOD Generalization [93.8373619657239]
Neural networks trained with SGD were recently shown to rely preferentially on linearly-predictive features.
This simplicity bias can explain their lack of robustness out of distribution (OOD).
We demonstrate that the simplicity bias can be mitigated and OOD generalization improved.
arXiv Detail & Related papers (2021-05-12T12:12:24Z)
- LIME: Learning Inductive Bias for Primitives of Mathematical Reasoning [30.610670366488943]
We replace architecture engineering by encoding inductive bias in datasets.
Inspired by Peirce's view that deduction, induction, and abduction form an irreducible set of reasoning primitives, we design three synthetic tasks that are intended to require the model to have these three abilities.
Models trained with LIME significantly outperform vanilla transformers on three very different large mathematical reasoning benchmarks.
arXiv Detail & Related papers (2021-01-15T17:15:24Z)
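As a rough illustration of the precomputed-logits idea mentioned in the "Distilling Inductive Bias" entry above, the sketch below caches averaged teacher logits once so that the teachers never have to run during student training. The file path, the ensemble averaging, and the usage comments are assumptions for illustration, not the cited paper's code.

```python
# Illustrative sketch (assumed details, not the cited paper's implementation)
# of precomputing and storing teacher logits for offline distillation.
import torch

@torch.no_grad()
def precompute_teacher_logits(teachers, loader, path="teacher_logits.pt"):
    cached = []
    for x, _ in loader:
        # Average the unnormalized predictions of the ensemble of
        # lightweight teachers for each batch.
        logits = torch.stack([t(x) for t in teachers]).mean(dim=0)
        cached.append(logits.cpu())
    # Store all logits in one tensor; assumes the loader iterates in a fixed
    # order so the cached rows stay aligned with the training examples.
    torch.save(torch.cat(cached), path)

# During student training, the stored logits are loaded once and indexed by
# example, so distillation adds no teacher forward passes:
# teacher_logits = torch.load("teacher_logits.pt")
# loss = distillation_loss(student(x), teacher_logits[batch_indices], y)
```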
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.