Weight-based Decomposition: A Case for Bilinear MLPs
- URL: http://arxiv.org/abs/2406.03947v1
- Date: Thu, 6 Jun 2024 10:46:51 GMT
- Title: Weight-based Decomposition: A Case for Bilinear MLPs
- Authors: Michael T. Pearce, Thomas Dooms, Alice Rigg
- Abstract summary: Gated Linear Units (GLUs) have become a common building block in modern foundation models.
Bilinear layers drop the non-linearity in the "gate" but still have comparable performance to other GLUs.
We develop a method to decompose the bilinear tensor into a set of interacting eigenvectors.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Gated Linear Units (GLUs) have become a common building block in modern foundation models. Bilinear layers drop the non-linearity in the "gate" but still have comparable performance to other GLUs. An attractive quality of bilinear layers is that they can be fully expressed in terms of a third-order tensor and linear operations. Leveraging this, we develop a method to decompose the bilinear tensor into a set of sparsely interacting eigenvectors that show promising interpretability properties in preliminary experiments for shallow image classifiers (MNIST) and small language models (Tiny Stories). Since the decomposition is fully equivalent to the model's original computations, bilinear layers may be an interpretability-friendly architecture that helps connect features to the model weights. Application of our method may not be limited to pretrained bilinear models since we find that language models such as TinyLlama-1.1B can be finetuned into bilinear variants.
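The computation behind this is simple enough to sketch directly. The snippet below is a minimal illustration in PyTorch (shapes, scaling, and variable names are assumptions, not the authors' code): a bilinear layer is a per-unit quadratic form in the input, so any readout direction over the hidden units defines a symmetric matrix whose eigenvectors re-express the layer exactly.

```python
import torch

# Minimal sketch (not the authors' released code): a bilinear layer computes
# h = (W x) * (V x) elementwise, so h_k = sum_{i,j} W_ki V_kj x_i x_j,
# i.e. each hidden unit is a quadratic form in the input.
d_in, d_hidden = 16, 32
W = torch.randn(d_hidden, d_in) / d_in**0.5
V = torch.randn(d_hidden, d_in) / d_in**0.5

def bilinear(x):
    return (W @ x) * (V @ x)  # the "gate" branch has no non-linearity

# For a readout direction u over hidden units, u . h(x) = x^T Q x with
# Q[i, j] = sum_k u_k W_ki V_kj; only the symmetric part of Q matters.
u = torch.randn(d_hidden)
Q = torch.einsum('k,ki,kj->ij', u, W, V)
Q_sym = 0.5 * (Q + Q.T)

# Eigendecomposition of Q_sym yields orthogonal input directions whose squared
# projections reconstruct the readout exactly -- the "interacting eigenvectors".
eigvals, eigvecs = torch.linalg.eigh(Q_sym)

x = torch.randn(d_in)
direct = u @ bilinear(x)
via_eig = (eigvals * (eigvecs.T @ x) ** 2).sum()
assert torch.allclose(direct, via_eig, atol=1e-4)
```

Because nothing in this rewriting is approximate, the eigenvector view is an exact re-expression of the layer, which is the sense in which the decomposition is "fully equivalent to the model's original computations".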
Related papers
- Bilinear MLPs enable weight-based mechanistic interpretability [0.0]
Bilinear layers serve as an interpretable drop-in replacement for current activation functions.
Weight-based interpretability is viable for understanding deep-learning models.
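As a rough illustration of what "drop-in replacement" means here (a sketch under assumed shapes, not code from the paper), a bilinear MLP keeps the two up-projections and the down-projection of a gated MLP such as SwiGLU and only removes the gate activation:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes for illustration only.
d_model, d_hidden = 64, 128
W = torch.randn(d_hidden, d_model) / d_model**0.5
V = torch.randn(d_hidden, d_model) / d_model**0.5
P = torch.randn(d_model, d_hidden) / d_hidden**0.5  # down-projection

def swiglu_mlp(x):
    return P @ (F.silu(W @ x) * (V @ x))  # conventional gated MLP

def bilinear_mlp(x):
    return P @ ((W @ x) * (V @ x))        # same shapes, gate non-linearity removed
```

The swap preserves parameter count and interface, which is why the layer can replace existing activation functions without architectural changes.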
arXiv Detail & Related papers (2024-10-10T23:22:11Z) - Scaling Laws for Linear Complexity Language Models [18.787664489713332]
We present the scaling laws for linear complexity language models to establish a foundation for their scalability.
The study reveals that existing linear complexity language models exhibit similar scaling capabilities as conventional transformer-based models.
arXiv Detail & Related papers (2024-06-24T14:51:31Z) - A technical note on bilinear layers for interpretability [0.0]
Bilinear layers are a type of layer that is mathematically much easier to analyze.
We can integrate this expression for bilinear layers into a mathematical framework for transformer circuits.
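The expression referred to can be written out in one line (a plausible reconstruction of the standard form, not a quote from the note):

```latex
h_k(x) \;=\; (Wx)_k\,(Vx)_k \;=\; \sum_{i,j} B_{kij}\, x_i x_j,
\qquad B_{kij} \;=\; W_{ki} V_{kj}.
```

Since the layer is quadratic in $x$ and linear in the third-order tensor $B$, composing it with the purely linear parts of a transformer keeps everything expressible as tensor contractions, which is what allows it to slot into a transformer-circuits style analysis.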
arXiv Detail & Related papers (2023-05-05T11:56:26Z) - BELIEF in Dependence: Leveraging Atomic Linearity in Data Bits for Rethinking Generalized Linear Models [6.435660232678891]
We develop a framework called binary expansion linear effect (BELIEF) for understanding arbitrary relationships with a binary outcome.
Models from the BELIEF framework are easily interpretable because they describe the association of binary variables in the language of linear models.
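As a very rough, hypothetical illustration of the binary-expansion idea (not the BELIEF authors' procedure, and with all design choices assumed): take the leading binary digits of each predictor on the unit interval and fit an ordinary linear model on those bits and their cross products.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n = 500
x1, x2 = rng.uniform(size=n), rng.uniform(size=n)    # already uniform, so no rank transform
y = ((x1 - 0.5) * (x2 - 0.5) > 0).astype(float)      # non-linear dependence, binary outcome

def binary_digits(u, depth=2):
    """Leading `depth` binary digits of values in [0, 1), recoded as +/-1."""
    digits = []
    for _ in range(depth):
        u = 2 * u
        bit = (u >= 1).astype(float)
        u = u - bit
        digits.append(2 * bit - 1)
    return np.stack(digits, axis=1)

# Design matrix: individual bits plus pairwise bit products across predictors.
bits = np.hstack([binary_digits(x1), binary_digits(x2)])
pairs = [bits[:, i] * bits[:, j] for i, j in combinations(range(bits.shape[1]), 2)]
X = np.hstack([np.ones((n, 1)), bits, np.stack(pairs, axis=1)])

# Least squares on bit features: each coefficient reads as an association
# between binary variables, which is the "language of linear models" above.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
```

In this toy example the product of the two leading bits carries essentially all of the signal, so the fitted coefficients make the non-linear dependence directly readable.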
arXiv Detail & Related papers (2022-10-19T19:28:09Z) - Graph Polynomial Convolution Models for Node Classification of Non-Homophilous Graphs [52.52570805621925]
We investigate efficient learning from higher-order graph convolution and learning directly from the adjacency matrix for node classification.
We show that the resulting model leads to new graphs and a residual scaling parameter.
We demonstrate that the proposed methods obtain improved accuracy for node classification of non-homophilous graphs.
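A generic sketch of the polynomial-convolution idea follows (the normalization, coefficients, and the residual scaling parameter here are illustrative assumptions, not the paper's exact model):

```python
import numpy as np

def polynomial_graph_conv(A, X, thetas, alpha=0.5):
    """Aggregate node features with a polynomial in the normalized adjacency matrix.

    A: (n, n) adjacency, X: (n, d) node features, thetas: coefficients for A^k,
    alpha: an illustrative residual scaling parameter mixing in the raw features.
    """
    deg = A.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
    A_norm = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

    out = np.zeros_like(X, dtype=float)
    power = np.eye(A.shape[0])
    for theta in thetas:                      # sum_k theta_k * A_norm^k @ X
        out = out + theta * (power @ X)
        power = power @ A_norm
    return alpha * X + (1 - alpha) * out      # residual mix with the input features
```

Higher powers of the adjacency matrix let a node aggregate from multi-hop neighborhoods, which is one way such models cope with non-homophilous graphs.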
arXiv Detail & Related papers (2022-09-12T04:46:55Z) - Linear Connectivity Reveals Generalization Strategies [54.947772002394736]
Some pairs of finetuned models have large barriers of increasing loss on the linear paths between them.
We find distinct clusters of models which are linearly connected on the test loss surface, but are disconnected from models outside the cluster.
Our work demonstrates how the geometry of the loss surface can guide models towards different functions.
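A minimal sketch of how such barriers are typically measured (an illustration under assumed interfaces, not the authors' evaluation code): interpolate linearly between two finetuned checkpoints and record the loss along the path.

```python
import copy
import torch

def loss_along_linear_path(model_a, model_b, loss_fn, data, steps=11):
    """Evaluate a loss at evenly spaced points on the segment between two models.

    A bump well above the endpoint losses indicates a barrier; a flat profile
    suggests the two models lie in a linearly connected region.
    """
    params_a = [p.detach().clone() for p in model_a.parameters()]
    params_b = [p.detach().clone() for p in model_b.parameters()]
    probe = copy.deepcopy(model_a)
    losses = []
    for t in torch.linspace(0.0, 1.0, steps):
        with torch.no_grad():
            for p, a, b in zip(probe.parameters(), params_a, params_b):
                p.copy_((1 - t) * a + t * b)
        losses.append(loss_fn(probe, data).item())
    return losses
```

Grouping models by whether the path between them stays flat is the kind of test that separates the linearly connected clusters described above.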
arXiv Detail & Related papers (2022-05-24T23:43:02Z) - Using Low-rank Representation of Abundance Maps and Nonnegative Tensor Factorization for Hyperspectral Nonlinear Unmixing [28.064111391414773]
We propose a nonlinear low-rank tensor unmixing algorithm to solve the generalized bilinear model (GBM).
Specifically, the linear and nonlinear parts of the GBM can both be expressed as tensors.
Low-rank structures of abundance maps and nonlinear interaction maps are exploited by minimizing their nuclear norm.
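For context, the generalized bilinear model for one pixel's spectrum is usually written roughly as follows (a standard formulation from the unmixing literature, reconstructed from memory rather than quoted from this paper):

```latex
\mathbf{y} \;=\; \sum_{r=1}^{R} a_r\,\mathbf{m}_r
\;+\; \sum_{i=1}^{R-1}\sum_{j=i+1}^{R} \gamma_{ij}\, a_i a_j\,(\mathbf{m}_i \odot \mathbf{m}_j)
\;+\; \mathbf{n},
```

where the $\mathbf{m}_r$ are endmember spectra, the $a_r$ abundances, the $\gamma_{ij}$ nonlinear interaction coefficients, and $\mathbf{n}$ noise. Stacking the abundance and interaction coefficients over all pixels gives the maps whose low-rank structure the nuclear-norm penalties exploit.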
arXiv Detail & Related papers (2021-03-30T09:37:25Z) - Bilinear Classes: A Structural Framework for Provable Generalization in RL [119.42509700822484]
Bilinear Classes is a new structural framework which permits generalization in reinforcement learning.
The framework incorporates nearly all existing models in which a polynomial sample complexity is achievable.
Our main result provides an RL algorithm with polynomial sample complexity for Bilinear Classes.
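Roughly, and with norms, offsets, and indices simplified (this is a paraphrase of the structural condition, not the paper's exact statement), a hypothesis class is bilinear when average Bellman errors factor through a low-dimensional inner product:

```latex
\Bigl|\,\mathbb{E}_{\pi_g}\bigl[\mathcal{E}_h(f)\bigr]\Bigr|
\;\le\; \bigl|\bigl\langle W_h(f) - W_h(f^\star),\; X_h(g)\bigr\rangle\bigr|,
\qquad W_h(\cdot),\,X_h(\cdot) \in \mathbb{R}^d,
```

where $\mathcal{E}_h(f)$ denotes the Bellman error of hypothesis $f$ at step $h$ and the expectation is under data collected with the policy of $g$. Controlling a $d$-dimensional bilinear structure rather than the raw model class is what makes the sample complexity tractable.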
arXiv Detail & Related papers (2021-03-19T16:34:20Z) - Provable Model-based Nonlinear Bandit and Reinforcement Learning: Shelve Optimism, Embrace Virtual Curvature [61.22680308681648]
We show that global convergence is statistically intractable even for a one-layer neural net bandit with a deterministic reward.
For both nonlinear bandit and RL, the paper presents a model-based algorithm, Virtual Ascent with Online Model Learner (ViOL).
arXiv Detail & Related papers (2021-02-08T12:41:56Z) - Non-parametric Models for Non-negative Functions [48.7576911714538]
We provide the first model for non-negative functions that retains the good properties of linear models.
We prove that it admits a representer theorem and provide an efficient dual formulation for convex problems.
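The model in question is, roughly, a quadratic form in a feature map with a positive semidefinite operator (a reconstruction of the standard formulation, so treat the details as assumptions):

```latex
f_A(x) \;=\; \bigl\langle \phi(x),\, A\,\phi(x) \bigr\rangle, \qquad A \succeq 0,
```

which is non-negative pointwise by construction and linear in the parameter $A$, so convex losses stay convex; the representer theorem then expands $A$ on the training features, giving a finite-dimensional convex problem.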
arXiv Detail & Related papers (2020-07-08T07:17:28Z) - Learning Bijective Feature Maps for Linear ICA [73.85904548374575]
We show that existing probabilistic deep generative models (DGMs), which are tailor-made for image data, underperform on non-linear ICA tasks.
To address this, we propose a DGM which combines bijective feature maps with a linear ICA model to learn interpretable latent structures for high-dimensional data.
We create models that converge quickly, are easy to train, and achieve better unsupervised latent factor discovery than flow-based models, linear ICA, and Variational Autoencoders on images.
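A schematic of the combination (a toy sketch with assumed shapes, priors, and a hypothetical invertible `flow` module, not the paper's architecture): a bijective feature map carries images to a space where a plain linear ICA model with a non-Gaussian prior applies.

```python
import torch
import torch.nn as nn

class FlowLinearICA(nn.Module):
    """Toy generative model: non-Gaussian sources -> linear mixing -> bijective map."""

    def __init__(self, dim, flow):
        super().__init__()
        self.mixing = nn.Linear(dim, dim, bias=False)  # linear ICA mixing matrix A
        self.flow = flow                               # invertible map g with .inverse()

    def sample(self, n):
        s = torch.distributions.Laplace(0.0, 1.0).sample((n, self.mixing.in_features))
        return self.flow(self.mixing(s))               # x = g(A s)

    def infer_sources(self, x):
        z = self.flow.inverse(x)                       # undo the bijective feature map
        return torch.linalg.solve(self.mixing.weight, z.T).T  # undo the linear mixing
```

Because the feature map is bijective, no information is discarded before the linear stage, which keeps the linear ICA model responsible for disentangling the latent factors.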
arXiv Detail & Related papers (2020-02-18T17:58:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.