Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration
- URL: http://arxiv.org/abs/2412.03773v1
- Date: Wed, 04 Dec 2024 23:29:07 GMT
- Title: Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration
- Authors: Chun Hei Yip, Rajashree Agrawal, Lawrence Chan, Jason Gross
- Abstract summary: We present the first case study in rigorously compressing nonlinear feature-maps.
We target a non-vacuous bound on the behaviour of the ReLU MLP in time linear in the parameter-count of the circuit.
- Score: 1.7679702431368263
- Abstract: The goal of mechanistic interpretability is discovering simpler, low-rank algorithms implemented by models. While we can compress activations into features, compressing nonlinear feature-maps -- like MLP layers -- is an open problem. In this work, we present the first case study in rigorously compressing nonlinear feature-maps, which are the leading asymptotic bottleneck to compressing small transformer models. We work in the classic setting of the modular addition models, and target a non-vacuous bound on the behaviour of the ReLU MLP in time linear in the parameter-count of the circuit. To study the ReLU MLP analytically, we use the infinite-width lens, which turns post-activation matrix multiplications into approximate integrals. We discover a novel interpretation of the MLP layer in one-layer transformers implementing the ``pizza'' algorithm: the MLP can be understood as evaluating a quadrature scheme, where each neuron computes the area of a rectangle under the curve of a trigonometric integral identity. Our code is available at https://tinyurl.com/mod-add-integration.
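To make the quadrature interpretation concrete, here is a minimal numerical sketch (not taken from the paper's code) of the kind of trigonometric integral identity the abstract describes: a width-N ReLU layer whose neurons sit at evenly spaced phases phi_k, followed by a cosine readout, acts as a rectangle rule for the integral of ReLU(cos(theta - phi)) * cos(phi) over a full period, which equals (pi/2) * cos(theta). The specific identity, the uniform phase spacing, and names like quadrature_mlp and n_neurons are illustrative assumptions, not the paper's exact construction.

    import numpy as np

    # Hedged sketch: a finite-width ReLU layer as a rectangle-rule quadrature.
    # Assumed identity (illustrative, not quoted from the paper):
    #   int_0^{2*pi} ReLU(cos(theta - phi)) * cos(phi) dphi = (pi/2) * cos(theta)
    # Each "neuron" k sits at phase phi_k and contributes the area of one rectangle.

    def quadrature_mlp(theta, n_neurons=128):
        """Approximate (pi/2)*cos(theta) with n_neurons rectangles."""
        phis = 2 * np.pi * np.arange(n_neurons) / n_neurons  # neuron phases
        d_phi = 2 * np.pi / n_neurons                        # rectangle width
        hidden = np.maximum(np.cos(theta - phis), 0.0)       # post-ReLU activations
        readout = np.cos(phis)                               # output weights
        return np.sum(hidden * readout) * d_phi              # Riemann sum

    theta = 1.3
    print(quadrature_mlp(theta))          # quadrature estimate
    print((np.pi / 2) * np.cos(theta))    # exact value of the integral

In the modular-addition setting, theta would play the role of a combined input phase such as 2*pi*k*(a+b)/p, so recovering cos(theta) from post-ReLU activations is what lets an output layer read off a+b mod p; that mapping is an assumption of this sketch rather than a statement of the paper's exact circuit.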
Related papers
- Converting MLPs into Polynomials in Closed Form [0.7234862895932991]
We derive theoretically closed-form least-squares optimal approximations of feedforward networks.
We show that quadratic approximants can be used to create SVD-based adversarial examples.
arXiv Detail & Related papers (2025-02-03T03:54:41Z)
- Partially Rewriting a Transformer in Natural Language [0.7234862895932991]
We attempt to partially rewrite a large language model using simple natural language explanations.
We replace the first layer of this sparse model with an LLM-based simulator, which predicts the activation of each neuron.
We measure the degree to which these modifications distort the model's final output.
arXiv Detail & Related papers (2025-01-31T01:12:50Z)
- From MLP to NeoMLP: Leveraging Self-Attention for Neural Fields [26.659511924272962]
We develop a new type of connectionism based on hidden and scalable nodes, called NeoMLP.
We demonstrate the effectiveness of our method by fitting high-resolution signals, including multi-modal audio-visual data.
arXiv Detail & Related papers (2024-12-11T19:01:38Z)
- MLP Can Be A Good Transformer Learner [73.01739251050076]
The self-attention mechanism is key to the Transformer but is often criticized for its computational demands.
This paper introduces a novel strategy that simplifies vision transformers and reduces computational load through the selective removal of non-essential attention layers.
arXiv Detail & Related papers (2024-04-08T16:40:15Z)
- Data-free Weight Compress and Denoise for Large Language Models [101.53420111286952]
We propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices.
We achieve a model pruning of 80% parameters while retaining 93.43% of the original performance without any calibration data.
arXiv Detail & Related papers (2024-02-26T05:51:47Z)
- MLIC++: Linear Complexity Multi-Reference Entropy Modeling for Learned Image Compression [30.71965784982577]
We introduce MEM++, which captures a diverse range of correlations inherent in the latent representation.
MEM++ achieves state-of-the-art performance, reducing BD-rate by 13.39% on the Kodak dataset compared to VTM-17.0 in PSNR.
MLIC++ exhibits linear GPU memory consumption with resolution, making it highly suitable for high-resolution image coding.
arXiv Detail & Related papers (2023-07-28T09:11:37Z)
- ReLU Fields: The Little Non-linearity That Could [62.228229880658404]
We investigate what is the smallest change to grid-based representations that allows for retaining the high-fidelity results of MLPs.
We show that such an approach becomes competitive with the state-of-the-art.
arXiv Detail & Related papers (2022-05-22T13:42:31Z)
- Rethinking Network Design and Local Geometry in Point Cloud: A Simple Residual MLP Framework [55.40001810884942]
We introduce a pure residual network, called PointMLP, which integrates no sophisticated local geometrical extractors but still performs very competitively.
On the real-world ScanObjectNN dataset, our method even surpasses the prior best method by 3.3% accuracy.
Compared to the most recent CurveNet, PointMLP trains 2x faster, tests 7x faster, and is more accurate on the ModelNet40 benchmark.
arXiv Detail & Related papers (2022-02-15T01:39:07Z)
- Unfolding Projection-free SDP Relaxation of Binary Graph Classifier via GDPA Linearization [59.87663954467815]
Algorithm unfolding creates an interpretable and parsimonious neural network architecture by implementing each iteration of a model-based algorithm as a neural layer.
In this paper, leveraging a recent linear algebraic theorem called Gershgorin disc perfect alignment (GDPA), we unroll a projection-free algorithm for the semi-definite programming relaxation (SDR) of a binary graph classifier.
Experimental results show that our unrolled network outperformed pure model-based graph classifiers, and achieved performance comparable to pure data-driven networks while using far fewer parameters.
arXiv Detail & Related papers (2021-09-10T07:01:15Z)
- Hybrid Trilinear and Bilinear Programming for Aligning Partially Overlapping Point Sets [85.71360365315128]
In many applications, we need algorithms that can align partially overlapping point sets and are invariant to the corresponding transformation, such as the RPM algorithm.
We first show that the objective is a cubic polynomial function. We then utilize the convex envelopes of trilinear and bilinear monomials to derive its lower bound.
We next develop a branch-and-bound (BnB) algorithm which only branches over the transformation variables and runs efficiently.
arXiv Detail & Related papers (2021-01-19T04:24:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.