Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration
- URL: http://arxiv.org/abs/2412.03773v1
- Date: Wed, 04 Dec 2024 23:29:07 GMT
- Title: Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration
- Authors: Chun Hei Yip, Rajashree Agrawal, Lawrence Chan, Jason Gross
- Abstract summary: We present the first case study in rigorously compressing nonlinear feature-maps.
We target a non-vacuous bound on the behaviour of the ReLU MLP in time linear in the parameter-count of the circuit.
- Score: 1.7679702431368263
- Abstract: The goal of mechanistic interpretability is discovering simpler, low-rank algorithms implemented by models. While we can compress activations into features, compressing nonlinear feature-maps -- like MLP layers -- is an open problem. In this work, we present the first case study in rigorously compressing nonlinear feature-maps, which are the leading asymptotic bottleneck to compressing small transformer models. We work in the classic setting of the modular addition models, and target a non-vacuous bound on the behaviour of the ReLU MLP in time linear in the parameter-count of the circuit. To study the ReLU MLP analytically, we use the infinite-width lens, which turns post-activation matrix multiplications into approximate integrals. We discover a novel interpretation of the MLP layer in one-layer transformers implementing the ``pizza'' algorithm: the MLP can be understood as evaluating a quadrature scheme, where each neuron computes the area of a rectangle under the curve of a trigonometric integral identity. Our code is available at https://tinyurl.com/mod-add-integration.
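To make the quadrature interpretation concrete, here is a minimal numerical sketch (not taken from the paper's code) of the kind of trigonometric integral identity the abstract describes: a width-N ReLU layer whose neurons sit at evenly spaced phases phi_k, followed by a cosine readout, acts as a rectangle rule for the integral of ReLU(cos(theta - phi)) * cos(phi) over a full period, which equals (pi/2) * cos(theta). The specific identity, the uniform phase spacing, and names like quadrature_mlp and n_neurons are illustrative assumptions, not the paper's exact construction.

    import numpy as np

    # Hedged sketch: a finite-width ReLU layer as a rectangle-rule quadrature.
    # Assumed identity (illustrative, not quoted from the paper):
    #   int_0^{2*pi} ReLU(cos(theta - phi)) * cos(phi) dphi = (pi/2) * cos(theta)
    # Each "neuron" k sits at phase phi_k and contributes the area of one rectangle.

    def quadrature_mlp(theta, n_neurons=128):
        """Approximate (pi/2)*cos(theta) with n_neurons rectangles."""
        phis = 2 * np.pi * np.arange(n_neurons) / n_neurons  # neuron phases
        d_phi = 2 * np.pi / n_neurons                        # rectangle width
        hidden = np.maximum(np.cos(theta - phis), 0.0)       # post-ReLU activations
        readout = np.cos(phis)                               # output weights
        return np.sum(hidden * readout) * d_phi              # Riemann sum

    theta = 1.3
    print(quadrature_mlp(theta))          # quadrature estimate
    print((np.pi / 2) * np.cos(theta))    # exact value of the integral

In the modular-addition setting, theta would play the role of a combined input phase such as 2*pi*k*(a+b)/p, so recovering cos(theta) from post-ReLU activations is what lets an output layer read off a+b mod p; that mapping is an assumption of this sketch rather than a statement of the paper's exact circuit.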
Related papers
- Converting MLPs into Polynomials in Closed Form [0.7234862895932991]
We derive theoretically closed-form least-squares optimal approximations of feedforward networks.
We show that quadratic approximants can be used to create SVD-based adversarial examples.
arXiv Detail & Related papers (2025-02-03T03:54:41Z)
- Partially Rewriting a Transformer in Natural Language [0.7234862895932991]
We attempt to partially rewrite a large language model using simple natural language explanations.
We replace the first layer of this sparse model with an LLM-based simulator, which predicts the activation of each neuron.
We measure the degree to which these modifications distort the model's final output.
arXiv Detail & Related papers (2025-01-31T01:12:50Z)
- From MLP to NeoMLP: Leveraging Self-Attention for Neural Fields [26.659511924272962]
We develop a new type of connectionism based on hidden and scalable nodes, called NeoMLP.
We demonstrate the effectiveness of our method by fitting high-resolution signals, including multi-modal audio-visual data.
arXiv Detail & Related papers (2024-12-11T19:01:38Z)
- MLP Can Be A Good Transformer Learner [73.01739251050076]
The self-attention mechanism is key to the Transformer but is often criticized for its computational demands.
This paper introduces a novel strategy that simplifies vision transformers and reduces computational load through the selective removal of non-essential attention layers.
arXiv Detail & Related papers (2024-04-08T16:40:15Z)
- Data-free Weight Compress and Denoise for Large Language Models [101.53420111286952]
We propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices.
We achieve a model pruning of 80% parameters while retaining 93.43% of the original performance without any calibration data.
arXiv Detail & Related papers (2024-02-26T05:51:47Z)
- MLIC++: Linear Complexity Multi-Reference Entropy Modeling for Learned Image Compression [30.71965784982577]
We introduce MEM++, which captures a diverse range of correlations inherent in the latent representation.
MEM++ achieves state-of-the-art performance, reducing BD-rate by 13.39% on the Kodak dataset compared to VTM-17.0 in PSNR.
MLIC++ exhibits linear GPU memory consumption with resolution, making it highly suitable for high-resolution image coding.
arXiv Detail & Related papers (2023-07-28T09:11:37Z)
- ReLU Fields: The Little Non-linearity That Could [62.228229880658404]
We investigate what is the smallest change to grid-based representations that allows for retaining the high-fidelity results of MLPs.
We show that such an approach becomes competitive with the state-of-the-art.
arXiv Detail & Related papers (2022-05-22T13:42:31Z)
- Rethinking Network Design and Local Geometry in Point Cloud: A Simple Residual MLP Framework [55.40001810884942]
We introduce a pure residual network, called PointMLP, which integrates no sophisticated local geometrical extractors but still performs very competitively.
On the real-world ScanObjectNN dataset, our method even surpasses the prior best method by 3.3% accuracy.
Compared to the most recent CurveNet, PointMLP trains 2x faster, tests 7x faster, and is more accurate on the ModelNet40 benchmark.
arXiv Detail & Related papers (2022-02-15T01:39:07Z)
- Unfolding Projection-free SDP Relaxation of Binary Graph Classifier via GDPA Linearization [59.87663954467815]
Algorithm unfolding creates an interpretable and parsimonious neural network architecture by implementing each iteration of a model-based algorithm as a neural layer.
In this paper, leveraging a recent linear algebraic theorem called Gershgorin disc perfect alignment (GDPA), we unroll a projection-free algorithm for the semi-definite programming relaxation (SDR) of a binary graph classifier.
Experimental results show that our unrolled network outperformed pure model-based graph classifiers, and achieved performance comparable to pure data-driven networks while using far fewer parameters.
arXiv Detail & Related papers (2021-09-10T07:01:15Z)
- Hybrid Trilinear and Bilinear Programming for Aligning Partially Overlapping Point Sets [85.71360365315128]
In many applications, we need algorithms that can align partially overlapping point sets and are invariant to the corresponding transformation, such as the RPM algorithm.
We first show that the objective is a cubic polynomial function. We then utilize the convex envelopes of trilinear and bilinear monomials to derive its lower bound.
We next develop a branch-and-bound (BnB) algorithm which only branches over the transformation variables and runs efficiently.
arXiv Detail & Related papers (2021-01-19T04:24:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.