Learning to Understand: Identifying Interactions via the Möbius Transform
- URL: http://arxiv.org/abs/2402.02631v2
- Date: Sat, 15 Jun 2024 20:22:39 GMT
- Title: Learning to Understand: Identifying Interactions via the Möbius Transform
- Authors: Justin S. Kang, Yigit E. Erginbas, Landon Butler, Ramtin Pedarsani, Kannan Ramchandran
- Abstract summary: We use the Möbius transform to find interpretable representations of learned functions.
A robust version of this algorithm withstands noise and maintains this complexity.
In several examples, we observe that representations generated via the sparse Möbius transform are up to twice as faithful to the original function.
- Score: 18.987216240237483
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One of the key challenges in machine learning is to find interpretable representations of learned functions. The Möbius transform is essential for this purpose, as its coefficients correspond to unique importance scores for sets of input variables. This transform is closely related to widely used game-theoretic notions of importance like the Shapley and Banzhaf value, but it also captures crucial higher-order interactions. Although computing the Möbius transform of a function with $n$ inputs involves $2^n$ coefficients, it becomes tractable when the function is sparse and of low degree, as we show is the case for many real-world functions. Under these conditions, the complexity of the transform computation is significantly reduced. When there are $K$ non-zero coefficients, our algorithm recovers the Möbius transform in $O(Kn)$ samples and $O(Kn^2)$ time asymptotically under certain assumptions, making it the first non-adaptive algorithm to do so. We also uncover a surprising connection between group testing and the Möbius transform. For functions where all interactions involve at most $t$ inputs, we use group testing results to compute the Möbius transform with $O(Kt\log n)$ sample complexity and $O(K\,\mathrm{poly}(n))$ time. A robust version of this algorithm withstands noise and maintains this complexity. This marks the first noise-tolerant algorithm for the Möbius transform with query complexity sub-linear in $n$. In several examples, we observe that representations generated via the sparse Möbius transform are up to twice as faithful to the original function as Shapley and Banzhaf values, while using the same number of terms.
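To make the object in the abstract concrete, here is a minimal brute-force sketch (plain Python toy example, not the paper's $O(Kn)$-sample sparse algorithm): the Möbius transform assigns to each subset $S$ of the inputs a coefficient $m(S) = \sum_{T \subseteq S} (-1)^{|S|-|T|} f(T)$, and the original function is recovered by Möbius inversion, $f(S) = \sum_{T \subseteq S} m(T)$. For a sparse, low-degree function only a handful of the $2^n$ coefficients are non-zero, which is the structure the paper's algorithms exploit.

```python
from itertools import combinations

def mobius_transform(f, n):
    """Brute-force Möbius transform of a set function f: frozenset -> float.

    m(S) = sum over T subset of S of (-1)^(|S|-|T|) * f(T).
    Enumerates all 2^n subsets, so this is only feasible for small n.
    """
    coeffs = {}
    for r in range(n + 1):
        for S in combinations(range(n), r):
            S = frozenset(S)
            m = 0.0
            for r2 in range(len(S) + 1):
                for T in combinations(sorted(S), r2):
                    m += (-1) ** (len(S) - r2) * f(frozenset(T))
            coeffs[S] = m
    return coeffs

def reconstruct(coeffs, S):
    """Möbius inversion: f(S) = sum of m(T) over all T subset of S."""
    return sum(m for T, m in coeffs.items() if T <= S)

if __name__ == "__main__":
    # Toy function: main effects for inputs 0 and 1 plus one pairwise interaction.
    f = lambda S: 1.0 * (0 in S) + 2.0 * (1 in S) + 3.0 * (0 in S and 1 in S)
    coeffs = mobius_transform(f, n=3)
    nonzero = {tuple(sorted(T)): m for T, m in coeffs.items() if abs(m) > 1e-9}
    print(nonzero)                                    # {(0,): 1.0, (1,): 2.0, (0, 1): 3.0}
    print(reconstruct(coeffs, frozenset({0, 1, 2})))  # 6.0 == f({0, 1, 2})
```

In this toy case only $K=3$ of the $2^3$ coefficients are non-zero and every interaction involves at most $t=2$ inputs, which is the sparse, low-degree regime the abstract describes.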
Related papers
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Conv-Basis: A New Paradigm for Efficient Attention Inference and Gradient Computation in Transformers [16.046186753149]
The self-attention mechanism is key to the success of transformers in recent Large Language Models (LLMs).
We leverage the convolution-like structure of attention matrices to develop an efficient approximation method for attention using convolution matrices.
We hope our new paradigm for accelerating attention computation in transformer models can help their application to longer contexts.
arXiv Detail & Related papers (2024-05-08T17:11:38Z) - Chain of Thought Empowers Transformers to Solve Inherently Serial Problems [57.58801785642868]
Chain of thought (CoT) is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetics and symbolic reasoning tasks.
This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness.
arXiv Detail & Related papers (2024-02-20T10:11:03Z) - Efficiently Learning One-Hidden-Layer ReLU Networks via Schur
Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\epsilon)^{O(k)}$, where $\epsilon>0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z) - Efficiently Computing Sparse Fourier Transforms of $q$-ary Functions [12.522202946750157]
We develop a sparse Fourier transform algorithm specifically for $q$-ary functions of length-$n$ sequences.
We show that for fixed $q$, a robust version of $q$-SFT has a sample complexity of $O(Sn^2)$ and a computational complexity of $O(Sn^3)$ with the same guarantees.
arXiv Detail & Related papers (2023-01-15T22:04:53Z) - A Law of Robustness beyond Isoperimetry [84.33752026418045]
We prove a Lipschitzness lower bound $\Omega(\sqrt{n/p})$ on the robustness of interpolating neural network parameters on arbitrary distributions.
We then show the potential benefit of overparametrization for smooth data when $n=\mathrm{poly}(d)$.
We disprove the potential existence of an $O(1)$-Lipschitz robust interpolating function when $n=\exp(\omega(d))$.
arXiv Detail & Related papers (2022-02-23T16:10:23Z) - Understanding and Compressing Music with Maximal Transformable Patterns [0.0]
We present an algorithm that discovers maximal patterns in a point set, $D \in \mathbb{R}^k$.
We also present a second algorithm that discovers the set of occurrences for each of these maximal patterns.
We evaluate the new compression algorithm with three classes of differing complexity on the task of classifying folk-song melodies into tune families.
arXiv Detail & Related papers (2022-01-26T17:47:26Z) - $k$-Forrelation Optimally Separates Quantum and Classical Query
Complexity [3.4984289152418753]
We show that any partial function on $N$ bits can be computed with an advantage $\delta$ over a random guess by making $q$ quantum queries.
The $k$-Forrelation problem -- a partial function that can be computed with $q = \lceil k/2 \rceil$ quantum queries -- was conjectured to exhibit such an extremal separation.
arXiv Detail & Related papers (2020-08-16T21:26:46Z) - $O(n)$ Connections are Expressive Enough: Universal Approximability of
Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z) - Quantum Legendre-Fenchel Transform [6.643082745560234]
We present a quantum algorithm to compute the discrete Legendre-Fenchel transform.
We show that our quantum algorithm is optimal up to polylogarithmic factors.
arXiv Detail & Related papers (2020-06-08T18:00:05Z) - A Randomized Algorithm to Reduce the Support of Discrete Measures [79.55586575988292]
Given a discrete probability measure supported on $N$ atoms and a set of $n$ real-valued functions, there exists a probability measure that is supported on a subset of $n+1$ of the original $N$ atoms.
We give a simple geometric characterization of barycenters via negative cones and derive a randomized algorithm that computes this new measure by "greedy geometric sampling".
We then study its properties, and benchmark it on synthetic and real-world data to show that it can be very beneficial in the $N \gg n$ regime.
arXiv Detail & Related papers (2020-06-02T16:38:36Z)