Random matrices in service of ML footprint: ternary random features with
no performance loss
- URL: http://arxiv.org/abs/2110.01899v1
- Date: Tue, 5 Oct 2021 09:33:49 GMT
- Title: Random matrices in service of ML footprint: ternary random features with
no performance loss
- Authors: Hafiz Tiomoko Ali, Zhenyu Liao, Romain Couillet
- Abstract summary: We show that the eigenspectrum of ${\bf K}$ is independent of the distribution of the i.i.d. entries of ${\bf w}$.
We propose a novel random features technique, called Ternary Random Feature (TRF).
The computation of the proposed random features requires no multiplication and a factor of $b$ fewer bits for storage compared to classical random features.
- Score: 55.30329197651178
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this article, we investigate the spectral behavior of random features
kernel matrices of the type ${\bf K} = \mathbb{E}_{{\bf w}}
\left[\sigma\left({\bf w}^{\sf T}{\bf x}_i\right)\sigma\left({\bf w}^{\sf
T}{\bf x}_j\right)\right]_{i,j=1}^n$, with nonlinear function $\sigma(\cdot)$,
data ${\bf x}_1, \ldots, {\bf x}_n \in \mathbb{R}^p$, and random projection
vector ${\bf w} \in \mathbb{R}^p$ having i.i.d. entries. In a high-dimensional
setting where the number of data $n$ and their dimension $p$ are both large and
comparable, we show, under a Gaussian mixture model for the data, that the
eigenspectrum of ${\bf K}$ is independent of the distribution of the
i.i.d. (zero-mean and unit-variance) entries of ${\bf w}$, and only depends on
$\sigma(\cdot)$ via its (generalized) Gaussian moments $\mathbb{E}_{z\sim
\mathcal N(0,1)}[\sigma'(z)]$ and $\mathbb{E}_{z\sim \mathcal
N(0,1)}[\sigma''(z)]$. As a result, for any kernel matrix ${\bf K}$ of the form
above, we propose a novel random features technique, called Ternary Random
Feature (TRF), that (i) asymptotically yields the same limiting kernel as the
original ${\bf K}$ in a spectral sense and (ii) can be computed and stored much
more efficiently, by wisely tuning (in a data-dependent manner) the function
$\sigma$ and the random vector ${\bf w}$, both taking values in $\{-1,0,1\}$.
The proposed random features can be computed without any multiplication and
stored with a factor of $b$ fewer bits than classical random features such as
random Fourier features, with $b$ the number of bits needed to store
full-precision values. Moreover, our experiments on real data show that these
substantial gains in computation and storage come with somewhat improved
performance compared to state-of-the-art random features
compression/quantization methods.
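To make the construction concrete, below is a minimal NumPy sketch of the idea: classical random features next to ternary ones, with both the projection vectors and the activation taking values in $\{-1,0,1\}$. The sparsity level and the hard-threshold rule are illustrative assumptions; the paper tunes $\sigma$ and ${\bf w}$ in a data-dependent manner via the Gaussian moments of the original nonlinearity, which is not reproduced here.

```python
# Minimal NumPy sketch of ternary random features (TRF), under simplifying
# assumptions: the sparsity level and the hard threshold below are placeholders,
# not the data-dependent tuning rule of the paper.
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 200, 100, 1024         # samples, dimension, number of random features
X = rng.standard_normal((n, p))  # placeholder data

# Classical random features: Gaussian projections + a smooth nonlinearity.
W = rng.standard_normal((m, p))
Phi = np.cos(X @ W.T)            # random-Fourier-type features
K_classical = Phi @ Phi.T / m

# Ternary random features: projections and activation both valued in {-1, 0, 1}.
sparsity = 0.5                   # assumed fraction of zeros in w
W_trf = rng.choice([-1.0, 0.0, 1.0], size=(m, p),
                   p=[(1 - sparsity) / 2, sparsity, (1 - sparsity) / 2])

def ternary(t, s):
    """sigma(t) in {-1, 0, 1}: zero on [-s, s], sign of t outside."""
    return np.sign(t) * (np.abs(t) > s)

s = np.median(np.abs(X @ W_trf.T))  # crude data-dependent threshold (assumption)
Phi_trf = ternary(X @ W_trf.T, s)
K_trf = Phi_trf @ Phi_trf.T / m

# Since W_trf and Phi_trf only take values in {-1, 0, 1}, the projections reduce
# to signed additions, and each entry needs ~2 bits instead of b bits of storage.
print(np.linalg.norm(K_classical), np.linalg.norm(K_trf))
```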
Related papers
- Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
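For reference, here is a minimal NumPy sketch of the object being learned, i.e. the forward pass of one multi-head attention layer with $m$ heads; the dimensions and the random weights are illustrative, and the sketch does not implement the learning algorithm or the lower bounds of the cited paper.

```python
# Minimal sketch of a multi-head attention forward pass (generic definition,
# not the learning algorithm of the cited paper).
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    """X: (seq_len, d); Wq/Wk/Wv: (heads, d, d_head); Wo: (heads * d_head, d)."""
    heads = []
    for Q_w, K_w, V_w in zip(Wq, Wk, Wv):
        Q, K, V = X @ Q_w, X @ K_w, X @ V_w
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # attention weights per head
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
seq, d, m, d_head = 8, 16, 4, 4                        # m attention heads
X = rng.standard_normal((seq, d))
Wq, Wk, Wv = (rng.standard_normal((m, d, d_head)) for _ in range(3))
Wo = rng.standard_normal((m * d_head, d))
print(multi_head_attention(X, Wq, Wk, Wv, Wo).shape)   # (seq, d)
```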
arXiv Detail & Related papers (2024-02-06T15:39:09Z)
- A Unified Framework for Uniform Signal Recovery in Nonlinear Generative Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
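As a point of reference, the sketch below generates 1-bit and uniformly quantized (dithered) observations of a fixed signal under a Gaussian sensing matrix; it only illustrates the measurement models named above and does not implement the recovery framework of the cited paper.

```python
# Sketch of the 1-bit and uniformly quantized observation models mentioned above
# (measurement generation only; no recovery algorithm from the cited paper).
import numpy as np

rng = np.random.default_rng(0)
n, m = 64, 256                          # signal dimension, number of measurements
x_star = rng.standard_normal(n)
x_star /= np.linalg.norm(x_star)        # unit-norm signal (common normalization)

A = rng.standard_normal((m, n))         # Gaussian sensing matrix
y_1bit = np.sign(A @ x_star)            # 1-bit observations
delta = 0.1                             # quantization step (arbitrary choice)
dither = rng.uniform(-delta / 2, delta / 2, size=m)
y_quant = delta * np.round((A @ x_star + dither) / delta)  # dithered uniform quantization
print(y_1bit[:5], y_quant[:5])
```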
arXiv Detail & Related papers (2023-09-25T17:54:19Z)
- An Over-parameterized Exponential Regression [18.57735939471469]
Recent developments in the field of Large Language Models (LLMs) have sparked interest in the use of exponential activation functions.
We define the neural function $F: \mathbb{R}^{d \times m} \times \mathbb{R}^d \to \mathbb{R}$.
arXiv Detail & Related papers (2023-03-29T07:29:07Z)
- Learning a Single Neuron with Adversarial Label Noise via Gradient Descent [50.659479930171585]
We study a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C\,\epsilon$ with high probability.
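A small sketch of this setting: gradient descent on the empirical squared loss of a single ReLU neuron with noisy labels. The activation, noise model, and step size are illustrative choices, not those analyzed in the cited paper.

```python
# Sketch of gradient descent on F(w) = E[(sigma(w . x) - y)^2] for a single
# ReLU neuron with noisy labels (illustrative setup, not the cited analysis).
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 5000
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d) / np.sqrt(d)
relu = lambda t: np.maximum(t, 0.0)
y = relu(X @ w_star) + 0.1 * rng.standard_normal(n)     # Gaussian noise stands in for adversarial noise

w = 0.01 * rng.standard_normal(d)                       # small random initialization
lr = 0.5
for _ in range(500):
    pred = relu(X @ w)
    grad = 2.0 / n * X.T @ ((pred - y) * (X @ w > 0))   # (sub)gradient of the empirical loss
    w -= lr * grad

print(np.mean((relu(X @ w) - y) ** 2))                  # empirical loss F_n(w)
```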
arXiv Detail & Related papers (2022-06-17T17:55:43Z)
- Spectral properties of sample covariance matrices arising from random matrices with independent non identically distributed columns [50.053491972003656]
It was previously shown that the functionals $\text{tr}(AR(z))$, for $R(z) = (\frac{1}{n}XX^{\sf T} - zI_p)^{-1}$ and $A \in \mathcal{M}_p$ deterministic, have a standard deviation of order $O(\|A\|_* / \sqrt{n})$.
Here, we show that $\|\mathbb{E}[R(z)] - \tilde{R}(z)\|_F$
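For concreteness, the sketch below forms the resolvent $R(z) = (\frac{1}{n}XX^{\sf T} - zI_p)^{-1}$, evaluates $\text{tr}(AR(z))$, and estimates $\mathbb{E}[R(z)]$ by naive Monte Carlo; the deterministic equivalent $\tilde{R}(z)$ studied in the cited paper is not implemented.

```python
# Sketch of the resolvent R(z) = (XX^T / n - z I_p)^{-1} and of tr(A R(z));
# E[R(z)] is estimated by naive Monte Carlo (the deterministic equivalent
# tilde R(z) of the cited paper is not implemented here).
import numpy as np

rng = np.random.default_rng(0)
p, n, z = 50, 200, -1.0                 # z on the negative real axis keeps R(z) well defined

def resolvent(X, z):
    return np.linalg.inv(X @ X.T / X.shape[1] - z * np.eye(X.shape[0]))

A = rng.standard_normal((p, p))         # deterministic test matrix
X = rng.standard_normal((p, n))
print("tr(A R(z)) =", np.trace(A @ resolvent(X, z)))

E_R = np.mean([resolvent(rng.standard_normal((p, n)), z) for _ in range(200)], axis=0)
print("Frobenius norm of the Monte Carlo estimate of E[R(z)]:", np.linalg.norm(E_R))
```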
arXiv Detail & Related papers (2021-09-06T14:21:43Z)
- Kernel Thinning [26.25415159542831]
Kernel thinning is a new procedure for compressing a distribution $\mathbb{P}$ more effectively than i.i.d. sampling or standard thinning.
We derive explicit non-asymptotic maximum mean discrepancy bounds for Gaussian, Matérn, and B-spline kernels.
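The maximum mean discrepancy itself is easy to evaluate; the sketch below computes the MMD between a sample and a thinned subsample under a Gaussian kernel, using plain standard thinning for illustration rather than the kernel thinning algorithm of the cited paper.

```python
# Sketch: MMD between a sample and a thinned subsample under a Gaussian kernel.
# Plain "keep every k-th point" thinning is used for illustration; the kernel
# thinning procedure of the cited paper is not implemented here.
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * bandwidth ** 2))

def mmd(X, Y, bandwidth=1.0):
    Kxx = gaussian_kernel(X, X, bandwidth).mean()
    Kyy = gaussian_kernel(Y, Y, bandwidth).mean()
    Kxy = gaussian_kernel(X, Y, bandwidth).mean()
    return np.sqrt(max(Kxx + Kyy - 2 * Kxy, 0.0))

rng = np.random.default_rng(0)
sample = rng.standard_normal((1024, 2))          # points drawn from P
thinned = sample[::16]                           # standard thinning to 64 points
print("MMD(sample, thinned):", mmd(sample, thinned))
```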
arXiv Detail & Related papers (2021-05-12T17:56:42Z)
- Two-way kernel matrix puncturing: towards resource-efficient PCA and spectral clustering [43.50783459690612]
The method consists in randomly "puncturing" both the data matrix $X \in \mathbb{C}^{p \times n}$ and its corresponding kernel (Gram) matrix $K$ through Bernoulli masks.
We empirically confirm on GAN-generated image databases that it is possible to drastically puncture the data, thereby providing possibly huge computational and storage gains.
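The puncturing operation is straightforward to state; the sketch below applies independent Bernoulli masks to a data matrix and to its Gram matrix, with arbitrary keep-probabilities, and does not reproduce the spectral analysis of the cited paper.

```python
# Sketch of two-way puncturing: Bernoulli masks applied to the data matrix X
# and to its Gram matrix K (mask probabilities are arbitrary illustrative values).
import numpy as np

rng = np.random.default_rng(0)
p, n = 100, 300
X = rng.standard_normal((p, n))

eps_data, eps_kernel = 0.3, 0.5                  # keep probabilities (assumptions)
S = rng.random((p, n)) < eps_data                # entrywise mask on the data
B = np.triu(rng.random((n, n)) < eps_kernel, 1)  # symmetric mask on the kernel
B = B | B.T | np.eye(n, dtype=bool)

X_punct = X * S
K_punct = (X_punct.T @ X_punct / p) * B          # punctured Gram matrix
print(K_punct.shape, "kept kernel entries:", B.mean().round(2))
```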
arXiv Detail & Related papers (2021-02-24T14:01:58Z)
- Convergence of Sparse Variational Inference in Gaussian Processes Regression [29.636483122130027]
We show that a method with an overall computational cost of $\mathcal{O}(N(\log N)^{2D}(\log\log N)^2)$ can be used to perform inference.
arXiv Detail & Related papers (2020-08-01T19:23:34Z)
- Linear Time Sinkhorn Divergences using Positive Features [51.50788603386766]
Solving optimal transport with an entropic regularization requires computing an $n \times n$ kernel matrix that is repeatedly applied to a vector.
We propose to use instead ground costs of the form $c(x,y) = -\log\langle\varphi(x),\varphi(y)\rangle$ where $\varphi$ is a map from the ground space onto the positive orthant $\mathbb{R}^r_+$, with $r \ll n$.
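With such a positive feature map, the Gibbs kernel factorizes as $k(x,y) = e^{-c(x,y)} = \langle\varphi(x),\varphi(y)\rangle$, so every kernel-vector product in Sinkhorn's iterations costs $O(nr)$ instead of $O(n^2)$. The sketch below uses an arbitrary random positive feature map, not the maps studied in the cited paper.

```python
# Sketch of Sinkhorn iterations with a factorized kernel K = Phi_x Phi_y^T,
# where the positive features Phi are arbitrary (random exponential features,
# not the maps analyzed in the cited paper). Each K @ v costs O(n r), not O(n^2).
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 500, 5, 32
X, Y = rng.standard_normal((n, d)), rng.standard_normal((n, d))

def positive_features(Z, U):
    """Map onto the positive orthant R^r_+ (illustrative choice)."""
    return np.exp(Z @ U.T - (Z ** 2).sum(1, keepdims=True) / 2) / np.sqrt(U.shape[0])

U = rng.standard_normal((r, d))
Phi_x, Phi_y = positive_features(X, U), positive_features(Y, U)

a = b = np.full(n, 1.0 / n)                       # uniform marginals
u, v = np.ones(n), np.ones(n)
for _ in range(200):                              # Sinkhorn scaling in factorized form
    u = a / (Phi_x @ (Phi_y.T @ v))
    v = b / (Phi_y @ (Phi_x.T @ u))

transport_mass = u @ (Phi_x @ (Phi_y.T @ v))      # total mass, ~1 at convergence
print(transport_mass)
```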
arXiv Detail & Related papers (2020-06-12T10:21:40Z)