On the Self-Penalization Phenomenon in Feature Selection
- URL: http://arxiv.org/abs/2110.05852v1
- Date: Tue, 12 Oct 2021 09:36:41 GMT
- Title: On the Self-Penalization Phenomenon in Feature Selection
- Authors: Michael I. Jordan, Keli Liu, and Feng Ruan
- Abstract summary: We describe an implicit sparsity-inducing mechanism based on minimization over a family of kernels.
As an application, we use this sparsity-inducing mechanism to build algorithms consistent for feature selection.
- Score: 69.16452769334367
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We describe an implicit sparsity-inducing mechanism based on minimization
over a family of kernels: \begin{equation*}
\min_{\beta, f}~\widehat{\mathbb{E}}[L(Y, f(\beta^{1/q} \odot X))] + \lambda_n
\|f\|_{\mathcal{H}_q}^2~~\text{subject to}~~\beta \ge 0, \end{equation*} where
$L$ is the loss, $\odot$ is coordinate-wise multiplication and $\mathcal{H}_q$
is the reproducing kernel Hilbert space based on the kernel $k_q(x, x') =
h(\|x-x'\|_q^q)$, where $\|\cdot\|_q$ is the $\ell_q$ norm. Using gradient
descent to optimize this objective with respect to $\beta$ leads to exactly
sparse stationary points with high probability. The sparsity is achieved
without using any of the well-known explicit sparsification techniques such as
penalization (e.g., $\ell_1$), early stopping or post-processing (e.g.,
clipping).
As an application, we use this sparsity-inducing mechanism to build
algorithms consistent for feature selection.
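To make the mechanism concrete, below is a minimal NumPy sketch under illustrative assumptions: squared loss, $q = 1$, and $h(t) = e^{-t}$, so the effective kernel is $k_\beta(x, x') = \exp(-\sum_j \beta_j |x_j - x'_j|)$. With squared loss, the inner minimization over $f$ is kernel ridge regression, whose optimal value reduces to $J(\beta) = \lambda_n\, y^\top (K_\beta + n\lambda_n I)^{-1} y$ by a standard identity, so projected gradient descent can run on $\beta$ alone. Data, step size, and iteration count are made-up illustrative choices.

```python
# A minimal sketch of the self-penalization mechanism, assuming squared loss,
# q = 1 and h(t) = exp(-t); the paper's setting is more general.
import numpy as np

rng = np.random.default_rng(0)

def kernel_and_dists(X, beta):
    """K_beta[i, l] = exp(-sum_j beta_j |X_ij - X_lj|) plus per-coordinate distances."""
    D = np.abs(X[:, None, :] - X[None, :, :])          # (n, n, d) pairwise |x_ij - x_lj|
    return np.exp(-(D * beta).sum(axis=-1)), D

def objective_and_grad(beta, X, y, lam):
    """Reduced objective J(beta) = lam * y^T (K + n*lam*I)^{-1} y and its gradient."""
    n = len(y)
    K, D = kernel_and_dists(X, beta)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    J = lam * (y @ alpha)
    # dK/dbeta_j = -K * D_j, hence dJ/dbeta_j = lam * alpha^T (K * D_j) alpha.
    grad = lam * np.einsum("i,il,ilj,l->j", alpha, K, D, alpha)
    return J, grad

# Toy data: only the first two of ten coordinates carry signal.
n, d = 100, 10
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + X[:, 1] + 0.1 * rng.normal(size=n)

beta, lam, step = np.full(d, 1.0 / d), 0.1, 1.0
for _ in range(1000):
    _, g = objective_and_grad(beta, X, y, lam)
    beta = np.maximum(beta - step * g, 0.0)            # projected gradient step keeps beta >= 0

print(np.round(beta, 3))  # noise coordinates tend to land at exactly 0
```

No $\ell_1$ penalty, early stopping, or post-hoc clipping appears anywhere above: any zeros in $\beta$ arise from the projected gradient dynamics themselves, which is the self-penalization effect the abstract describes.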
Related papers
- Upper Bounds for Learning in Reproducing Kernel Hilbert Spaces for Orbits of an Iterated Function System [1.1510009152620668]
A key problem in learning theory is to compute a function $f$ that closely approximates the relationship between some input $x$ and corresponding output $y$.
This approximation is based on sample points $(x_t, y_t)_{t=1}^{m}$, where the function $f$ can be approximated within reproducing kernel Hilbert spaces.
arXiv Detail & Related papers (2024-10-10T20:34:22Z)
- Complexity of Minimizing Projected-Gradient-Dominated Functions with Stochastic First-order Oracles [38.45952947660789]
This work investigates the performance limits of projected first-order methods for minimizing functions under the $(\alpha, \tau, \mathcal{X})$-projected-gradient-dominance property.
arXiv Detail & Related papers (2024-08-03T18:34:23Z)
- Solving Dense Linear Systems Faster Than via Preconditioning [1.8854491183340518]
We show that our algorithm achieves an $\tilde{O}(n^2)$ runtime when $k = O(n^{0.729})$.
Our main algorithm can be viewed as a randomized block coordinate descent method.
arXiv Detail & Related papers (2023-12-14T12:53:34Z)
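Since the entry describes the method as a randomized block coordinate descent, here is a generic sketch of that template for a positive definite system; this is not the paper's accelerated algorithm, and the block size, sampling scheme, and sweep count are arbitrary illustrative choices.

```python
# A generic randomized block coordinate descent solver for Ax = b with A
# positive definite -- the template the paper's main algorithm refines.
import numpy as np

rng = np.random.default_rng(0)

def block_cd_solve(A, b, block_size=16, sweeps=300):
    """Minimize f(x) = 0.5 x^T A x - b^T x by exact minimization over random blocks."""
    n = len(b)
    x = np.zeros(n)
    for _ in range(sweeps):
        S = rng.choice(n, size=block_size, replace=False)  # random coordinate block
        r = b - A @ x                                      # residual at the current iterate
        x[S] += np.linalg.solve(A[np.ix_(S, S)], r[S])     # exact block update
    return x

n = 200
M = rng.normal(size=(n, n))
A = M @ M.T + n * np.eye(n)        # a well-conditioned positive definite test matrix
b = rng.normal(size=n)
x = block_cd_solve(A, b)
print(np.linalg.norm(A @ x - b))   # residual shrinks with more sweeps
```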
- Convergence of a Normal Map-based Prox-SGD Method under the KL Inequality [0.0]
We present a novel normal map-based proximal stochastic gradient algorithm ($\mathsf{norM}\text{-}\mathsf{SGD}$) and analyze its convergence under the Kurdyka-Łojasiewicz (KL) inequality.
arXiv Detail & Related papers (2023-05-10T01:12:11Z)
- Low-Rank Approximation with $1/\epsilon^{1/3}$ Matrix-Vector Products [58.05771390012827]
We study iterative methods based on Krylov subspaces for low-rank approximation under any Schatten-$p$ norm.
Our main result is an algorithm that uses only $\tilde{O}(k/\sqrt{\epsilon})$ matrix-vector products.
arXiv Detail & Related papers (2022-02-10T16:10:41Z)
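For orientation, a bare-bones block Krylov low-rank approximation routine of the kind analyzed in this line of work; the depth below is an arbitrary choice, and the paper's point is the improved $\tilde{O}(k/\sqrt{\epsilon})$ matrix-vector bound rather than this specific recipe.

```python
# A bare-bones block Krylov low-rank approximation (illustrative, not the
# paper's tuned method): build a Krylov subspace from a random block, then
# take the best rank-k approximation within it.
import numpy as np

rng = np.random.default_rng(0)

def block_krylov_lowrank(A, k, depth=4):
    m, n = A.shape
    V = A @ rng.normal(size=(n, k))            # random starting block
    blocks = [V]
    for _ in range(depth):
        V = A @ (A.T @ V)                      # apply one more power of A A^T
        blocks.append(V)
    Q, _ = np.linalg.qr(np.hstack(blocks))     # orthonormal basis of the Krylov subspace
    U, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U[:, :k]) @ np.diag(s[:k]) @ Vt[:k]   # best rank-k approx in the subspace

A = rng.normal(size=(300, 100)) @ rng.normal(size=(100, 200))
err = np.linalg.norm(A - block_krylov_lowrank(A, k=10)) / np.linalg.norm(A)
print(err)
```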
- Random matrices in service of ML footprint: ternary random features with no performance loss [55.30329197651178]
We show that the eigenspectrum of $\mathbf{K}$ is independent of the distribution of the i.i.d. entries of $\mathbf{w}$.
We propose a novel random features technique, called Ternary Random Features (TRF).
Computing the proposed random features requires no multiplication and a factor of $b$ fewer bits for storage compared to classical random features.
arXiv Detail & Related papers (2021-10-05T09:33:49Z)
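A minimal sketch of the ternary idea: draw the random projection matrix with i.i.d. entries in $\{-1, 0, +1\}$, so the projection needs no multiplications and each entry fits in 2 bits. The sparsity level and the cosine nonlinearity below are illustrative choices, not necessarily those of the paper.

```python
# Illustrative ternary random features: W has i.i.d. entries in {-1, 0, +1},
# so W @ x is multiplication-free in a bit-level implementation and W needs
# only 2 bits per entry instead of a float.
import numpy as np

rng = np.random.default_rng(0)

def ternary_features(X, num_features=512, p_zero=0.5):
    d = X.shape[1]
    W = rng.choice([-1, 0, 1], size=(num_features, d),
                   p=[(1 - p_zero) / 2, p_zero, (1 - p_zero) / 2])
    return np.cos(X @ W.T / np.sqrt(d))       # random-feature map (NumPy used for clarity)

X = rng.normal(size=(100, 20))
Phi = ternary_features(X)
K_approx = Phi @ Phi.T / Phi.shape[1]         # implicit kernel approximated by the features
print(K_approx.shape)
```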
- Spectral properties of sample covariance matrices arising from random matrices with independent non identically distributed columns [50.053491972003656]
It was previously shown that the functionals $\text{tr}(AR(z))$, for $R(z) = (\frac{1}{n}XX^T - zI_p)^{-1}$ and $A \in \mathcal{M}_p$ deterministic, have a standard deviation of order $O(\|A\|_* / \sqrt{n})$.
Here, we bound the error $\|\mathbb{E}[R(z)] - \tilde{R}(z)\|_F$ between the expected resolvent and its deterministic equivalent $\tilde{R}(z)$.
arXiv Detail & Related papers (2021-09-06T14:21:43Z)
- DIPPA: An improved Method for Bilinear Saddle Point Problems [18.65143269806133]
This paper deals with bilinear saddle point problems of the form $\min_{\mathbf{x}} \max_{\mathbf{y}}\, g(\mathbf{x}) + \mathbf{x}^\top \mathbf{A} \mathbf{y} - h(\mathbf{y})$, with complexity bounds stated in terms of the condition numbers $\kappa_x$ and $\kappa_y$.
arXiv Detail & Related papers (2021-03-15T10:55:30Z)
- Optimal Regret Algorithm for Pseudo-1d Bandit Convex Optimization [51.23789922123412]
We study online learning with bandit feedback (i.e. learner has access to only zeroth-order oracle) where cost/reward functions admit a "pseudo-1d" structure.
We show a lower bound of $\min(\sqrt{dT}, T^{3/4})$ for the regret of any algorithm, where $T$ is the number of rounds.
We propose a new algorithm that combines randomized online gradient descent with a kernelized exponential weights method to exploit the pseudo-1d structure effectively.
arXiv Detail & Related papers (2021-02-15T08:16:51Z)
- Optimal Mean Estimation without a Variance [103.26777953032537]
We study the problem of heavy-tailed mean estimation in settings where the variance of the data-generating distribution does not exist.
We design an estimator which attains the smallest possible confidence interval as a function of $n$, $d$, and $\delta$.
arXiv Detail & Related papers (2020-11-24T22:39:21Z)
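Not the paper's estimator, but for orientation, here is the classical median-of-means baseline for heavy-tailed mean estimation, with the bucket count tied to the confidence level $\delta$ in the standard way; the constant 8 is a conventional illustrative choice.

```python
# Classical median-of-means (a baseline for heavy-tailed mean estimation,
# not the estimator constructed in the paper above).
import numpy as np

rng = np.random.default_rng(0)

def median_of_means(x, delta=0.01):
    """Mean estimate valid at confidence 1 - delta under weak moment assumptions."""
    k = max(1, int(np.ceil(8 * np.log(1.0 / delta))))  # number of buckets
    buckets = np.array_split(rng.permutation(x), k)
    return np.median([b.mean() for b in buckets])

# Pareto sample with tail index 1.5: the mean is 3 but the variance is infinite.
x = rng.pareto(1.5, size=10_000) + 1.0
print(median_of_means(x), x.mean())
```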