Taming Nonconvexity in Kernel Feature Selection---Favorable Properties
of the Laplace Kernel
- URL: http://arxiv.org/abs/2106.09387v1
- Date: Thu, 17 Jun 2021 11:05:48 GMT
- Title: Taming Nonconvexity in Kernel Feature Selection---Favorable Properties
of the Laplace Kernel
- Authors: Feng Ruan, Keli Liu, Michael I. Jordan
- Abstract summary: A core challenge is that the objective functions used to define kernel-based feature selection are nonconvex.
The gradient-based algorithms available for nonconvex optimization are only able to guarantee convergence to local minima.
- Score: 77.73399781313893
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Kernel-based feature selection is an important tool in nonparametric
statistics. Despite many practical applications of kernel-based feature
selection, there is little statistical theory available to support the method.
A core challenge is that the objective functions of the optimization problems used
to define kernel-based feature selection are nonconvex. The literature has only
studied the statistical properties of the \emph{global optima}, which is a
mismatch, given that the gradient-based algorithms available for nonconvex
optimization are only able to guarantee convergence to local minima. Studying
the full landscape associated with kernel-based methods, we show that feature
selection objectives using the Laplace kernel (and other $\ell_1$ kernels) come
with statistical guarantees that other kernels, including the ubiquitous
Gaussian kernel (or other $\ell_2$ kernels) do not possess. Based on a sharp
characterization of the gradient of the objective function, we show that
$\ell_1$ kernels eliminate unfavorable stationary points that appear when using
an $\ell_2$ kernel. Armed with this insight, we establish statistical
guarantees for $\ell_1$ kernel-based feature selection which do not require
reaching the global minima. In particular, we establish model-selection
consistency of $\ell_1$-kernel-based feature selection in recovering main
effects and hierarchical interactions in the nonparametric setting with $n \sim
\log p$ samples.
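To make the comparison concrete, here is a minimal, hedged sketch of coordinate-weighted kernel feature selection: nonnegative per-feature weights enter either an $\ell_1$ (Laplace-type) or $\ell_2$ (Gaussian-type) kernel, and the weights are tuned by projected gradient ascent on an HSIC-style dependence surrogate. The surrogate objective, the finite-difference gradient, and the function names are illustrative assumptions, not the paper's exact criterion or algorithm.

```python
import numpy as np

def pairwise_diffs(X, kind="laplace"):
    """Per-coordinate pairwise differences: |x_j - x'_j| for an l1 (Laplace-type)
    kernel, (x_j - x'_j)^2 for an l2 (Gaussian-type) kernel."""
    D = np.abs(X[:, None, :] - X[None, :, :])
    return D if kind == "laplace" else D ** 2

def hsic(K, L):
    """Biased HSIC estimate between Gram matrices K and L (a stand-in
    dependence measure; the paper's objective differs in its details)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def objective(beta, D, Ky):
    """Dependence between the beta-weighted kernel on X and a kernel on y."""
    K = np.exp(-np.tensordot(D, beta, axes=([2], [0])))
    return hsic(K, Ky)

def select_features(X, y, kind="laplace", steps=150, lr=0.5, eps=1e-4):
    """Projected gradient ascent on nonnegative feature weights beta; features
    that keep large weights are the selected ones.  A crude finite-difference
    gradient keeps the sketch short."""
    n, p = X.shape
    D = pairwise_diffs(X, kind)
    Ky = np.outer(y, y).astype(float)            # linear kernel on the response
    beta = np.full(p, 1.0 / p)
    for _ in range(steps):
        base = objective(beta, D, Ky)
        grad = np.array([(objective(beta + eps * np.eye(p)[j], D, Ky) - base) / eps
                         for j in range(p)])
        beta = np.clip(beta + lr * grad, 0.0, None)   # projection onto beta >= 0
    return beta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 120, 8
    X = rng.standard_normal((n, p))
    y = np.sign(X[:, 0]) * np.abs(X[:, 1])       # only features 0 and 1 matter
    for kind in ("laplace", "gauss"):
        print(kind, np.round(select_features(X, y, kind), 3))
```

The paper's landscape analysis concerns exactly this kind of gradient procedure: which stationary points it can be trapped in when an $\ell_2$ kernel is used, and why those traps disappear for $\ell_1$ kernels.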
Related papers
- Optimal Kernel Choice for Score Function-based Causal Discovery [92.65034439889872]
We propose a kernel selection method within the generalized score function that automatically selects the optimal kernel that best fits the data.
We conduct experiments on both synthetic data and real-world benchmarks, and the results demonstrate that our proposed method outperforms existing kernel selection methods.
arXiv Detail & Related papers (2024-07-14T09:32:20Z)
- Conditional mean embeddings and optimal feature selection via positive definite kernels [0.0]
We consider operator theoretic approaches to conditional mean embeddings (CME).
Our results combine a spectral analysis-based optimization scheme with the use of kernels, processes, and constructive learning algorithms.
arXiv Detail & Related papers (2023-05-14T08:29:15Z)
- Structural Kernel Search via Bayesian Optimization and Symbolical Optimal Transport [5.1672267755831705]
For Gaussian processes, selecting the kernel is a crucial task, often done manually by the expert.
We propose a novel, efficient search method through a general, structured kernel space.
arXiv Detail & Related papers (2022-10-21T09:30:21Z)
- Variational Autoencoder Kernel Interpretation and Selection for Classification [59.30734371401315]
This work proposes kernel selection approaches for probabilistic classifiers based on features produced by the convolutional encoder of a variational autoencoder.
In the proposed implementation, each latent variable was sampled from the distribution associated with a single kernel of the last encoder's convolution layer, as an individual distribution was created for each kernel.
Choosing relevant features among the sampled latent variables makes it possible to perform kernel selection, filtering out uninformative features and kernels.
arXiv Detail & Related papers (2022-09-10T17:22:53Z)
- Learning "best" kernels from data in Gaussian process regression. With application to aerodynamics [0.4588028371034406]
We introduce algorithms to select/design kernels in Gaussian process regression/kriging surrogate modeling techniques.
A first class of algorithms is kernel flow, which was introduced in a context of classification in machine learning.
A second class of algorithms is called spectral kernel ridge regression, and aims at selecting a "best" kernel such that the norm of the function to be approximated is minimal.
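As a rough illustration of the first class mentioned here, the sketch below uses a cross-validation-flavored criterion in the spirit of kernel flow (not this paper's implementation): it asks how much the squared norm of the minimum-norm interpolant drops when half the data is removed, and scores candidate bandwidths by that relative drop. The jitter, the candidate grid, and the function names are assumptions.

```python
import numpy as np

def rbf_gram(X1, X2, gamma):
    """Gaussian kernel Gram matrix with inverse bandwidth gamma."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def flow_score(X, y, gamma, n_splits=20, jitter=1e-8, seed=0):
    """Average of rho = 1 - (y_s^T K_s^{-1} y_s) / (y^T K^{-1} y) over random
    half-subsets s.  A small rho means that dropping half of the data barely
    changes the interpolant, suggesting the kernel matches the data."""
    rng = np.random.default_rng(seed)
    n = len(y)
    K = rbf_gram(X, X, gamma) + jitter * np.eye(n)
    full = float(y @ np.linalg.solve(K, y))
    rhos = []
    for _ in range(n_splits):
        s = rng.choice(n, size=n // 2, replace=False)
        Ks = rbf_gram(X[s], X[s], gamma) + jitter * np.eye(n // 2)
        rhos.append(1.0 - float(y[s] @ np.linalg.solve(Ks, y[s])) / full)
    return float(np.mean(rhos))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.uniform(-2, 2, size=(80, 1))
    y = np.sin(2 * X[:, 0]) + 0.05 * rng.standard_normal(80)
    scores = {g: round(flow_score(X, y, g), 3) for g in (0.03, 0.3, 3.0, 30.0)}
    print(scores)   # a smaller score marks a better-matched bandwidth
```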
arXiv Detail & Related papers (2022-06-03T07:50:54Z)
- Kernel Identification Through Transformers [54.3795894579111]
Kernel selection plays a central role in determining the performance of Gaussian Process (GP) models.
This work addresses the challenge of constructing custom kernel functions for high-dimensional GP regression models.
We introduce a novel approach named KITT: Kernel Identification Through Transformers.
arXiv Detail & Related papers (2021-06-15T14:32:38Z)
- Kernel k-Means, By All Means: Algorithms and Strong Consistency [21.013169939337583]
Kernel $k$-means clustering is a powerful tool for unsupervised learning of non-linear data.
In this paper, we generalize existing results by leveraging a general family of means to combat sub-optimal local solutions.
Our algorithm makes use of majorization-minimization (MM) to better solve this non-linear separation problem.
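For context, the sketch below is plain Lloyd-style kernel $k$-means on a precomputed Gram matrix; it omits the generalized means and the majorization-minimization scheme this paper actually studies, and the function names and toy data are illustrative.

```python
import numpy as np

def kernel_kmeans(K, k, n_iter=100, seed=0):
    """Plain Lloyd-style kernel k-means on a precomputed Gram matrix K.
    Squared distance to a cluster centre in feature space uses only K:
      ||phi(x_i) - mu_c||^2 = K_ii - (2/|c|) sum_{j in c} K_ij
                               + (1/|c|^2) sum_{j,l in c} K_jl
    """
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=n)
    for _ in range(n_iter):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            idx = np.where(labels == c)[0]
            if idx.size == 0:
                continue                                   # empty cluster stays at +inf
            within = K[np.ix_(idx, idx)].sum() / idx.size ** 2
            cross = K[:, idx].sum(axis=1) / idx.size
            dist[:, c] = np.diag(K) - 2.0 * cross + within
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = np.vstack([rng.standard_normal((100, 2)) - 3.0,
                   rng.standard_normal((100, 2)) + 3.0])   # two well-separated blobs
    sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)
    K = np.exp(-sq / 2.0)                                  # Gaussian kernel
    print(np.bincount(kernel_kmeans(K, 2)))                # cluster sizes
```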
arXiv Detail & Related papers (2020-11-12T16:07:18Z)
- Isolation Distributional Kernel: A New Tool for Point & Group Anomaly Detection [76.1522587605852]
Isolation Distributional Kernel (IDK) is a new way to measure the similarity between two distributions.
We demonstrate IDK's efficacy and efficiency as a new tool for kernel based anomaly detection for both point and group anomalies.
arXiv Detail & Related papers (2020-09-24T12:25:43Z)
- Scaling up Kernel Ridge Regression via Locality Sensitive Hashing [6.704115928005158]
We introduce a weighted version of random binning features and show that the corresponding kernel function generates smooth Gaussian processes.
We show that our weighted random binning features provide a spectral approximation to the corresponding kernel matrix, leading to efficient algorithms for kernel ridge regression.
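For orientation only, here is the classic unweighted random binning feature map of Rahimi and Recht (2007), whose feature inner products approximate a Laplacian ($\ell_1$) kernel; the weighted variant and the locality-sensitive-hashing machinery of this paper are not reproduced, and the parameter choices are illustrative.

```python
import numpy as np

def random_binning_features(X, sigma=1.0, n_grids=300, seed=0):
    """Classic (unweighted) random binning features.  Inner products of the
    returned features approximate the Laplacian kernel exp(-||x - y||_1 / sigma).
    A grid pitch drawn from Gamma(shape=2, scale=sigma) per coordinate is the
    standard choice for this kernel."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    rows, cols, col_of_bin = [], [], {}
    for g in range(n_grids):
        delta = rng.gamma(shape=2.0, scale=sigma, size=p)   # grid pitch per dim
        shift = rng.uniform(0.0, delta)                      # random offset per dim
        bins = np.floor((X - shift) / delta).astype(int)     # (n, p) bin indices
        for i in range(n):
            key = (g, tuple(bins[i]))
            col = col_of_bin.setdefault(key, len(col_of_bin))
            rows.append(i)
            cols.append(col)
    Z = np.zeros((n, len(col_of_bin)))
    Z[rows, cols] = 1.0 / np.sqrt(n_grids)                   # one occupied cell per grid
    return Z

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((50, 3))
    Z = random_binning_features(X, sigma=2.0)
    approx = Z @ Z.T
    exact = np.exp(-np.abs(X[:, None, :] - X[None, :, :]).sum(-1) / 2.0)
    print("max abs error:", round(float(np.abs(approx - exact).max()), 3))
```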
arXiv Detail & Related papers (2020-03-21T21:41:16Z)
- Learning Deep Kernels for Non-Parametric Two-Sample Tests [50.92621794426821]
We propose a class of kernel-based two-sample tests, which aim to determine whether two sets of samples are drawn from the same distribution.
Our tests are constructed from kernels parameterized by deep neural nets, trained to maximize test power.
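As a baseline for this entry, here is the simplest version of the same test statistic: an unbiased MMD^2 estimate with a fixed Gaussian kernel and a permutation p-value. The deep-network kernel and the power-maximizing training described in the paper are not attempted here, and the bandwidth and function names are assumptions.

```python
import numpy as np

def mmd2_unbiased(X, Y, gamma=0.5):
    """Unbiased estimate of MMD^2 between samples X and Y under a fixed
    Gaussian kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    def gram(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)
    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))     # drop diagonal terms
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * Kxy.mean()

def permutation_test(X, Y, n_perm=500, seed=0):
    """Permutation p-value for the null that X and Y share one distribution."""
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    observed = mmd2_unbiased(X, Y)
    hits = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(Z))
        hits += mmd2_unbiased(Z[idx[:len(X)]], Z[idx[len(X):]]) >= observed
    return (hits + 1) / (n_perm + 1)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((100, 2))
    Y = rng.standard_normal((100, 2)) + np.array([0.5, 0.0])  # mean-shifted sample
    print("MMD^2:", round(mmd2_unbiased(X, Y), 4), "p-value:", permutation_test(X, Y))
```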
arXiv Detail & Related papers (2020-02-21T03:54:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.