Feature Cross Search via Submodular Optimization
- URL: http://arxiv.org/abs/2107.02139v1
- Date: Mon, 5 Jul 2021 16:58:31 GMT
- Title: Feature Cross Search via Submodular Optimization
- Authors: Lin Chen, Hossein Esfandiari, Gang Fu, Vahab S. Mirrokni, Qian Yu
- Abstract summary: We study feature cross search as a fundamental primitive in feature engineering.
We show that there exists a simple greedy $(1-1/e)$-approximation algorithm for this problem.
- Score: 58.15569071608769
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we study feature cross search as a fundamental primitive in
feature engineering. The importance of feature cross search, especially for
linear models, has long been recognized, with well-known textbook examples. In
this problem, the goal is to select a small subset of features, combine them to
form a new feature (called the crossed feature) by considering their Cartesian
product, and find feature crosses to learn an accurate model. In
particular, we study the problem of maximizing a normalized Area Under the
Curve (AUC) of the linear model trained on the crossed feature column.
First, we show that it is not possible to provide an $n^{1/\log\log
n}$-approximation algorithm for this problem unless the exponential time
hypothesis fails. This result also rules out the possibility of solving this
problem in polynomial time unless $\mathsf{P}=\mathsf{NP}$. On the positive
side, under the naïve Bayes assumption, we show that there exists a simple
greedy $(1-1/e)$-approximation algorithm for this problem. This result is
established by relating the AUC to the total variation of the commutator of two
probability measures and showing that the total variation of the commutator is
monotone and submodular. To show this, we relate the submodularity of this
function to the positive semi-definiteness of a corresponding kernel matrix.
Then, we use Bochner's theorem to prove the positive semi-definiteness by
showing that its inverse Fourier transform is non-negative everywhere. Our
techniques and structural results might be of independent interest.
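The positive result above instantiates the classic greedy algorithm for maximizing a monotone submodular function under a cardinality constraint. Below is a minimal Python sketch of that generic greedy loop, not the authors' implementation: the set-function oracle `f`, the string-valued ground set, and the budget `k` are illustrative stand-ins. In the paper's setting, `f` would be the normalized AUC of a linear model trained on the crossed feature column, which is shown to be monotone and submodular under the naïve Bayes assumption.

```python
from typing import Callable, FrozenSet, Iterable, Set


def greedy_max(
    ground_set: Iterable[str],
    f: Callable[[FrozenSet[str]], float],
    k: int,
) -> Set[str]:
    """Greedy maximization of a monotone submodular f subject to |S| <= k.

    If f is monotone, submodular, and f(empty set) = 0, this achieves a
    (1 - 1/e)-approximation (Nemhauser-Wolsey-Fisher, 1978).
    """
    selected: Set[str] = set()
    candidates = set(ground_set)
    for _ in range(k):
        base = f(frozenset(selected))
        # Pick the element with the largest marginal gain over the current set.
        best, best_gain = None, 0.0
        for e in candidates:
            gain = f(frozenset(selected | {e})) - base
            if gain > best_gain:
                best, best_gain = e, gain
        if best is None:  # no strictly positive marginal gain remains
            break
        selected.add(best)
        candidates.remove(best)
    return selected
```

Each round costs one oracle call per remaining candidate, so the sketch makes O(nk) calls for a ground set of size n; lazy evaluation of marginal gains is a standard speedup that submodularity makes valid.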
Related papers
- Data subsampling for Poisson regression with pth-root-link [53.63838219437508]
We develop and analyze data subsampling techniques for Poisson regression.
In particular, we consider the Poisson generalized linear model with identity- and square-root-link functions.
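As a concrete reading of that setup, here is a hedged numpy sketch of fitting a Poisson GLM with a square-root link on a uniform subsample. Uniform sampling, the step size, and the iteration count are illustrative stand-ins; the paper analyzes smarter (sensitivity-based) subsampling schemes with guarantees, which this does not implement.

```python
import numpy as np


def fit_poisson_sqrt_link(X, y, subsample=1000, lr=1e-2, steps=2000, seed=0):
    """Gradient descent on the Poisson NLL with a square-root link,
    trained on a uniform subsample of the data.

    With the square-root link, eta = X @ w models sqrt(mu), so mu = eta**2.
    """
    rng = np.random.default_rng(seed)
    m = min(subsample, len(y))
    idx = rng.choice(len(y), size=m, replace=False)
    Xs, ys = X[idx], y[idx]

    w = np.full(X.shape[1], 0.1)
    for _ in range(steps):
        eta = Xs @ w
        mu = np.clip(eta**2, 1e-8, None)  # clip away from zero for the log
        # Gradient of sum_i (mu_i - y_i * log mu_i) with mu_i = eta_i^2:
        # d/dw = sum_i (1 - y_i / mu_i) * 2 * eta_i * x_i
        grad = 2.0 * Xs.T @ ((1.0 - ys / mu) * eta) / m
        w -= lr * grad
    return w
```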
arXiv Detail & Related papers (2024-10-30T10:09:05Z) - Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components of the transformer architecture, setting it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that, in the worst case, exponential dependence on the number of heads $m$ is unavoidable.
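For readers unfamiliar with the object being learned, here is a minimal numpy sketch of a standard multi-head attention forward pass with $m$ heads. The weight shapes and the single-input simplification are illustrative assumptions; nothing here reflects the paper's learning algorithm or lower-bound construction.

```python
import numpy as np


def multi_head_attention(X, Wq, Wk, Wv, Wo, m):
    """Standard multi-head attention on one sequence.

    X: (n, d) sequence of n tokens; Wq, Wk, Wv, Wo: (d, d) weight matrices
    with the per-head projections stacked column-wise; assumes d % m == 0.
    """
    n, d = X.shape
    dh = d // m  # per-head width
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(m):
        q = Q[:, h * dh:(h + 1) * dh]
        k = K[:, h * dh:(h + 1) * dh]
        v = V[:, h * dh:(h + 1) * dh]
        scores = q @ k.T / np.sqrt(dh)  # scaled dot-product attention
        attn = np.exp(scores - scores.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)  # row-wise softmax
        heads.append(attn @ v)
    # Concatenate the m heads and apply the output projection.
    return np.concatenate(heads, axis=1) @ Wo
```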
arXiv Detail & Related papers (2024-02-06T15:39:09Z) - Agnostically Learning Multi-index Models with Queries [54.290489524576756]
We study the power of query access for the task of agnostic learning under the Gaussian distribution.
We show that query access gives significant runtime improvements over random examples for agnostically learning MIMs.
arXiv Detail & Related papers (2023-12-27T15:50:47Z) - Near Optimal Heteroscedastic Regression with Symbiotic Learning [29.16456701187538]
We consider the problem of heteroscedastic linear regression.
We can estimate $\mathbf{w}^*$ in squared norm up to an error of $\tilde{O}\left(\|\mathbf{f}^*\|^2 \cdot \left(\frac{1}{n} + \left(\frac{d}{n}\right)^2\right)\right)$ and prove a matching lower bound.
arXiv Detail & Related papers (2023-06-25T16:32:00Z) - Private Isotonic Regression [54.32252900997422]
We consider the problem of isotonic regression over a partially ordered set (poset) $\mathcal{X}$ and for any Lipschitz loss function.
We obtain a pure-DP algorithm that has an expected excess empirical risk of roughly $\mathrm{width}(\mathcal{X}) \cdot \log|\mathcal{X}| / n$, where $\mathrm{width}(\mathcal{X})$ is the width of the poset.
We show that the bounds above are essentially the best that can be obtained.
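For intuition, here is a sketch of plain (non-private) isotonic regression on a totally ordered chain via the pool-adjacent-violators algorithm. The paper's setting is strictly more general (an arbitrary poset, any Lipschitz loss) and adds differential privacy, none of which this stand-in illustrates.

```python
def isotonic_l2(y):
    """L2 isotonic regression on a chain via pool adjacent violators.

    Returns the nondecreasing sequence closest to y in squared error.
    """
    level, weight = [], []  # block means and block sizes
    for v in y:
        level.append(float(v))
        weight.append(1.0)
        # Merge adjacent blocks while monotonicity is violated.
        while len(level) > 1 and level[-2] > level[-1]:
            w = weight[-1] + weight[-2]
            avg = (level[-1] * weight[-1] + level[-2] * weight[-2]) / w
            level = level[:-2] + [avg]
            weight = weight[:-2] + [w]
    out = []
    for lv, w in zip(level, weight):
        out.extend([lv] * int(w))  # expand each block back to full length
    return out
```

For example, `isotonic_l2([3, 1, 2])` pools all three points into the constant fit `[2.0, 2.0, 2.0]`, the closest nondecreasing sequence in squared error.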
arXiv Detail & Related papers (2022-10-27T05:08:07Z) - Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent [50.659479930171585]
We study functions of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations $\sigma$.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C\,\epsilon$ with high probability.
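A minimal sketch of the gradient-descent learner on the empirical squared loss follows, with ReLU standing in for a generic monotone activation. The step size, iteration count, and zero initialization are illustrative choices, not the settings analyzed in the paper.

```python
import numpy as np


def gd_single_neuron(X, y, lr=0.1, steps=500):
    """Gradient descent on the empirical L2 loss for x -> relu(w . x).

    ReLU is one concrete monotone activation; its subgradient is the
    indicator 1[w . x > 0].
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        z = X @ w
        pred = np.maximum(z, 0.0)
        # Gradient of (1/2n) * sum_i (relu(z_i) - y_i)^2 with respect to w.
        grad = X.T @ ((pred - y) * (z > 0)) / n
        w -= lr * grad
    return w
```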
arXiv Detail & Related papers (2022-06-17T17:55:43Z) - Analysis of One-Hidden-Layer Neural Networks via the Resolvent Method [0.0]
Motivated by random neural networks, we consider the random matrix $M = YY^{\ast}$ with $Y = f(WX)$.
We prove that the Stieltjes transform of the limiting spectral distribution satisfies a quartic self-consistent equation up to some error terms.
In addition, we extend the previous results to the case of additive bias $Y=f(WX+B)$ with $B$ being an independent rank-one Gaussian random matrix.
arXiv Detail & Related papers (2021-05-11T15:17:39Z) - Last iterate convergence of SGD for Least-Squares in the Interpolation
regime [19.05750582096579]
We study the noiseless model in the fundamental least-squares setup.
We assume that an optimum predictor fits the inputs and outputs perfectly, $\langle \theta_*, \phi(X) \rangle = Y$, where $\phi(X)$ stands for a possibly infinite-dimensional non-linear feature map.
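To make the object of study concrete, here is a sketch of single-sample SGD on least squares where the final iterate itself, without any Polyak-Ruppert averaging, is returned. The finite-dimensional linear features, step size, and epoch count are illustrative assumptions; the interpolation hypothesis is that some $\theta_*$ fits every pair exactly.

```python
import numpy as np


def sgd_last_iterate(X, y, lr=0.1, epochs=50, seed=0):
    """Single-sample SGD on 0.5 * (<theta, x_i> - y_i)^2.

    In the noiseless interpolation regime (<theta_*, x_i> = y_i for all i),
    the quantity of interest is the risk of this last iterate. The step
    size assumes features are scaled so that lr < 2 / max_i ||x_i||^2.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            g = (X[i] @ theta - y[i]) * X[i]  # per-sample gradient
            theta -= lr * g
    return theta  # last iterate, no averaging
```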
arXiv Detail & Related papers (2021-02-05T14:02:20Z) - Estimating Stochastic Linear Combination of Non-linear Regressions
Efficiently and Scalably [23.372021234032363]
We show that when the sub-sample sizes are large enough, the estimation error is not sacrificed by too much.
To the best of our knowledge, this is the first work that provides theoretical guarantees for the stochastic linear combination of non-linear regressions model.
arXiv Detail & Related papers (2020-10-19T07:15:38Z) - Nonparametric Learning of Two-Layer ReLU Residual Units [22.870658194212744]
We describe an algorithm that learns two-layer residual units with rectified linear unit (ReLU) activation.
We design layer-wise objectives as functionals whose analytic minimizers express the exact ground-truth network in terms of its parameters and nonlinearities.
We prove the strong statistical consistency of our algorithm, and demonstrate its robustness and sample efficiency through experiments.
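As a hedged illustration of the object being learned, assuming the common residual parameterization $\mathbf{x} \mapsto \mathbf{B}\,\mathrm{ReLU}(\mathbf{A}\mathbf{x}) + \mathbf{x}$ (which may differ in detail from the paper's exact definition), here is the unit's forward pass; the layer-wise estimation procedure itself is not sketched.

```python
import numpy as np


def residual_unit(x, A, B):
    """Forward pass of a two-layer ReLU residual unit.

    Assumed form: x -> B @ relu(A @ x) + x, i.e., a two-layer ReLU block
    wrapped in a skip connection.
    """
    return B @ np.maximum(A @ x, 0.0) + x
```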
arXiv Detail & Related papers (2020-08-17T22:11:26Z)