pGMM Kernel Regression and Comparisons with Boosted Trees
- URL: http://arxiv.org/abs/2207.08667v1
- Date: Mon, 18 Jul 2022 15:06:30 GMT
- Title: pGMM Kernel Regression and Comparisons with Boosted Trees
- Authors: Ping Li and Weijie Zhao
- Abstract summary: In this work, we demonstrate the advantage of the pGMM kernel in the context of (ridge) regression.
Perhaps surprisingly, even without a tuning parameter (i.e., $p=1$ for the power parameter of the pGMM kernel), the pGMM kernel already performs well.
Perhaps also surprisingly, the best performance (in terms of the $L_2$ regression loss) is often attained at $p>2$, in some cases at $p\gg 2$.
- Score: 21.607059258448594
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we demonstrate the advantage of the pGMM (``powered generalized
min-max'') kernel in the context of (ridge) regression. In recent prior
studies, the pGMM kernel has been extensively evaluated for classification
tasks, for logistic regression, support vector machines, as well as deep neural
networks. In this paper, we provide an experimental study on ridge regression,
to compare the pGMM kernel regression with the ordinary ridge linear regression
as well as the RBF kernel ridge regression. Perhaps surprisingly, even without
a tuning parameter (i.e., $p=1$ for the power parameter of the pGMM kernel),
the pGMM kernel already performs well. Furthermore, by tuning the parameter
$p$, this (deceptively simple) pGMM kernel even performs quite comparably to
boosted trees.
Boosting and boosted trees are very popular in machine learning practice. For
regression tasks, typically, practitioners use $L_2$ boost, i.e., for
minimizing the $L_2$ loss. Sometimes for the purpose of robustness, the $L_1$
boost might be a choice. In this study, we implement $L_p$ boost for $p\geq 1$
and include it in the package of ``Fast ABC-Boost''. Perhaps also surprisingly,
the best performance (in terms of $L_2$ regression loss) is often attained at
$p>2$, in some cases at $p\gg 2$. This phenomenon has already been demonstrated
by Li et al (UAI 2010) in the context of k-nearest neighbor classification
using $L_p$ distances. In summary, the implementation of $L_p$ boost provides
practitioners the additional flexibility of tuning boosting algorithms for
potentially achieving better accuracy in regression applications.
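To make the pGMM kernel regression above concrete, here is a minimal Python sketch. It assumes the pGMM definition from Ping Li's earlier GMM-kernel work (split each coordinate into its positive and negative parts, then take the ratio of powered min-sums to powered max-sums) and plugs the precomputed kernel into scikit-learn's kernel ridge regression; it is an illustration, not the authors' implementation, and the data and hyperparameters are placeholders.

```python
# Sketch of pGMM kernel ridge regression (illustrative, not the paper's code).
# pGMM(u, v; p) = sum_i min(u~_i, v~_i)^p / sum_i max(u~_i, v~_i)^p,
# where u~ = [max(u, 0), max(-u, 0)] is the standard nonnegative transform.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def nonneg_transform(X):
    """Split each coordinate into its positive and negative parts."""
    return np.hstack([np.maximum(X, 0.0), np.maximum(-X, 0.0)])

def pgmm_kernel(X, Y, p=1.0):
    """Pairwise pGMM kernel matrix between rows of X and rows of Y."""
    Xt, Yt = nonneg_transform(X), nonneg_transform(Y)
    K = np.empty((Xt.shape[0], Yt.shape[0]))
    for i, x in enumerate(Xt):
        num = (np.minimum(x, Yt) ** p).sum(axis=1)  # powered min-sum
        den = (np.maximum(x, Yt) ** p).sum(axis=1)  # powered max-sum
        K[i] = num / den
    return K

# Toy usage with a precomputed kernel (p = 1 needs no tuning).
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(200, 10)), rng.normal(size=(50, 10))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=200)

model = KernelRidge(alpha=1.0, kernel="precomputed")
model.fit(pgmm_kernel(X_train, X_train, p=1.0), y_train)
y_pred = model.predict(pgmm_kernel(X_test, X_train, p=1.0))
```

The power $p$ (and the ridge penalty) would normally be chosen by cross-validation; the abstract's point is that even the untuned $p=1$ kernel already performs well.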
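The second part of the abstract concerns $L_p$ boost. As a hedged illustration of the idea (not the Fast ABC-Boost implementation), the sketch below runs a plain gradient-boosting loop in which each regression tree is fit to the negative gradient of the $|y - F|^p$ loss; all hyperparameters are arbitrary placeholders.

```python
# Illustrative L_p gradient boosting for regression (loss |y - F|^p, p >= 1).
# This is only a sketch of the idea; the paper's L_p boost is implemented
# inside the Fast ABC-Boost package with its own tree learner and updates.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def lp_boost_fit(X, y, p=3.0, n_rounds=200, lr=0.1, max_depth=3):
    """Fit an additive tree ensemble by descending the L_p loss."""
    init = y.mean()                 # constant initial prediction
    F = np.full(len(y), init)
    trees = []
    for _ in range(n_rounds):
        r = y - F
        # Negative gradient of |y - F|^p w.r.t. F, up to the constant factor p,
        # which is absorbed into the learning rate.
        g = np.sign(r) * np.abs(r) ** (p - 1.0)
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, g)
        F += lr * tree.predict(X)
        trees.append(tree)
    return init, trees

def lp_boost_predict(init, trees, X, lr=0.1):
    F = np.full(X.shape[0], init)
    for tree in trees:
        F += lr * tree.predict(X)
    return F
```

Setting $p=2$ recovers ordinary least-squares boosting (each tree fits the residuals) and $p=1$ gives the robust sign-of-residual variant; per the abstract, the best test $L_2$ loss is often reached at $p>2$.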
Related papers
- Asymptotics of Random Feature Regression Beyond the Linear Scaling
Regime [22.666759017118796]
Recent advances in machine learning have been achieved by using overparametrized models trained until near interpolation of the training data.
How does model complexity and generalization depend on the number of parameters $p$?
In particular, random feature ridge regression (RFRR) exhibits an intuitive trade-off between approximation and generalization power.
arXiv Detail & Related papers (2024-03-13T00:59:25Z) - Optimal Rates of Kernel Ridge Regression under Source Condition in Large
Dimensions [15.988264513040903]
We study the large-dimensional behavior of kernel ridge regression (KRR) where the sample size satisfies $n \asymp d^{\gamma}$ for some $\gamma > 0$.
Our results illustrate that the rate curves, viewed as functions of $\gamma$, exhibit periodic plateau behavior and multiple descent behavior.
arXiv Detail & Related papers (2024-01-02T16:14:35Z) - Optimal Rate of Kernel Regression in Large Dimensions [13.641780902673792]
We first build a general tool to characterize the upper bound and the minimax lower bound of kernel regression for large dimensional data.
We utilize the new tool to show that the minimax rate of the excess risk of kernel regression is $n^{-1/2}$ when $n \asymp d^{\gamma}$ for $\gamma = 2, 4, 6, 8, \cdots$.
arXiv Detail & Related papers (2023-09-08T11:29:05Z) - SKI to go Faster: Accelerating Toeplitz Neural Networks via Asymmetric
Kernels [69.47358238222586]
Toeplitz Neural Networks (TNNs) are a recent sequence model with impressive results.
We aim to reduce the O(n) computational complexity and the O(n) relative positional encoder (RPE) multi-layer perceptron (MLP) and decay bias calls.
For bidirectional models, this motivates a sparse plus low-rank Toeplitz matrix decomposition.
arXiv Detail & Related papers (2023-05-15T21:25:35Z) - Bounding the Width of Neural Networks via Coupled Initialization -- A
Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z) - Active Sampling for Linear Regression Beyond the $\ell_2$ Norm [70.49273459706546]
We study active sampling algorithms for linear regression, which aim to query only a small number of entries of a target vector.
We show that this dependence on $d$ is optimal, up to logarithmic factors.
We also provide the first total sensitivity upper bound of $O(d^{\max\{1, p/2\}} \log^2 n)$ for loss functions with at most degree $p$ growth.
arXiv Detail & Related papers (2021-11-09T00:20:01Z) - Mixability made efficient: Fast online multiclass logistic regression [68.8204255655161]
We show that mixability can be a powerful tool to obtain algorithms with optimal regret.
The resulting methods often suffer from high computational complexity which has reduced their practical applicability.
arXiv Detail & Related papers (2021-10-08T08:22:05Z) - Online nonparametric regression with Sobolev kernels [99.12817345416846]
We derive the regret upper bounds on the classes of Sobolev spaces $W_p^{\beta}(\mathcal{X})$, $p \geq 2$, $\beta > \frac{d}{p}$.
The upper bounds are supported by the minimax regret analysis, which reveals that in the cases $\beta > \frac{d}{2}$ or $p = \infty$ these rates are (essentially) optimal.
arXiv Detail & Related papers (2021-02-06T15:05:14Z) - Compressed Deep Networks: Goodbye SVD, Hello Robust Low-Rank
Approximation [23.06440095688755]
A common technique for compressing a neural network is to compute the $k$-rank $\ell_2$ approximation $A_{k,2}$ of the matrix $A \in \mathbb{R}^{n \times d}$ that corresponds to a fully connected layer (or embedding layer); a minimal sketch of this baseline appears after this list.
Here, $d$ is the number of neurons in the layer, $n$ is the number in the next one, and $A_{k,2}$ can be stored in $O((n+d)k)$ memory instead of $O(nd)$.
arXiv Detail & Related papers (2020-09-11T20:21:42Z) - Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is $L_4$-$L_2$ hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number and $\epsilon$ has bounded variance, and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is
arXiv Detail & Related papers (2020-07-16T06:44:44Z) - Statistical-Query Lower Bounds via Functional Gradients [19.5924910463796]
We show that any statistical-query algorithm with tolerance $n^{-(1/\epsilon)^b}$ must use at least $2^{n^c \epsilon}$ queries for some constant $b$.
Our results rule out general (as opposed to correlational) SQ learning algorithms, which is unusual for real-valued learning problems.
arXiv Detail & Related papers (2020-06-29T05:15:32Z)
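As referenced in the Compressed Deep Networks entry above, the standard $\ell_2$ baseline it discusses is a rank-$k$ truncated SVD of a layer's weight matrix, stored with $O((n+d)k)$ numbers instead of $O(nd)$. The numpy sketch below illustrates only that baseline (not the robust low-rank approximation that paper proposes); the sizes are arbitrary.

```python
# Rank-k truncated-SVD compression of a weight matrix A (n x d):
# store two factors of total size (n + d) * k instead of n * d entries.
import numpy as np

n, d, k = 512, 256, 16
A = np.random.default_rng(1).normal(size=(n, d))  # stand-in weight matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
left = U[:, :k] * s[:k]            # n x k factor (columns scaled by singular values)
right = Vt[:k, :]                  # k x d factor
stored = left.size + right.size    # (n + d) * k numbers vs. n * d for A

A_k = left @ right                 # best rank-k approximation in Frobenius / l2 norm
```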
This list is automatically generated from the titles and abstracts of the papers in this site.