SIMPLE: A Gradient Estimator for $k$-Subset Sampling
- URL: http://arxiv.org/abs/2210.01941v2
- Date: Thu, 6 Jun 2024 23:33:51 GMT
- Title: SIMPLE: A Gradient Estimator for $k$-Subset Sampling
- Authors: Kareem Ahmed, Zhe Zeng, Mathias Niepert, Guy Van den Broeck,
- Abstract summary: In this work, we fall back to discrete $k$-subset sampling on the forward pass.
We show that our gradient estimator, SIMPLE, exhibits lower bias and variance compared to state-of-the-art estimators.
Empirical results show improved performance on learning to explain and sparse linear regression.
- Score: 42.38652558807518
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: $k$-subset sampling is ubiquitous in machine learning, enabling regularization and interpretability through sparsity. The challenge lies in rendering $k$-subset sampling amenable to end-to-end learning. This has typically involved relaxing the reparameterized samples to allow for backpropagation, with the risk of introducing high bias and high variance. In this work, we fall back to discrete $k$-subset sampling on the forward pass. This is coupled with using the gradient with respect to the exact marginals, computed efficiently, as a proxy for the true gradient. We show that our gradient estimator, SIMPLE, exhibits lower bias and variance compared to state-of-the-art estimators, including the straight-through Gumbel estimator when $k = 1$. Empirical results show improved performance on learning to explain and sparse linear regression. We provide an algorithm for computing the exact ELBO for the $k$-subset distribution, obtaining significantly lower loss compared to SOTA.
Related papers
- Improving the Convergence Rates of Forward Gradient Descent with Repeated Sampling [5.448070998907116]
Forward gradient descent (FGD) has been proposed as a biologically more plausible alternative of gradient descent.
In this paper we show that by computing $ell$ FGD steps based on each training sample, this suboptimality factor becomes $d/(ell wedge d)$.
We also show that FGD with repeated sampling can adapt to low-dimensional structure in the input distribution.
arXiv Detail & Related papers (2024-11-26T16:28:16Z) - Revisiting Score Function Estimators for $k$-Subset Sampling [5.464421236280698]
We show how to efficiently compute the $k$-subset distribution's score function using a discrete Fourier transform.
The resulting estimator provides both exact samples and unbiased gradient estimates.
Experiments in feature selection show results competitive with current methods, despite weaker assumptions.
arXiv Detail & Related papers (2024-07-22T21:26:39Z) - Retire: Robust Expectile Regression in High Dimensions [3.9391041278203978]
Penalized quantile and expectile regression methods offer useful tools to detect heteroscedasticity in high-dimensional data.
We propose and study (penalized) robust expectile regression (retire)
We show that the proposed procedure can be efficiently solved by a semismooth Newton coordinate descent algorithm.
arXiv Detail & Related papers (2022-12-11T18:03:12Z) - Adaptive Sketches for Robust Regression with Importance Sampling [64.75899469557272]
We introduce data structures for solving robust regression through gradient descent (SGD)
Our algorithm effectively runs $T$ steps of SGD with importance sampling while using sublinear space and just making a single pass over the data.
arXiv Detail & Related papers (2022-07-16T03:09:30Z) - Heavy-tailed Streaming Statistical Estimation [58.70341336199497]
We consider the task of heavy-tailed statistical estimation given streaming $p$ samples.
We design a clipped gradient descent and provide an improved analysis under a more nuanced condition on the noise of gradients.
arXiv Detail & Related papers (2021-08-25T21:30:27Z) - Rao-Blackwellizing the Straight-Through Gumbel-Softmax Gradient
Estimator [93.05919133288161]
We show that the variance of the straight-through variant of the popular Gumbel-Softmax estimator can be reduced through Rao-Blackwellization.
This provably reduces the mean squared error.
We empirically demonstrate that this leads to variance reduction, faster convergence, and generally improved performance in two unsupervised latent variable models.
arXiv Detail & Related papers (2020-10-09T22:54:38Z) - Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and or binary weights the training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate a more stable and better performing training in deep convolutional models.
arXiv Detail & Related papers (2020-06-04T21:51:21Z) - Carath\'eodory Sampling for Stochastic Gradient Descent [79.55586575988292]
We present an approach that is inspired by classical results of Tchakaloff and Carath'eodory about measure reduction.
We adaptively select the descent steps where the measure reduction is carried out.
We combine this with Block Coordinate Descent so that measure reduction can be done very cheaply.
arXiv Detail & Related papers (2020-06-02T17:52:59Z) - Learning Entangled Single-Sample Distributions via Iterative Trimming [28.839136703139225]
We analyze a simple and computationally efficient method based on iteratively trimming samples and re-estimating the parameter on the trimmed sample set.
We show that the method in logarithmic iterations outputs an estimation whose error only depends on the noise level of the $lceil alpha n rceil$-th noisiest data point.
arXiv Detail & Related papers (2020-04-20T18:37:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.