Fast Differentiable Clipping-Aware Normalization and Rescaling
- URL: http://arxiv.org/abs/2007.07677v1
- Date: Wed, 15 Jul 2020 13:43:22 GMT
- Title: Fast Differentiable Clipping-Aware Normalization and Rescaling
- Authors: Jonas Rauber, Matthias Bethge
- Abstract summary: We show that the optimal rescaling can be found analytically using a fast and differentiable algorithm.
Our algorithm works for any p-norm and can be used to train neural networks on inputs with normalized perturbations.
- Score: 22.320256458354137
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Rescaling a vector $\vec{\delta} \in \mathbb{R}^n$ to a desired length is a
common operation in many areas such as data science and machine learning. When
the rescaled perturbation $\eta \vec{\delta}$ is added to a starting point
$\vec{x} \in D$ (where $D$ is the data domain, e.g. $D = [0, 1]^n$), the
resulting vector $\vec{v} = \vec{x} + \eta \vec{\delta}$ will in general not be
in $D$. To enforce that the perturbed vector $\vec{v}$ is in $D$, the values of
$\vec{v}$ can be clipped to $D$. This subsequent element-wise clipping to the
data domain does however reduce the effective perturbation size and thus
interferes with the rescaling of $\vec{\delta}$. The optimal rescaling $\eta$
to obtain a perturbation with the desired norm after the clipping can be
iteratively approximated using a binary search. However, such an iterative
approach is slow and non-differentiable. Here we show that the optimal
rescaling can be found analytically using a fast and differentiable algorithm.
Our algorithm works for any p-norm and can be used to train neural networks on
inputs with normalized perturbations. We provide native implementations for
PyTorch, TensorFlow, JAX, and NumPy based on EagerPy.
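The iterative baseline the abstract describes can be sketched as follows: binary-search for the scale $\eta$ such that the *clipped* perturbation $\mathrm{clip}(\vec{x} + \eta\vec{\delta}) - \vec{x}$ has the desired $p$-norm. This is a minimal NumPy sketch of the slow, non-differentiable approach the paper replaces with an analytical algorithm; the function name and defaults are illustrative, not the paper's API.

```python
import numpy as np

def clipping_aware_rescale_binary_search(x, delta, eps, p=2, iters=60):
    """Approximate eta so that ||clip(x + eta*delta, 0, 1) - x||_p == eps.

    The clipped norm is monotonically nondecreasing in eta, so binary
    search converges; note it saturates if eps exceeds the largest
    norm achievable inside the data domain D = [0, 1]^n.
    """
    def clipped_norm(eta):
        v = np.clip(x + eta * delta, 0.0, 1.0)
        return np.linalg.norm((v - x).ravel(), ord=p)

    lo, hi = 0.0, 1.0
    # Grow the upper bracket until the clipped norm reaches eps.
    while clipped_norm(hi) < eps and hi < 1e12:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if clipped_norm(mid) < eps:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Each of the `iters` norm evaluations costs $O(n)$, and the branching makes the result non-differentiable in `eps` and `delta`, which is exactly the drawback the paper's analytical solution avoids.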
Related papers
- Optimal Query Complexities for Dynamic Trace Estimation [59.032228008383484]
We consider the problem of minimizing the number of matrix-vector queries needed for accurate trace estimation in the dynamic setting where our underlying matrix is changing slowly.
We provide a novel binary tree summation procedure that simultaneously estimates all $m$ traces up to $\epsilon$ error with $\delta$ failure probability.
Our lower bounds (1) give the first tight bounds for Hutchinson's estimator in the matrix-vector product model with Frobenius norm error even in the static setting, and (2) are the first unconditional lower bounds for dynamic trace estimation.
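Hutchinson's estimator, referenced in the lower bound above, is a standard matrix-vector-query trace estimator: $\mathrm{tr}(A) \approx \frac{1}{m}\sum_{i=1}^{m} \vec{v}_i^\top A \vec{v}_i$ for random Rademacher vectors $\vec{v}_i$. A minimal sketch (not the paper's dynamic variant):

```python
import numpy as np

def hutchinson_trace(matvec, n, num_queries, seed=None):
    # Estimate tr(A) using only matrix-vector products with A:
    # E[v^T A v] = tr(A) when v has i.i.d. +/-1 (Rademacher) entries.
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_queries):
        v = rng.choice([-1.0, 1.0], size=n)
        total += v @ matvec(v)
    return total / num_queries
```

For a diagonal matrix the estimator is exact ($v_i^2 = 1$ for every entry); in general the error decays with the number of queries, which is the quantity the paper's bounds control.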
arXiv Detail & Related papers (2022-09-30T04:15:44Z) - Sketching Algorithms and Lower Bounds for Ridge Regression [65.0720777731368]
We give a sketching-based iterative algorithm that computes $1+\varepsilon$ approximate solutions for the ridge regression problem.
We also show that this algorithm can be used to give faster algorithms for kernel ridge regression.
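For reference, the exact solution that sketching-based methods approximate is the regularized normal-equations solve $\vec{x}^* = (A^\top A + \lambda I)^{-1} A^\top \vec{b}$. A minimal sketch of this baseline (not the paper's algorithm):

```python
import numpy as np

def ridge_solve(A, b, lam):
    # Exact ridge regression: argmin_x ||Ax - b||^2 + lam * ||x||^2,
    # solved via the regularized normal equations. Sketching-based
    # methods trade this O(n d^2) cost for a faster 1+eps approximation.
    n, d = A.shape
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)
```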
arXiv Detail & Related papers (2022-04-13T22:18:47Z) - Low-Rank Approximation with $1/\epsilon^{1/3}$ Matrix-Vector Products [58.05771390012827]
We study iterative methods based on Krylov subspaces for low-rank approximation under any Schatten-$p$ norm.
Our main result is an algorithm that uses only $\tilde{O}(k/\sqrt{\epsilon})$ matrix-vector products.
arXiv Detail & Related papers (2022-02-10T16:10:41Z) - Active Sampling for Linear Regression Beyond the $\ell_2$ Norm [70.49273459706546]
We study active sampling algorithms for linear regression, which aim to query only a small number of entries of a target vector.
We show that this dependence on $d$ is optimal, up to logarithmic factors.
We also provide the first total sensitivity upper bound $O(d^{\max\{1, p/2\}} \log^2 n)$ for loss functions with at most degree $p$ growth.
arXiv Detail & Related papers (2021-11-09T00:20:01Z) - Tractability from overparametrization: The example of the negative perceptron [9.077560551700163]
We analyze a linear programming algorithm to characterize the corresponding threshold $\delta_{\text{lin}}(\kappa)$.
We observe a gap relative to the threshold $\delta_{\text{lin}}(\kappa)$, raising the question of the behavior of other algorithms.
arXiv Detail & Related papers (2021-10-28T01:00:13Z) - Coresets for Decision Trees of Signals [19.537354146654845]
We provide the first algorithm that outputs such a $(k,\varepsilon)$-coreset for every such matrix $D$.
This is done by forging a link between decision trees from machine learning and partition trees from computational geometry.
arXiv Detail & Related papers (2021-10-07T05:49:55Z) - Optimal Regret Algorithm for Pseudo-1d Bandit Convex Optimization [51.23789922123412]
We study online learning with bandit feedback (i.e., the learner has access only to a zeroth-order oracle) where cost/reward functions admit a "pseudo-1d" structure.
We show a lower bound of $\min(\sqrt{dT}, T^{3/4})$ for the regret of any algorithm, where $T$ is the number of rounds.
We propose a new algorithm that combines randomized online gradient descent with a kernelized exponential weights method to exploit the pseudo-1d structure effectively.
arXiv Detail & Related papers (2021-02-15T08:16:51Z) - Compressed Deep Networks: Goodbye SVD, Hello Robust Low-Rank Approximation [23.06440095688755]
A common technique for compressing a neural network is to compute the rank-$k$ $\ell_2$ approximation $A_{k,2}$ of the matrix $A \in \mathbb{R}^{n \times d}$ that corresponds to a fully connected layer (or embedding layer).
Here, $d$ is the number of neurons in the layer, $n$ is the number in the next one, and $A_{k,2}$ can be stored in $O((n+d)k)$ memory instead of $O(nd)$.
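The standard rank-$k$ compression the summary above describes can be sketched with a truncated SVD; the two thin factors realize the $O((n+d)k)$ storage. This is the classical SVD baseline (which the paper argues against in favor of a robust alternative), not the paper's own method:

```python
import numpy as np

def rank_k_factors(A, k):
    # Truncated SVD: keep the top-k singular triplets so the layer's
    # weight matrix A (n x d) is stored as two thin factors,
    # O((n+d)k) memory instead of the dense O(n*d).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    left = U[:, :k] * s[:k]   # n x k
    right = Vt[:k, :]         # k x d
    return left, right        # A_k = left @ right
```

By the Eckart-Young theorem, `left @ right` is the best rank-$k$ approximation of `A` in Frobenius and spectral norm.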
arXiv Detail & Related papers (2020-09-11T20:21:42Z) - On Efficient Low Distortion Ultrametric Embedding [18.227854382422112]
A widely-used method to preserve the underlying hierarchical structure of the data is to find an embedding of the data into a tree or an ultrametric.
In this paper, we provide a new algorithm which takes as input a set of points in $\mathbb{R}^d$ (for a universal constant $\rho>1$) and outputs an ultrametric $\Delta$.
We show that the output of our algorithm is comparable to the output of the linkage algorithms while achieving a much faster running time.
arXiv Detail & Related papers (2020-08-15T11:06:45Z) - The Average-Case Time Complexity of Certifying the Restricted Isometry Property [66.65353643599899]
In compressed sensing, the restricted isometry property (RIP) on $M times N$ sensing matrices guarantees efficient reconstruction of sparse vectors.
We investigate the exact average-case time complexity of certifying the RIP for $M \times N$ matrices with i.i.d. $\mathcal{N}(0, 1/M)$ entries.
arXiv Detail & Related papers (2020-05-22T16:55:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.