Fast, Differentiable and Sparse Top-k: a Convex Analysis Perspective
- URL: http://arxiv.org/abs/2302.01425v3
- Date: Sun, 4 Jun 2023 07:58:10 GMT
- Title: Fast, Differentiable and Sparse Top-k: a Convex Analysis Perspective
- Authors: Michael E. Sander, Joan Puigcerver, Josip Djolonga, Gabriel Peyré and Mathieu Blondel
- Abstract summary: The top-k operator returns a sparse vector, where the non-zero values correspond to the k largest values of the input.
We view the top-k operator as a linear program over the permutahedron, the convex hull of permutations.
Our framework is significantly more general than the existing one and allows, for example, to express top-k operators that select values in magnitude.
- Score: 21.6146047576295
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The top-k operator returns a sparse vector, where the non-zero values
correspond to the k largest values of the input. Unfortunately, because it is a
discontinuous function, it is difficult to incorporate in neural networks
trained end-to-end with backpropagation. Recent works have considered
differentiable relaxations, based either on regularization or perturbation
techniques. However, to date, no approach is fully differentiable and sparse.
In this paper, we propose new differentiable and sparse top-k operators. We
view the top-k operator as a linear program over the permutahedron, the convex
hull of permutations. We then introduce a p-norm regularization term to smooth
out the operator, and show that its computation can be reduced to isotonic
optimization. Our framework is significantly more general than the existing one
and allows, for example, to express top-k operators that select values in
magnitude. On the algorithmic side, in addition to pool adjacent violators (PAV)
algorithms, we propose a new GPU/TPU-friendly Dykstra algorithm to solve
isotonic optimization problems. We successfully use our operators to prune
weights in neural networks, to fine-tune vision transformers, and as a router
in sparse mixture of experts.
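As a concrete illustration of the reduction above, the following is a minimal NumPy sketch, not the authors' released implementation. It assumes the quadratic case (p = 2), where the relaxed top-k mask is the Euclidean projection of x / eps onto the permutahedron generated by w = (1, ..., 1, 0, ..., 0), computed here with a hand-rolled Pool Adjacent Violators (PAV) routine; the names pav_decreasing and soft_topk_mask are chosen only for this example.

```python
# Sketch under the p = 2 assumption: relaxed top-k mask as the projection of
# x / eps onto the permutahedron of w = (1, ..., 1, 0, ..., 0).
import numpy as np

def pav_decreasing(s):
    """Isotonic regression of s under non-increasing constraints (PAV)."""
    blocks = []  # each block stores [sum, size]; its fitted value is sum / size
    for val in s:
        blocks.append([float(val), 1.0])
        # Merge adjacent blocks while the non-increasing constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]:
            total, size = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += size
    return np.concatenate([np.full(int(size), total / size) for total, size in blocks])

def soft_topk_mask(x, k, eps=1.0):
    """Relaxed top-k mask: entries in [0, 1], summing to k, sparse,
    and differentiable almost everywhere in x.

    Smaller eps approaches the hard 0/1 indicator; larger eps gives a softer mask.
    """
    z = np.asarray(x, dtype=float) / eps
    n = z.shape[0]
    w = np.zeros(n)
    w[:k] = 1.0                        # vertex whose permutations generate the polytope
    sigma = np.argsort(-z)             # permutation sorting z in descending order
    v = pav_decreasing(z[sigma] - w)   # isotonic subproblem on the sorted values
    return z - v[np.argsort(sigma)]    # undo the sort: projection onto the permutahedron

x = np.array([0.3, 2.0, -0.5, 1.4, 0.1])
print(soft_topk_mask(x, k=2, eps=0.5))  # [0. 1. 0. 1. 0.]: recovers the hard mask
print(soft_topk_mask(x, k=2, eps=2.0))  # softer mask; still sums to k and keeps exact zeros
```

For end-to-end training the same computation would be re-expressed in an autodiff framework (e.g. JAX or PyTorch); the sequential PAV loop is what the GPU/TPU-friendly Dykstra scheme mentioned in the abstract is proposed as an alternative to, and a magnitude-selecting variant (e.g. for weight pruning) can be obtained by computing the mask on |x|.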
Related papers
- Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - On Model Compression for Neural Networks: Framework, Algorithm, and Convergence Guarantee [21.818773423324235]
This paper focuses on two model compression techniques: low-rank approximation and weight approximation.
In this paper, a holistic framework is proposed for model compression from a novel perspective of nonconvex optimization.
arXiv Detail & Related papers (2023-03-13T02:14:42Z) - Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z) - A Unified Framework for Implicit Sinkhorn Differentiation [58.56866763433335]
We propose an algorithm that obtains analytical gradients of a Sinkhorn layer via implicit differentiation.
We show that it is computationally more efficient, particularly when resources like GPU memory are scarce.
arXiv Detail & Related papers (2022-05-13T14:45:31Z) - Alternating Mahalanobis Distance Minimization for Stable and Accurate CP
Decomposition [4.847980206213335]
We introduce a new formulation for deriving singular values and vectors of a tensor by considering the critical points of a function different from what is used in the previous work.
We show that a subsweep of this algorithm can achieve a superlinear convergence rate for exact CPD with known rank.
We then view the algorithm as optimizing a Mahalanobis distance with respect to each factor with ground metric dependent on the other factors.
arXiv Detail & Related papers (2022-04-14T19:56:36Z) - Divide and Learn: A Divide and Conquer Approach for Predict+Optimize [50.03608569227359]
The predict+optimize problem combines machine learning of problem coefficients with an optimization problem that uses the predicted coefficients.
We show how to directly express the loss of the optimization problem in terms of the predicted coefficients as a piece-wise linear function.
We propose a novel divide-and-conquer algorithm to tackle optimization problems without this restriction and predict its coefficients using the optimization loss.
arXiv Detail & Related papers (2020-12-04T00:26:56Z) - Deep neural networks for inverse problems with pseudodifferential
operators: an application to limited-angle tomography [0.4110409960377149]
We propose a novel convolutional neural network (CNN) designed for learning pseudodifferential operators ($\Psi$DOs) in the context of linear inverse problems.
We show that, under rather general assumptions on the forward operator, the unfolded iterations of ISTA can be interpreted as the successive layers of a CNN.
In particular, we prove that, in the case of LA-CT, the operations of upscaling, downscaling and convolution, can be exactly determined by combining the convolutional nature of the limited angle X-ray transform and basic properties defining a wavelet system.
arXiv Detail & Related papers (2020-06-02T14:03:41Z) - Differentiable Top-k Operator with Optimal Transport [135.36099648554054]
The SOFT top-k operator approximates the output of the top-k operation as the solution of an Entropic Optimal Transport (EOT) problem (a generic sketch of this construction is given at the end of this list).
We apply the proposed operator to the k-nearest neighbors and beam search algorithms, and demonstrate improved performance.
arXiv Detail & Related papers (2020-02-16T04:57:52Z) - Variance Reduction with Sparse Gradients [82.41780420431205]
Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients.
We introduce a new sparsity operator: the random-top-k operator.
Our algorithm consistently outperforms SpiderBoost on various tasks including image classification, natural language processing, and sparse matrix factorization.
arXiv Detail & Related papers (2020-01-27T08:23:58Z)
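For contrast with the regularized linear-programming view taken in the main paper, below is a rough sketch of the entropic optimal transport relaxation behind operators such as SOFT, referenced in the list above. It is an illustrative reconstruction rather than that paper's code: the two anchor values, the squared-distance cost, the uniform source marginal, the fixed number of Sinkhorn iterations and the name sinkhorn_topk are all assumptions made for this example.

```python
# Hedged sketch: a soft top-k score via entropic optimal transport (Sinkhorn),
# in the spirit of the SOFT operator; details are illustrative, not exact.
import numpy as np

def sinkhorn_topk(x, k, eps=0.1, n_iter=200):
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    anchors = np.array([x.min(), x.max()])      # targets: "not selected" / "selected"
    C = (x[:, None] - anchors[None, :]) ** 2    # n x 2 squared-distance cost
    a = np.full(n, 1.0 / n)                     # source marginal: uniform over the scores
    b = np.array([(n - k) / n, k / n])          # target marginal: mass k/n on "selected"
    K = np.exp(-C / eps)                        # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iter):                     # Sinkhorn scaling iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]             # entropic transport plan
    return n * P[:, 1]                          # soft membership in the top-k set

x = np.array([0.3, 2.0, -0.5, 1.4, 0.1])
print(sinkhorn_topk(x, k=2))  # close to 1 for the two largest scores, small but nonzero elsewhere
```

Because the entropic plan is strictly positive, the resulting scores are dense (no exact zeros), which is the gap the sparse operators of the main paper are designed to close.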