Truncated Kernel Stochastic Gradient Descent on Spheres
- URL: http://arxiv.org/abs/2410.01570v2
- Date: Fri, 4 Oct 2024 13:51:16 GMT
- Title: Truncated Kernel Stochastic Gradient Descent on Spheres
- Authors: JinHui Bai, Lei Shi
- Abstract summary: Inspired by the structure of spherical harmonics, we propose the truncated kernel stochastic gradient descent (T-kernel SGD) algorithm.
T-kernel SGD employs a "truncation" operation, enabling the application of series-based kernel functions in stochastic gradient descent.
In contrast to traditional kernel SGD, T-kernel SGD is more effective in balancing bias and variance.
- Score: 1.4583059436979549
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inspired by the structure of spherical harmonics, we propose the truncated kernel stochastic gradient descent (T-kernel SGD) algorithm with a least-square loss function for spherical data fitting. T-kernel SGD employs a "truncation" operation, enabling the application of series-based kernel functions in stochastic gradient descent, thereby avoiding the difficulties of finding suitable closed-form kernel functions in high-dimensional spaces. In contrast to traditional kernel SGD, T-kernel SGD is more effective in balancing bias and variance by dynamically adjusting the hypothesis space during iterations. The most significant advantage of the proposed algorithm is that it can achieve theoretically optimal convergence rates using a constant step size (independent of the sample size) while overcoming the inherent saturation problem of kernel SGD. Additionally, we leverage the structure of spherical polynomials to derive an equivalent T-kernel SGD, significantly reducing storage and computational costs compared to kernel SGD. Typically, T-kernel SGD requires only $\mathcal{O}(n^{1+\frac{d}{d-1}\epsilon})$ computational complexity and $\mathcal{O}(n^{\frac{d}{d-1}\epsilon})$ storage to achieve optimal rates for the $d$-dimensional sphere, where $0<\epsilon<\frac{1}{2}$ can be arbitrarily small if the optimal fitting or the underlying space possesses sufficient regularity. This regularity is determined by the smoothness parameter of the objective function and the decaying rate of the eigenvalues of the integral operator associated with the kernel function, both of which reflect the difficulty of the estimation problem. Our main results quantitatively characterize how this prior information influences the convergence of T-kernel SGD. The numerical experiments further validate the theoretical findings presented in this paper.
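To make the "truncation" idea concrete, below is a minimal, hedged sketch of a least-squares SGD update in a truncated basis with a constant step size. It uses a Fourier basis on the circle as a simple stand-in for spherical harmonics on the sphere; the function names, step size, and truncation growth rate are illustrative assumptions, not the authors' exact algorithm or parameter choices.

```python
import numpy as np

def fourier_features(theta, degree):
    """Evaluate the (2*degree + 1)-dimensional Fourier basis at angle theta."""
    feats = [np.ones_like(theta)]
    for k in range(1, degree + 1):
        feats.append(np.cos(k * theta))
        feats.append(np.sin(k * theta))
    return np.stack(feats, axis=-1)

def truncated_kernel_sgd(thetas, ys, step=0.05, max_degree=20, growth=0.3):
    """One pass of least-squares SGD in a growing truncated basis, constant step size."""
    coef = np.zeros(2 * max_degree + 1)
    for t, (theta, y) in enumerate(zip(thetas, ys), start=1):
        # Truncation level (hypothesis space) grows slowly with the iteration index t;
        # the precise growth rate in the paper depends on the regularity parameter epsilon.
        degree = min(max_degree, int(np.ceil(t ** growth)))
        dim = 2 * degree + 1
        phi = fourier_features(np.asarray(theta), degree)   # basis vector at theta
        residual = coef[:dim] @ phi - y
        coef[:dim] -= step * residual * phi                  # SGD step on the squared loss
    return coef

# Usage: fit a smooth target on the circle from noisy samples.
rng = np.random.default_rng(0)
thetas = rng.uniform(0.0, 2.0 * np.pi, size=2000)
ys = np.sin(3.0 * thetas) + 0.1 * rng.normal(size=thetas.shape)
coef = truncated_kernel_sgd(thetas, ys)
```

Growing the truncation level with the iteration index mirrors how T-kernel SGD enlarges the hypothesis space during iterations, which is what allows a constant step size to balance bias and variance in place of the decaying schedules used by traditional kernel SGD.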
Related papers
- GRAPE optimization for open quantum systems with time-dependent
decoherence rates driven by coherent and incoherent controls [77.34726150561087]
The GRadient Ascent Pulse Engineering (GRAPE) method is widely used for optimization in quantum control.
We adopt GRAPE method for optimizing objective functionals for open quantum systems driven by both coherent and incoherent controls.
The efficiency of the algorithm is demonstrated through numerical simulations for the state-to-state transition problem.
arXiv Detail & Related papers (2023-07-17T13:37:18Z) - On Convergence of Incremental Gradient for Non-Convex Smooth Functions [63.51187646914962]
In machine learning and network optimization, algorithms like shuffle SGD are popular because they minimize the number of cache misses and make good use of the cache.
This paper delves into the convergence properties of SGD algorithms with arbitrary data ordering.
arXiv Detail & Related papers (2023-05-30T17:47:27Z) - CEDAS: A Compressed Decentralized Stochastic Gradient Method with Improved Convergence [9.11726703830074]
In this paper, we consider solving the distributed optimization problem under a communication-restricted setting.
We propose a method based on compressed exact diffusion, termed "CEDAS".
arXiv Detail & Related papers (2023-01-14T09:49:15Z) - Utilising the CLT Structure in Stochastic Gradient based Sampling :
Improved Analysis and Faster Algorithms [14.174806471635403]
We consider approximations of sampling algorithms, such as Stochastic Gradient Langevin Dynamics (SGLD) and the Random Batch Method (RBM) for Interacting Particle Dynamics (IPD).
We observe that the noise introduced by the approximation is nearly Gaussian due to the Central Limit Theorem (CLT) while the driving Brownian motion is exactly Gaussian.
We harness this structure to absorb the approximation error inside the diffusion process, and obtain improved convergence guarantees for these algorithms.
arXiv Detail & Related papers (2022-06-08T10:17:40Z) - Gaussian Process Inference Using Mini-batch Stochastic Gradient Descent:
Convergence Guarantees and Empirical Benefits [21.353189917487512]
Stochastic gradient descent (SGD) and its variants have established themselves as the go-to algorithms for machine learning problems.
We take a step forward by proving minibatch SGD converges to a critical point of the full log-likelihood loss function.
Our theoretical guarantees hold provided that the kernel functions exhibit exponential or eigendecay.
arXiv Detail & Related papers (2021-11-19T22:28:47Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study the properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Optimal policy evaluation using kernel-based temporal difference methods [78.83926562536791]
We use reproducing kernel Hilbert spaces for estimating the value function of an infinite-horizon discounted Markov reward process.
We derive a non-asymptotic upper bound on the error with explicit dependence on the eigenvalues of the associated kernel operator.
We prove minimax lower bounds over sub-classes of MRPs.
arXiv Detail & Related papers (2021-09-24T14:48:20Z) - Improving the Transient Times for Distributed Stochastic Gradient
Methods [5.215491794707911]
We study a distributed stochastic gradient algorithm, called exact diffusion with adaptive stepsizes (EDAS).
We show EDAS achieves the same network-independent convergence rate as centralized stochastic gradient descent (SGD).
To the best of our knowledge, EDAS achieves the shortest transient time when the average of the $n$ cost functions is strongly convex.
arXiv Detail & Related papers (2021-05-11T08:09:31Z) - Flow-based Kernel Prior with Application to Blind Super-Resolution [143.21527713002354]
Kernel estimation is generally one of the key problems for blind image super-resolution (SR).
This paper proposes a normalizing flow-based kernel prior (FKP) for kernel modeling.
Experiments on synthetic and real-world images demonstrate that the proposed FKP can significantly improve the kernel estimation accuracy.
arXiv Detail & Related papers (2021-03-29T22:37:06Z) - Convergence of Gaussian-smoothed optimal transport distance with
sub-gamma distributions and dependent samples [12.77426855794452]
This paper provides convergence guarantees for estimating the GOT distance under more general settings.
A key step in our analysis is to show that the GOT distance is dominated by a family of kernel maximum mean discrepancy distances.
arXiv Detail & Related papers (2021-02-28T04:30:23Z) - Spectral density estimation with the Gaussian Integral Transform [91.3755431537592]
The spectral density operator $\hat{\rho}(\omega)=\delta(\omega-\hat{H})$ plays a central role in linear response theory.
We describe a near optimal quantum algorithm providing an approximation to the spectral density.
arXiv Detail & Related papers (2020-04-10T03:14:38Z)