Stochastic Two Points Method for Deep Model Zeroth-order Optimization
- URL: http://arxiv.org/abs/2402.01621v3
- Date: Mon, 27 May 2024 14:56:01 GMT
- Title: Stochastic Two Points Method for Deep Model Zeroth-order Optimization
- Authors: Yijiang Pang, Jiayu Zhou
- Abstract summary: This paper introduces an efficient Stochastic Two-Point (S2P) approach within the gradient-free regime.
We present the theoretical convergence properties of S2P under the general and relaxed smoothness assumptions.
We show that VS2P is highly effective in optimizing objectives for deep models.
- Score: 32.459322001738144
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large foundation models, such as large language models, have performed exceptionally well in various application scenarios. Building or fully fine-tuning such large models is usually prohibitive due to either hardware budget or lack of access to backpropagation. The zeroth-order methods offer a promising direction for tackling this challenge, where only forward passes are needed to update the model. This paper introduces an efficient Stochastic Two-Point (S2P) approach within the gradient-free regime. We present the theoretical convergence properties of S2P under the general and relaxed smoothness assumptions, and the derived results help understand and inherently connect the two popular types of zeroth-order methods, basic random search and stochastic three-point method. The theoretical properties also shed light on a Variant of S2P (VS2P), through exploiting our new convergence properties that better represent the dynamics of deep models in training. Our comprehensive empirical results show that VS2P is highly effective in optimizing objectives for deep models. It outperforms or achieves competitive performance compared to standard methods across various model types and scales.
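As a rough illustration of the gradient-free regime described above, the sketch below implements a generic two-point (central-difference) zeroth-order update along a random direction, using only forward evaluations of the objective. The function name, the fixed step size `lr`, and the smoothing radius `mu` are illustrative assumptions and do not reproduce the paper's exact S2P or VS2P update rules.
```python
import numpy as np

def two_point_zo_step(f, x, lr=0.01, mu=1e-3, rng=None):
    """One generic two-point zeroth-order update (illustrative sketch only).

    Queries the objective at x + mu*u and x - mu*u along a random Gaussian
    direction u (forward passes only, no backpropagation) and steps against
    the resulting directional-derivative estimate.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(x.shape)                    # random search direction
    slope = (f(x + mu * u) - f(x - mu * u)) / (2 * mu)  # finite-difference slope along u
    return x - lr * slope * u                           # descend along u

# Toy usage: minimize a quadratic with function evaluations only.
f = lambda x: float(np.sum((x - 1.0) ** 2))
x = np.zeros(10)
for _ in range(2000):
    x = two_point_zo_step(f, x)
print(round(f(x), 6))  # approaches 0
```
Note that the fixed `lr` and the plain Gaussian direction above merely stand in for the step-size and perturbation choices that S2P and VS2P actually analyze.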
Related papers
- Fine-Tuning Discrete Diffusion Models with Policy Gradient Methods [4.028503203417233]
We propose an efficient, broadly applicable, and theoretically justified policy gradient algorithm for fine-tuning discrete diffusion models over non-differentiable rewards.
Our numerical experiments across several discrete generative tasks demonstrate the scalability and efficiency of our method.
arXiv Detail & Related papers (2025-02-03T14:20:19Z) - Fast Solvers for Discrete Diffusion Models: Theory and Applications of High-Order Algorithms [31.42317398879432]
Current inference approaches mainly fall into two categories: exact simulation and approximate methods such as $\tau$-leaping.
In this work, we advance the latter category by tailoring the first extension of high-order numerical inference schemes to discrete diffusion models.
We rigorously analyze the proposed schemes and establish the second-order accuracy of the $\theta$-trapezoidal method in KL divergence.
arXiv Detail & Related papers (2025-02-01T00:25:21Z) - Singular Value Scaling: Efficient Generative Model Compression via Pruned Weights Refinement [9.454314879815337]
Pruned weights in generative models often exhibit dominant singular vectors, hindering fine-tuning efficiency and leading to suboptimal performance.
We introduce Singular Value Scaling (SVS), a versatile technique for refining pruned weights, applicable to both model types.
SVS improves compression performance across model types without additional training costs.
arXiv Detail & Related papers (2024-12-23T08:40:08Z) - Latent Semantic Consensus For Deterministic Geometric Model Fitting [109.44565542031384]
We propose an effective method called Latent Semantic Consensus (LSC).
LSC formulates the model fitting problem into two latent semantic spaces based on data points and model hypotheses.
LSC is able to provide consistent and reliable solutions within only a few milliseconds for general multi-structural model fitting.
arXiv Detail & Related papers (2024-03-11T05:35:38Z) - The Convex Landscape of Neural Networks: Characterizing Global Optima and Stationary Points via Lasso Models [75.33431791218302]
Deep Neural Network (DNN) models are widely used in machine learning applications.
In this paper, we examine the use of convex neural recovery models.
We show that all stationary points of the non-convex training objective can be characterized as global optima of a subsampled convex program.
arXiv Detail & Related papers (2023-12-19T23:04:56Z) - DiffComplete: Diffusion-based Generative 3D Shape Completion [114.43353365917015]
We introduce a new diffusion-based approach for shape completion on 3D range scans.
We strike a balance between realism, multi-modality, and high fidelity.
DiffComplete sets a new SOTA performance on two large-scale 3D shape completion benchmarks.
arXiv Detail & Related papers (2023-06-28T16:07:36Z) - Human Trajectory Prediction via Neural Social Physics [63.62824628085961]
Trajectory prediction has been widely pursued in many fields, and many model-based and model-free methods have been explored.
We propose a new method combining both methodologies based on a new Neural Differential Equation model.
Our new model (Neural Social Physics or NSP) is a deep neural network within which we use an explicit physics model with learnable parameters.
arXiv Detail & Related papers (2022-07-21T12:11:18Z) - On the Role of Bidirectionality in Language Model Pre-Training [85.14614350372004]
We study the role of bidirectionality in next token prediction, text infilling, zero-shot priming and fine-tuning.
We train models with up to 6.7B parameters, and find differences to remain consistent at scale.
arXiv Detail & Related papers (2022-05-24T02:25:05Z) - Robust Binary Models by Pruning Randomly-initialized Networks [57.03100916030444]
We propose ways to obtain models that are robust against adversarial attacks from randomly-initialized binary networks.
We learn the structure of the robust model by pruning a randomly-initialized binary network.
Our method confirms the strong lottery ticket hypothesis in the presence of adversarial attacks.
arXiv Detail & Related papers (2022-02-03T00:05:08Z) - A Second look at Exponential and Cosine Step Sizes: Simplicity, Adaptivity, and Performance [23.89815527019194]
Stochastic Gradient Descent (SGD) is a popular tool for training large-scale machine learning models.
Its performance is highly variable, depending crucially on the choice of step sizes.
A variety of strategies for tuning the step sizes, such as exponential and cosine schedules, have been proposed; a generic sketch of such schedules follows this entry.
arXiv Detail & Related papers (2020-02-12T23:10:38Z)
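For context on the step-size schedules named in the title of the last entry, here is a minimal sketch of generic exponential and cosine step-size schedules; the base step `eta0`, decay factor `alpha`, and horizon `T` are illustrative assumptions, and the exact parameterization studied in that paper may differ.
```python
import math

def exponential_step_size(t, eta0=0.1, alpha=0.999):
    """Generic exponential decay: eta_t = eta0 * alpha**t (illustrative)."""
    return eta0 * alpha ** t

def cosine_step_size(t, T, eta0=0.1):
    """Generic cosine decay over a horizon of T iterations (illustrative)."""
    return eta0 * 0.5 * (1.0 + math.cos(math.pi * t / T))

# Example: evaluate both schedules at a few points of a 1000-step run.
for t in (0, 250, 500, 1000):
    print(t, round(exponential_step_size(t), 5), round(cosine_step_size(t, T=1000), 5))
```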
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences of its use.