BBTv2: Pure Black-Box Optimization Can Be Comparable to Gradient Descent
for Few-Shot Learning
- URL: http://arxiv.org/abs/2205.11200v1
- Date: Mon, 23 May 2022 11:10:19 GMT
- Title: BBTv2: Pure Black-Box Optimization Can Be Comparable to Gradient Descent
for Few-Shot Learning
- Authors: Tianxiang Sun, Zhengfu He, Hong Qian, Xuanjing Huang, Xipeng Qiu
- Abstract summary: Black-Box Tuning is a derivative-free approach to optimize continuous prompt tokens prepended to the input of language models.
We present BBTv2, a pure black-box optimization approach that can drive language models to achieve comparable results to gradient-based optimization.
- Score: 83.26610968655815
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Black-Box Tuning (BBT) is a derivative-free approach to optimize continuous
prompt tokens prepended to the input of language models. Although BBT has
achieved comparable performance to full model tuning on simple classification
tasks under few-shot settings, it requires pre-trained prompt embedding to
match model tuning on hard tasks (e.g., entailment tasks), and therefore does
not completely get rid of the dependence on gradients. In this paper we present
BBTv2, a pure black-box optimization approach that can drive language models to
achieve comparable results to gradient-based optimization. In particular, we
prepend continuous prompt tokens to every layer of the language model and
propose a divide-and-conquer algorithm to alternately optimize the prompt
tokens at different layers. For the optimization at each layer, we perform
derivative-free optimization in a low-dimensional subspace, which is then
randomly projected to the original prompt parameter space. Experimental results
show that BBTv2 not only outperforms BBT by a large margin, but also achieves
comparable or even better performance than full model tuning and
state-of-the-art parameter-efficient methods (e.g., Adapter, LoRA, BitFit,
etc.) under few-shot learning settings, while maintaining much fewer tunable
parameters.
Related papers
- Black-Box Tuning of Vision-Language Models with Effective Gradient
Approximation [71.21346469382821]
We introduce collaborative black-box tuning (CBBT) for both textual prompt optimization and output feature adaptation for black-box models.
CBBT is extensively evaluated on eleven downstream benchmarks and achieves remarkable improvements compared to existing black-box VL adaptation methods.
arXiv Detail & Related papers (2023-12-26T06:31:28Z) - Multi-fidelity Constrained Optimization for Stochastic Black Box
Simulators [1.6385815610837167]
We introduce the algorithm Scout-Nd (Stochastic Constrained Optimization for N dimensions) to tackle the issues mentioned earlier.
Scout-Nd efficiently estimates the gradient, reduces the noise of the estimator gradient, and applies multi-fidelity schemes to further reduce computational effort.
We validate our approach on standard benchmarks, demonstrating its effectiveness in optimizing parameters highlighting better performance compared to existing methods.
arXiv Detail & Related papers (2023-11-25T23:36:38Z) - Towards Adaptive Prefix Tuning for Parameter-Efficient Language Model
Fine-tuning [32.84435258519842]
We propose Adaptive Prefix Tuning (APT) to adjust the prefix in terms of both fine-grained token level and coarse-grained layer level with a gate mechanism.
Experiments on the SuperGLUE and NER datasets show the effectiveness of APT.
arXiv Detail & Related papers (2023-05-24T14:51:01Z) - Make Prompt-based Black-Box Tuning Colorful: Boosting Model Generalization from Three Orthogonal Perspectives [28.138689389803034]
Large language models (LLMs) have shown increasing power on various natural language processing (NLP) tasks.
Black-box tuning has been proposed to address this problem by optimizing task-specific prompts without accessing the gradients and hidden representations.
We describe BBT-RGB, a suite of straightforward and complementary techniques for enhancing the efficiency and performance of black-box optimization.
arXiv Detail & Related papers (2023-05-14T07:33:59Z) - Trainable Projected Gradient Method for Robust Fine-tuning [36.470333094917436]
We propose Trainable Projected Gradient Method (TPGM) to automatically learn the constraint imposed for each layer for a fine-grained fine-tuning regularization.
This is motivated by formulating fine-tuning as a bi-level constrained optimization problem.
We show that TPGM outperforms existing fine-tuning methods in OOD performance while matching the best in-distribution (ID) performance.
arXiv Detail & Related papers (2023-03-19T17:30:44Z) - Joint inference and input optimization in equilibrium networks [68.63726855991052]
deep equilibrium model is a class of models that foregoes traditional network depth and instead computes the output of a network by finding the fixed point of a single nonlinear layer.
We show that there is a natural synergy between these two settings.
We demonstrate this strategy on various tasks such as training generative models while optimizing over latent codes, training models for inverse problems like denoising and inpainting, adversarial training and gradient based meta-learning.
arXiv Detail & Related papers (2021-11-25T19:59:33Z) - 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Statefuls maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop first gradients that use 8-bit statistics while maintaining the performance levels of using 32-bit gradient states.
arXiv Detail & Related papers (2021-10-06T15:43:20Z) - Prefix-Tuning: Optimizing Continuous Prompts for Generation [85.6357778621526]
Fine-tuning is the de facto way to leverage large pretrained language models to perform downstream tasks.
We propose prefix-tuning, a lightweight alternative to fine-tuning for natural language generation tasks.
We find that by learning only 0.1% of the parameters, prefix-tuning obtains comparable performance in the full data setting.
arXiv Detail & Related papers (2021-01-01T08:00:36Z) - Self-Tuning Stochastic Optimization with Curvature-Aware Gradient
Filtering [53.523517926927894]
We explore the use of exact per-sample Hessian-vector products and gradients to construct self-tuning quadratics.
We prove that our model-based procedure converges in noisy gradient setting.
This is an interesting step for constructing self-tuning quadratics.
arXiv Detail & Related papers (2020-11-09T22:07:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.