Tune As You Scale: Hyperparameter Optimization For Compute Efficient
Training
- URL: http://arxiv.org/abs/2306.08055v1
- Date: Tue, 13 Jun 2023 18:22:24 GMT
- Title: Tune As You Scale: Hyperparameter Optimization For Compute Efficient
Training
- Authors: Abraham J. Fetterman, Ellie Kitanidis, Joshua Albrecht, Zachary
Polizzi, Bryden Fogelman, Maksis Knutins, Bartosz Wróblewski, James B.
Simon, Kanjun Qiu
- Abstract summary: We propose a practical method for robustly tuning large models.
CARBS performs local search around the performance-cost Pareto frontier.
Among our results, we effectively solve the entire ProcGen benchmark just by tuning a simple baseline.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hyperparameter tuning of deep learning models can lead to order-of-magnitude
performance gains for the same amount of compute. Despite this, systematic
tuning is uncommon, particularly for large models, which are expensive to
evaluate and tend to have many hyperparameters, necessitating difficult
judgment calls about tradeoffs, budgets, and search bounds. To address these
issues and propose a practical method for robustly tuning large models, we
present Cost-Aware Pareto Region Bayesian Search (CARBS), a Bayesian
optimization algorithm that performs local search around the performance-cost
Pareto frontier. CARBS does well even in unbounded search spaces with many
hyperparameters, learns scaling relationships so that it can tune models even
as they are scaled up, and automates much of the "black magic" of tuning. Among
our results, we effectively solve the entire ProcGen benchmark just by tuning a
simple baseline (PPO, as provided in the original ProcGen paper). We also
reproduce the model size vs. training tokens scaling result from the Chinchilla
project (Hoffmann et al. 2022), while simultaneously discovering scaling laws
for every other hyperparameter, via an easy automated process that uses
significantly less compute and is applicable to any deep learning problem (not
just language models).
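To make the core idea concrete, below is a minimal, illustrative sketch in Python of local search around a performance-cost Pareto frontier. It is not the authors' implementation: CARBS additionally fits surrogate models for performance and cost and uses an acquisition rule to choose candidates, while the log-space Gaussian perturbation and the function names here are simplifying assumptions.

# Illustrative sketch only: local search around the performance-cost Pareto
# frontier. CARBS itself uses probabilistic surrogates and an acquisition
# function; those are omitted here.
import math
import random

def pareto_front(observations):
    """Keep observations not strictly dominated in (lower cost, higher performance)."""
    front = []
    for i, a in enumerate(observations):
        dominated = any(
            j != i
            and b["cost"] <= a["cost"] and b["perf"] >= a["perf"]
            and (b["cost"] < a["cost"] or b["perf"] > a["perf"])
            for j, b in enumerate(observations)
        )
        if not dominated:
            front.append(a)
    return front

def propose_candidate(observations, sigma=0.3):
    """Local search: perturb a random Pareto member in log-space.
    Assumes every hyperparameter is a positive scalar (learning rate, width, ...)."""
    anchor = random.choice(pareto_front(observations))
    return {
        name: math.exp(math.log(value) + random.gauss(0.0, sigma))
        for name, value in anchor["params"].items()
    }

def tune(train_and_eval, initial_params, budget=20):
    """train_and_eval(params) -> (performance, compute_cost); higher performance is better."""
    perf, cost = train_and_eval(initial_params)
    observations = [{"params": dict(initial_params), "perf": perf, "cost": cost}]
    for _ in range(budget):
        params = propose_candidate(observations)
        perf, cost = train_and_eval(params)
        observations.append({"params": params, "perf": perf, "cost": cost})
    return pareto_front(observations)  # the tuned performance-cost tradeoff curve

Replacing the single random perturbation with many sampled candidates scored by a probabilistic surrogate recovers the Bayesian-optimization character of the method.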
Related papers
- A Comparative Study of Hyperparameter Tuning Methods [0.0]
Tree-structured Parzen Estimator (TPE), Genetic Search, and Random Search are evaluated across regression and classification tasks.
Random Search excelled in regression tasks, while TPE was more effective for classification tasks.
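To make such a comparison concrete, a minimal sketch with Optuna swaps random search for TPE behind the same objective; the toy objective and search space are stand-ins, not the study's actual benchmarks.

import optuna

def objective(trial):
    # Toy stand-in objective; the study above evaluates real regression/classification tasks.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    depth = trial.suggest_int("depth", 2, 10)
    return (lr - 1e-2) ** 2 + (depth - 6) ** 2  # pretend validation loss

for sampler in (optuna.samplers.RandomSampler(seed=0), optuna.samplers.TPESampler(seed=0)):
    study = optuna.create_study(direction="minimize", sampler=sampler)
    study.optimize(objective, n_trials=50)
    print(type(sampler).__name__, study.best_value)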
arXiv Detail & Related papers (2024-08-29T10:35:07Z)
- E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning [55.50908600818483]
Fine-tuning large-scale pretrained vision models for new tasks has become increasingly parameter-intensive.
We propose an Effective and Efficient Visual Prompt Tuning (E2VPT) approach for large-scale transformer-based model adaptation.
Our approach outperforms several state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2023-07-25T19:03:21Z)
- Parameter-efficient Tuning of Large-scale Multimodal Foundation Model [68.24510810095802]
We propose Aurora, a graceful prompt framework for cross-modal transfer, to overcome these challenges.
Considering the redundancy in existing architectures, we first use mode approximation to generate 0.1M trainable parameters for multimodal prompt tuning.
A thorough evaluation on six cross-modal benchmarks shows that it not only outperforms the state-of-the-art but even outperforms the full fine-tuning approach.
arXiv Detail & Related papers (2023-05-15T06:40:56Z)
- Learning to Optimize Permutation Flow Shop Scheduling via Graph-based Imitation Learning [70.65666982566655]
Permutation flow shop scheduling (PFSS) is widely used in manufacturing systems.
We propose to train the model via expert-driven imitation learning, which yields faster and more stable convergence.
Our model's network parameters are reduced to only 37% of the state-of-the-art baseline's, and its average solution gap to the expert solutions decreases from 6.8% to 1.3%.
arXiv Detail & Related papers (2022-10-31T09:46:26Z)
- Pre-training helps Bayesian optimization too [49.28382118032923]
We seek an alternative practice for setting functional priors.
In particular, we consider the scenario where we have data from similar functions that allow us to pre-train a tighter distribution a priori.
Our results show that our method is able to locate good hyperparameters at least 3 times more efficiently than the best competing methods.
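One simple way to approximate this idea (not the paper's actual method) is to fit a Gaussian-process kernel on pooled data from related tasks and reuse the fitted kernel, unchanged, as a tighter prior for the new task; the data below are placeholders.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern

# Placeholder observations pooled from similar, previously tuned functions.
rng = np.random.default_rng(0)
X_related = rng.uniform(size=(60, 3))   # hyperparameter settings
y_related = rng.normal(size=60)         # observed objective values

# "Pre-train" the prior: fit kernel hyperparameters on the related-task data.
pretrained = GaussianProcessRegressor(kernel=ConstantKernel() * Matern(nu=2.5),
                                      normalize_y=True).fit(X_related, y_related)

# Reuse the fitted kernel for the new task without re-optimizing it (optimizer=None),
# so the surrogate starts from a tighter, informed prior.
new_task_gp = GaussianProcessRegressor(kernel=pretrained.kernel_,
                                       optimizer=None, normalize_y=True)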
arXiv Detail & Related papers (2022-07-07T04:42:54Z)
- Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning [81.3514358542452]
Few-shot in-context learning (ICL) incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made.
Parameter-efficient fine-tuning offers an alternative paradigm in which a small set of parameters is trained to enable a model to perform the new task.
In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs.
arXiv Detail & Related papers (2022-05-11T17:10:41Z)
- Towards Robust and Automatic Hyper-Parameter Tunning [39.04604349338802]
We introduce a new class of HPO methods and explore how the low-rank factorization of the intermediate layers of a convolutional network can be used to define an analytical response surface.
We quantify how this surface behaves as a surrogate for model performance and show that it can be optimized with a trust-region search algorithm, which we call autoHyper.
arXiv Detail & Related papers (2021-11-28T05:27:34Z)
- HYPPO: A Surrogate-Based Multi-Level Parallelism Tool for Hyperparameter Optimization [0.2844198651668139]
HYPPO uses adaptive surrogate models and accounts for uncertainty in model predictions to find accurate and reliable models that make robust predictions.
We demonstrate various software features on time-series prediction and image classification problems as well as a scientific application in computed tomography image reconstruction.
arXiv Detail & Related papers (2021-10-04T20:14:22Z)
- High-Dimensional Bayesian Optimization with Multi-Task Learning for RocksDB [0.0]
RocksDB is a general-purpose embedded key-value store.
This paper investigates maximizing the throughput of RocksDB IO operations by auto-tuning ten parameters.
arXiv Detail & Related papers (2021-03-30T11:38:52Z)
- Self-Tuning Stochastic Optimization with Curvature-Aware Gradient Filtering [53.523517926927894]
We explore the use of exact per-sample Hessian-vector products and gradients to construct self-tuning quadratics.
We prove that our model-based procedure converges in the noisy gradient setting.
This is an interesting step toward constructing self-tuning quadratics.
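The key ingredient mentioned above, an exact Hessian-vector product, can be sketched with double backpropagation in PyTorch; the quadratic model construction and curvature-aware filtering from the paper are not reproduced here.

import torch

def hessian_vector_product(loss, params, vec):
    """Exact HVP: differentiate (dloss/dparams) . vec a second time (double backprop)."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    dot = sum((g * v).sum() for g, v in zip(grads, vec))
    return torch.autograd.grad(dot, params)

# Tiny example: quadratic loss on a 3-parameter model, whose Hessian is 2 * I.
w = torch.randn(3, requires_grad=True)
loss = (w ** 2).sum()
hvp = hessian_vector_product(loss, [w], [torch.ones(3)])
print(hvp[0])  # tensor([2., 2., 2.])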
arXiv Detail & Related papers (2020-11-09T22:07:30Z)
- Weighting Is Worth the Wait: Bayesian Optimization with Importance Sampling [34.67740033646052]
By learning a parameterization of IS that trades off evaluation complexity and quality, we improve upon the state-of-the-art Bayesian optimization runtime and final validation error across a variety of datasets and complex neural architectures.
arXiv Detail & Related papers (2020-02-23T15:52:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.