Related papers: Asynchronous Decentralized Bayesian Optimization for Large Scale Hyperparameter Optimization

URL: http://arxiv.org/abs/2207.00479v4
Date: Tue, 26 Sep 2023 07:02:28 GMT
Title: Asynchronous Decentralized Bayesian Optimization for Large Scale Hyperparameter Optimization
Authors: Romain Egele, Isabelle Guyon, Venkatram Vishwanath, Prasanna Balaprakash
Abstract summary: In BO, a computationally cheap surrogate model is employed to learn the relationship between parameter configurations and their performance. We present an asynchronous-decentralized BO, wherein each worker runs a sequential BO and asynchronously communicates its results through shared storage. We scale our method to 1,920 parallel workers without loss of computational efficiency, keeping worker utilization above 95%.
Score: 13.89136187674851
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Bayesian optimization (BO) is a promising approach for hyperparameter optimization of deep neural networks (DNNs), where each model training can take minutes to hours. In BO, a computationally cheap surrogate model is employed to learn the relationship between parameter configurations and their performance, such as accuracy. Parallel BO methods often adopt single-manager/multiple-worker strategies to evaluate multiple hyperparameter configurations simultaneously. Despite significant hyperparameter evaluation time, the overhead in such centralized schemes prevents these methods from scaling to a large number of workers. We present an asynchronous-decentralized BO, wherein each worker runs a sequential BO and asynchronously communicates its results through shared storage. We scale our method to 1,920 parallel workers (the full production queue of the Polaris supercomputer) without loss of computational efficiency, keeping worker utilization above 95%, and demonstrate improvement in model accuracy as well as faster convergence on the CANDLE benchmark from the Exascale Computing Project.
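
The decentralized scheme described in the abstract (each worker runs its own sequential BO loop and exchanges results only through shared storage) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `SharedStorage` class, the toy `objective`, and the nearest-neighbour `suggest` function are hypothetical stand-ins chosen so the example runs with the standard library only, and threads stand in for the parallel workers.

```python
import random
import threading


class SharedStorage:
    """Thread-safe stand-in for the shared storage (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._results = []  # list of (x, y) pairs from all workers

    def push(self, x, y):
        with self._lock:
            self._results.append((x, y))

    def snapshot(self):
        with self._lock:
            return list(self._results)


def objective(x):
    # Toy loss to minimize; a real run would train and score a model here.
    return (x - 0.3) ** 2


def suggest(history, rng, n_candidates=64, kappa=0.5):
    """Toy surrogate (assumption, not the paper's model): predict the loss of
    a candidate from its nearest evaluated point, minus a distance bonus that
    rewards exploring far from known points."""
    if not history:
        return rng.random()
    best_x, best_score = None, float("inf")
    for _ in range(n_candidates):
        c = rng.random()
        dist, y_near = min((abs(c - x), y) for x, y in history)
        score = y_near - kappa * dist  # lower is better
        if score < best_score:
            best_x, best_score = c, score
    return best_x


def worker(worker_id, storage, n_evals):
    rng = random.Random(worker_id)  # each worker has its own random stream
    for _ in range(n_evals):
        history = storage.snapshot()  # read all published results, no waiting
        x = suggest(history, rng)     # local sequential BO step
        y = objective(x)              # the expensive evaluation in practice
        storage.push(x, y)            # publish without any manager process


def run(n_workers=4, n_evals=10):
    storage = SharedStorage()
    threads = [
        threading.Thread(target=worker, args=(i, storage, n_evals))
        for i in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return storage.snapshot()


if __name__ == "__main__":
    results = run()
    print(len(results), min(results, key=lambda r: r[1]))
```

No worker ever waits for another or for a central manager; the only coordination point in this sketch is the short lock around reads and writes to the shared list.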

Related papers

AsyncMesh: Fully Asynchronous Optimization for Data and Pipeline Parallelism [54.8494905524997]
We introduce asynchronous updates across both parallelism axes, relaxing the co-location requirement. We provide convergence guarantees for both sparse averaging and asynchronous updates. Experiments on large-scale language models demonstrate that our approach matches the performance of the fully synchronous baseline.
arXiv Detail & Related papers (2026-01-30T01:24:47Z)
Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z)
Ringleader ASGD: The First Asynchronous SGD with Optimal Time Complexity under Data Heterogeneity [51.56484100374058]
We introduce Ringleader ASGD, the first asynchronous algorithm that attains the theoretical lower bounds for parallel computation. Our analysis further establishes that Ringleader ASGD remains optimal under arbitrary gradient and even time-varying speeds.
arXiv Detail & Related papers (2025-09-26T19:19:15Z)
Nesterov Method for Asynchronous Pipeline Parallel Optimization [59.79227116582264]
We introduce a variant of Nesterov Accelerated Gradient (NAG) for asynchronous optimization in Pipeline Parallelism. Specifically, we modify the look-ahead step in NAG to effectively address the staleness in gradients. We theoretically prove that our approach converges at a sublinear rate in the presence of fixed delay in gradients.
arXiv Detail & Related papers (2025-05-02T08:23:29Z)
Resource-Adaptive Successive Doubling for Hyperparameter Optimization with Large Datasets on High-Performance Computing Systems [0.4334105740533729]
This article proposes a novel Resource-Adaptive Successive Doubling Algorithm (RASDA). It combines a resource-adaptive successive doubling scheme with the plain Asynchronous Successive Halving Algorithm (ASHA). It is applied to different types of Neural Networks (NNs) and trained on large datasets from the Computer Vision (CV), Computational Fluid Dynamics (CFD), and Additive Manufacturing (AM) domains.
arXiv Detail & Related papers (2024-12-03T11:25:48Z)
FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency. We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs). We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses the existing parallelism schemes. Our results demonstrate an improvement of up to 52.4% in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
Ravnest: Decentralized Asynchronous Training on Heterogeneous Devices [0.0]
Ravnest facilitates decentralized training by efficiently organizing compute nodes into clusters. We have framed our asynchronous SGD loss function as a block structured optimization problem with delayed updates.
arXiv Detail & Related papers (2024-01-03T13:07:07Z)
Straggler-Resilient Decentralized Learning via Adaptive Asynchronous Updates [28.813671194939225]
Fully decentralized optimization methods have been advocated as alternatives to the popular parameter server framework. We propose a fully decentralized algorithm with adaptive asynchronous updates that adaptively determines the number of neighbor workers each worker communicates with. We show that DSGD-AAU achieves a linear speedup for convergence and demonstrate its effectiveness via extensive experiments.
arXiv Detail & Related papers (2023-06-11T02:08:59Z)
Massively Parallel Genetic Optimization through Asynchronous Propagation of Populations [50.591267188664666]
Propulate is an evolutionary optimization algorithm and software package for global optimization. We provide an MPI-based implementation of our algorithm, which features variants of selection, mutation, crossover, and migration. We find that Propulate is up to three orders of magnitude faster without sacrificing solution accuracy.
arXiv Detail & Related papers (2023-01-20T18:17:34Z)
Asynchronous Parallel Incremental Block-Coordinate Descent for Decentralized Machine Learning [55.198301429316125]
Machine learning (ML) is a key technique for big-data-driven modelling and analysis of massive Internet of Things (IoT) based intelligent and ubiquitous computing. For fast-increasing applications and data amounts, distributed learning is a promising emerging paradigm since it is often impractical or inefficient to share/aggregate data. This paper studies the problem of training an ML model over decentralized systems, where data are distributed over many user devices.
arXiv Detail & Related papers (2022-02-07T15:04:15Z)
Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge Computing [113.52575069030192]
Big data, including applications with high security requirements, are often collected and stored on multiple heterogeneous devices, such as mobile devices, drones and vehicles. Due to the limitations of communication costs and security requirements, it is of paramount importance to extract information in a decentralized manner instead of aggregating data to a fusion center. We consider the problem of learning model parameters in a multi-agent system with data locally processed via distributed edge nodes. A class of mini-batch alternating direction method of multipliers (ADMM) algorithms is explored to develop the distributed learning model.
arXiv Detail & Related papers (2020-10-02T10:41:59Z)
Restructuring, Pruning, and Adjustment of Deep Models for Parallel Distributed Inference [15.720414948573753]
We consider the parallel implementation of an already-trained deep model on multiple processing nodes (a.k.a. workers). We propose RePurpose, a layer-wise model restructuring and pruning technique that guarantees the performance of the overall parallelized model. We show that, compared to the existing methods, RePurpose significantly improves the efficiency of the distributed inference via parallel implementation.
arXiv Detail & Related papers (2020-08-19T06:44:41Z)
Simple and Scalable Parallelized Bayesian Optimization [2.512827436728378]
We propose a simple and scalable BO method for asynchronous parallel settings. Experiments are carried out with a benchmark function and hyperparameter optimization of multi-layer perceptrons.
arXiv Detail & Related papers (2020-06-24T10:25:27Z)
Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks [50.42141893913188]
We study distributed algorithms for large-scale AUC maximization with a deep neural network as the predictive model. In theory, our algorithm requires far fewer communication rounds. Our experiments on several datasets show the effectiveness of our algorithm and confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models. This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.