ATA: Adaptive Task Allocation for Efficient Resource Management in Distributed Machine Learning
- URL: http://arxiv.org/abs/2502.00775v1
- Date: Sun, 02 Feb 2025 12:22:26 GMT
- Title: ATA: Adaptive Task Allocation for Efficient Resource Management in Distributed Machine Learning
- Authors: Artavazd Maranjyan, El Mehdi Saad, Peter Richtárik, Francesco Orabona
- Abstract summary: Asynchronous methods are fundamental for parallelizing computations in distributed machine learning.
We propose ATA (Adaptive Task Allocation), a method that adapts to heterogeneous and random distributions of computation times.
We show that ATA identifies the optimal task allocation and performs comparably to methods with prior knowledge of computation times.
- Score: 54.08906841213777
- License:
- Abstract: Asynchronous methods are fundamental for parallelizing computations in distributed machine learning. They aim to accelerate training by fully utilizing all available resources. However, their greedy approach can lead to inefficiencies, using more computation than required, especially when computation times vary across devices. If the computation times were known in advance, training could be fast and resource-efficient by assigning more tasks to faster workers. The challenge lies in achieving this optimal allocation without prior knowledge of the computation time distributions. In this paper, we propose ATA (Adaptive Task Allocation), a method that adapts to heterogeneous and random distributions of worker computation times. Through rigorous theoretical analysis, we show that ATA identifies the optimal task allocation and performs comparably to methods with prior knowledge of computation times. Experimental results further demonstrate that ATA is resource-efficient, significantly reducing costs compared to the greedy approach, which can be arbitrarily expensive depending on the number of workers.
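The underlying idea, learning worker speeds on the fly and assigning more tasks to workers that have empirically proven faster, can be illustrated with a small simulation. The sketch below is a toy under stated assumptions (exponential computation times, allocation proportional to estimated speed, hypothetical function names); it is not the paper's ATA algorithm.

```python
import random

# Toy adaptive allocation sketch (assumptions, not the paper's ATA method):
# estimate each worker's mean computation time online and assign more tasks
# per round to workers whose estimated mean is smaller. Task durations are
# simulated as exponential random variables; all names here are hypothetical.

def simulate_task_time(mean_time):
    """One simulated task duration on a worker with the given mean."""
    return random.expovariate(1.0 / mean_time)

def adaptive_allocation(true_means, rounds=200, budget_per_round=8):
    n = len(true_means)
    obs_time = [0.0] * n   # cumulative observed computation time per worker
    obs_count = [0] * n    # number of completed tasks per worker

    # Exploration phase: probe every worker once for an initial estimate.
    for i in range(n):
        obs_time[i] += simulate_task_time(true_means[i])
        obs_count[i] += 1

    for _ in range(rounds):
        est_mean = [obs_time[i] / obs_count[i] for i in range(n)]
        speeds = [1.0 / m for m in est_mean]
        total_speed = sum(speeds)
        # Allocate the round's task budget roughly proportionally to estimated speed,
        # keeping at least one task per worker so estimates keep improving.
        alloc = [max(1, round(budget_per_round * s / total_speed)) for s in speeds]
        for i, k in enumerate(alloc):
            for _ in range(k):
                obs_time[i] += simulate_task_time(true_means[i])
                obs_count[i] += 1

    return obs_count  # faster workers end up with many more completed tasks

if __name__ == "__main__":
    random.seed(0)
    # Three heterogeneous workers: mean task times of 0.1s, 0.5s, and 2.0s.
    print(adaptive_allocation([0.1, 0.5, 2.0]))
```

Running the sketch shows the fastest worker accumulating most of the completed tasks, which is the kind of resource-efficient allocation the abstract contrasts with the greedy approach.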
Related papers
- Optimizing Automatic Differentiation with Deep Reinforcement Learning [0.9353041869660692]
We present a novel method to optimize the number of necessary multiplications for Jacobian computation by leveraging deep reinforcement learning (RL).
We show that this method achieves up to 33% improvements over state-of-the-art methods on several relevant tasks taken from diverse domains.
arXiv Detail & Related papers (2024-06-07T15:44:33Z) - Switchable Decision: Dynamic Neural Generation Networks [98.61113699324429]
We propose a switchable decision to accelerate inference by dynamically assigning resources for each data instance.
Our method benefits from less cost during inference while keeping the same accuracy.
arXiv Detail & Related papers (2024-05-07T17:44:54Z) - Sample Efficient Reinforcement Learning by Automatically Learning to
Compose Subtasks [3.1594865504808944]
We propose an RL algorithm that automatically structures the reward function for sample efficiency, given a set of labels that signify subtasks.
We evaluate our algorithm in a variety of sparse-reward environments.
arXiv Detail & Related papers (2024-01-25T15:06:40Z) - Fast and Straggler-Tolerant Distributed SGD with Reduced Computation
Load [11.069252535469644]
Stochastic optimization procedures like stochastic gradient descent (SGD) can be leveraged to mitigate the effect of unresponsive or slow workers, called stragglers.
This can be done by waiting for only a subset of the workers to finish their computation at each iteration of the algorithm (see the sketch after this list).
We construct a novel scheme that adapts both the number of workers and the computation load throughout the run-time of the algorithm.
arXiv Detail & Related papers (2023-04-17T20:12:18Z) - RSC: Accelerating Graph Neural Networks Training via Randomized Sparse
Computations [56.59168541623729]
Training graph neural networks (GNNs) is time-consuming because sparse graph-based operations are hard to accelerate in hardware.
We explore trading off the computational precision to reduce the time complexity via sampling-based approximation.
We propose Randomized Sparse Computation, which for the first time demonstrates the potential of training GNNs with approximated operations.
arXiv Detail & Related papers (2022-10-19T17:25:33Z) - Efficient Distributed Machine Learning via Combinatorial Multi-Armed
Bandits [23.289979018463406]
We consider a distributed gradient descent problem where a main node distributes gradient calculations among $n$ workers from which at most $b \leq n$ can be utilized in parallel.
By assigning tasks to all the workers and waiting only for the $k$ fastest ones, the main node can trade-off the error of the algorithm with its runtime by gradually increasing $k$ as the algorithm evolves.
This strategy, referred to as adaptive k sync, can incur additional costs since it ignores the computational efforts of slow workers.
We propose a cost-efficient scheme that assigns tasks only to $k$ workers.
arXiv Detail & Related papers (2022-02-16T19:18:19Z) - Coded Computation across Shared Heterogeneous Workers with Communication
Delay [42.50248255900814]
We consider a multi-worker distributed computing scenario, where multiple matrix multiplication tasks are encoded and allocated to workers for parallel computation.
We propose worker assignment, resource allocation, and load allocation algorithms under both dedicated and fractional worker assignment policies, where each worker can process encoded tasks.
We show that the proposed algorithms can reduce the task completion delay compared to benchmarks, and observe that dedicated and fractional worker assignment policies have different scopes of application.
arXiv Detail & Related papers (2021-09-23T09:40:54Z) - A Machine Learning Approach for Task and Resource Allocation in Mobile
Edge Computing Based Networks [108.57859531628264]
A joint task, spectrum, and transmit power allocation problem is investigated for a wireless network.
The proposed algorithm can reduce the number of iterations needed for convergence and the maximal delay among all users by up to 18% and 11.1% compared to the standard Q-learning algorithm.
arXiv Detail & Related papers (2020-07-20T13:46:42Z) - Coded Distributed Computing with Partial Recovery [56.08535873173518]
We introduce a novel coded matrix-vector multiplication scheme, called coded computation with partial recovery (CCPR)
CCPR reduces both the computation time and the decoding complexity by allowing a trade-off between the accuracy and the speed of computation.
We then extend this approach to distributed implementation of more general computation tasks by proposing a coded communication scheme with partial recovery.
arXiv Detail & Related papers (2020-07-04T21:34:49Z) - Hierarchical Reinforcement Learning as a Model of Human Task
Interleaving [60.95424607008241]
We develop a hierarchical model of supervisory control driven by reinforcement learning.
The model reproduces known empirical effects of task interleaving.
The results support hierarchical RL as a plausible model of task interleaving.
arXiv Detail & Related papers (2020-01-04T17:53:28Z)
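Several of the papers above (the straggler-tolerant SGD and adaptive k-sync summaries) rely on the same building block: dispatch tasks to the workers and wait for only the $k$ fastest completions. A minimal sketch of that pattern, with simulated exponential computation times and hypothetical names, and not a reproduction of any cited algorithm, is:

```python
import heapq
import random

# Schematic "wait only for the k fastest workers" round. Computation times
# are simulated; this is an illustrative assumption, not any paper's method.

def ksync_round(worker_means, k):
    """Dispatch one task per worker; return the indices of the k workers that
    finish first and the wall-clock time of the round (k-th fastest completion)."""
    finish_times = [(random.expovariate(1.0 / m), i) for i, m in enumerate(worker_means)]
    fastest = heapq.nsmallest(k, finish_times)  # the k earliest completions
    round_time = fastest[-1][0]                 # the main node waits only this long
    used_workers = [i for _, i in fastest]
    return used_workers, round_time

if __name__ == "__main__":
    random.seed(1)
    means = [0.1, 0.2, 0.5, 1.0, 4.0]  # one pronounced straggler
    used, t = ksync_round(means, k=3)
    print(f"aggregated results from workers {used} after {t:.3f}s")
```

The round time is set by the $k$-th fastest worker rather than the slowest one, which is how these schemes trade accuracy or redundancy for straggler tolerance.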