A Unified Cloud-Enabled Discrete Event Parallel and Distributed
Simulation Architecture
- URL: http://arxiv.org/abs/2302.11242v1
- Date: Wed, 22 Feb 2023 09:47:09 GMT
- Title: A Unified Cloud-Enabled Discrete Event Parallel and Distributed
Simulation Architecture
- Authors: José L. Risco-Martín, Kevin Henares, Saurabh Mittal, Luis F.
Almendras and Katzalin Olcoz
- Abstract summary: We present a unified parallel and distributed M&S architecture with enough flexibility to deploy simulations in the Cloud.
Our framework is based on the Discrete Event System Specification (DEVS) formalism.
The performance of the parallel and distributed framework is tested using the xDEVS M&S tool and the DEVStone benchmark with up to eight computing nodes.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Cloud simulation environments are widely used today to model and
simulate complex systems that need remote accessibility and variable capacity.
In this regard, scalability issues in Modeling and Simulation (M&S)
computational requirements can be tackled through the elasticity of on-demand
Cloud deployment. However, implementing a high-performance cloud M&S framework
along these elastic principles is not trivial, since parallelizing and
distributing existing architectures is challenging. Indeed, parallel and
distributed M&S have evolved along separate paths: parallel solutions have
typically been ad hoc, whereas distributed approaches have produced standard
frameworks such as the High Level Architecture (HLA) or have built on
distributed technologies such as the Message Passing Interface (MPI). Only a
few developments have kept pace with today's elastic deployment of computing
hardware resources, largely through Simulation as a Service (SaaS)
implementations, and these have evolved independently of the ad-hoc parallel
branch. In this paper, we present a unified parallel and distributed M&S
architecture flexible enough to deploy parallel and distributed simulations in
the Cloud with low effort, without modifying the underlying model source code,
while achieving significant speedups over sequential simulation, especially in
the parallel implementation. Our framework is based on the Discrete Event
System Specification (DEVS) formalism. The performance of the parallel and
distributed framework is evaluated with the xDEVS M&S tool and Application
Programming Interface (API) and the DEVStone benchmark on up to eight
computing nodes, obtaining maximum speedups of $15.95\times$ (parallel) and
$1.84\times$ (distributed).
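To make the modeling side concrete, the sketch below shows a minimal DEVS atomic model, a generator that emits a job every fixed period, together with a toy sequential coordinator. It is plain, self-contained Python written under assumptions: the class and method names (Generator, advance, delta_int, ...) are illustrative and do not reproduce the actual xDEVS API.

```python
# Minimal, self-contained sketch of a DEVS atomic model plus a toy sequential
# coordinator. Names are illustrative assumptions, not the xDEVS interface.

class Generator:
    """Atomic DEVS model: emits one 'job' event every `period` time units."""

    def __init__(self, period):
        self.period = period
        self.sigma = period          # time remaining until next internal event
        self.count = 0

    def advance(self):               # ta: time-advance function
        return self.sigma

    def output(self):                # lambda: output emitted just before delta_int
        return {"port": "out", "value": f"job-{self.count}"}

    def delta_int(self):             # internal transition function
        self.count += 1
        self.sigma = self.period

    def delta_ext(self, elapsed, msg):  # external transition (unused in this sketch)
        self.sigma -= elapsed


def simulate(model, until):
    """Toy sequential coordinator: fires internal events in time order."""
    clock = 0.0
    while clock + model.advance() <= until:
        clock += model.advance()
        print(f"t={clock:.1f}  {model.output()}")
        model.delta_int()


if __name__ == "__main__":
    simulate(Generator(period=2.0), until=10.0)
```

The point relevant to the paper is that a Parallel DEVS coordinator can, at each virtual time step, collect all imminent components and fire their outputs and transitions concurrently (threads in the parallel case, separate nodes in the distributed case), so the same model code can be simulated sequentially, in parallel, or distributed without modification.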
Related papers
- SeBS-Flow: Benchmarking Serverless Cloud Function Workflows [51.4200085836966]
We propose SeBS-Flow, the first serverless workflow benchmarking suite.
SeBS-Flow includes six real-world application benchmarks and four microbenchmarks representing different computational patterns.
We conduct comprehensive evaluations on three major cloud platforms, assessing performance, cost, scalability, and runtime deviations.
arXiv Detail & Related papers (2024-10-04T14:52:18Z)
- ATOM: Asynchronous Training of Massive Models for Deep Learning in a Decentralized Environment [7.916080032572087]
ATOM is a resilient distributed training framework designed for asynchronous training of vast models in a decentralized setting.
ATOM aims to accommodate a complete LLM on one host (peer) through seamless model swapping, and concurrently trains multiple copies across various peers to optimize training throughput.
Our experiments with different GPT-3 model configurations show that, in scenarios with suboptimal network connections, ATOM can improve training efficiency by up to $20\times$ compared with state-of-the-art decentralized pipeline parallelism approaches.
arXiv Detail & Related papers (2024-03-15T17:43:43Z)
- In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
- Distributed Compressed Sparse Row Format for Spiking Neural Network Simulation, Serialization, and Interoperability [0.48733623015338234]
We discuss dCSR, a parallel extension of the compressed sparse row (CSR) format widely used to represent sparse matrices efficiently.
We contend that organizing additional network information, such as neuron and synapse state, in alignment with its adjacency as dCSR provides a straightforward partition-based distribution of network state.
We provide a potential implementation, and put it forward for adoption within the neural computing community.
arXiv Detail & Related papers (2023-04-12T03:19:06Z)
- SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Existing approaches, however, do not supply the procedures and pipelines needed to actually deploy machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor frameworks and script-language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- Parallel Simulation of Quantum Networks with Distributed Quantum State Management [56.24769206561207]
We identify requirements for parallel simulation of quantum networks and develop the first parallel discrete event quantum network simulator.
Our contributions include the design and development of a quantum state manager that maintains shared quantum information distributed across multiple processes.
We release the parallel SeQUeNCe simulator as an open-source tool alongside the existing sequential version.
arXiv Detail & Related papers (2021-11-06T16:51:17Z)
- Device Scheduling and Update Aggregation Policies for Asynchronous Federated Learning [72.78668894576515]
Federated Learning (FL) is a recently emerged decentralized machine learning (ML) framework.
We propose an asynchronous FL framework with periodic aggregation to eliminate the straggler issue in FL systems.
arXiv Detail & Related papers (2021-07-23T18:57:08Z)
- Reinforcement Learning on Computational Resource Allocation of Cloud-based Wireless Networks [22.06811314358283]
Wireless networks used for Internet of Things (IoT) are expected to largely involve cloud-based computing and processing.
In a cloud environment, dynamic computational resource allocation is essential to save energy while maintaining the performance of the processes.
This paper models this dynamic computational resource allocation problem as a Markov Decision Process (MDP) and designs a model-based reinforcement-learning agent to optimise the dynamic allocation of CPU resources.
The results show that our agent converges rapidly to the optimal policy, performs stably in different settings, and matches or outperforms a baseline algorithm in energy savings across the scenarios considered.
arXiv Detail & Related papers (2020-10-10T15:16:26Z)
- Deep Generative Models that Solve PDEs: Distributed Computing for Training Large Data-Free Models [25.33147292369218]
Recent progress in scientific machine learning (SciML) has opened up the possibility of training novel neural network architectures that solve complex partial differential equations (PDEs).
Here we report on a software framework for data-parallel distributed deep learning that resolves the twin challenges of training these large SciML models.
Our framework provides several out-of-the-box capabilities, including (a) loss integrity independent of the number of processes, (b) synchronized batch normalization, and (c) distributed higher-order optimization methods.
arXiv Detail & Related papers (2020-07-24T22:42:35Z)
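As a rough illustration of property (a) in the entry above, the sketch below shows one standard data-parallel pattern for making a reported loss independent of the number of processes: each rank sums its per-sample losses, the sums and sample counts are all-reduced together, and the global mean is computed from the totals. This is a generic torch.distributed pattern written for illustration, not the framework described in that paper.

```python
# Hedged sketch: a loss value that is independent of the number of
# data-parallel processes. Generic PyTorch pattern, not the paper's framework.
import torch
import torch.distributed as dist


def globally_averaged_loss(local_loss_sum: torch.Tensor,
                           local_num_samples: int) -> torch.Tensor:
    """local_loss_sum: sum (not mean) of per-sample losses on this rank."""
    stats = torch.stack([
        local_loss_sum.detach(),
        torch.tensor(float(local_num_samples), device=local_loss_sum.device),
    ])
    # Sum both the loss and the sample count over all ranks; the resulting
    # mean is the same however the global batch is split across processes.
    dist.all_reduce(stats, op=dist.ReduceOp.SUM)
    return stats[0] / stats[1]
```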