Federated Learning Framework for Scalable AI in Heterogeneous HPC and Cloud Environments
- URL: http://arxiv.org/abs/2511.19479v1
- Date: Sat, 22 Nov 2025 18:39:25 GMT
- Title: Federated Learning Framework for Scalable AI in Heterogeneous HPC and Cloud Environments
- Authors: Sangam Ghimire, Paribartan Timalsina, Nirjal Bhurtel, Bishal Neupane, Bigyan Byanju Shrestha, Subarna Bhattarai, Prajwal Gaire, Jessica Thapa, Sudan Jha
- Abstract summary: We present a federated learning framework built to run efficiently across mixed HPC and cloud environments. Our system addresses key challenges such as system heterogeneity, communication overhead, and resource scheduling, while maintaining model accuracy and data privacy.
- Score: 0.1805840413757548
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As the demand grows for scalable and privacy-aware AI systems, Federated Learning (FL) has emerged as a promising solution, allowing decentralized model training without moving raw data. At the same time, the combination of high-performance computing (HPC) and cloud infrastructure offers vast computing power but introduces new complexities, especially when dealing with heterogeneous hardware, communication limits, and non-uniform data. In this work, we present a federated learning framework built to run efficiently across mixed HPC and cloud environments. Our system addresses key challenges such as system heterogeneity, communication overhead, and resource scheduling, while maintaining model accuracy and data privacy. Through experiments on a hybrid testbed, we demonstrate strong performance in terms of scalability, fault tolerance, and convergence, even under non-Independent and Identically Distributed (non-IID) data distributions and varied hardware. These results highlight the potential of federated learning as a practical approach to building scalable Artificial Intelligence (AI) systems in modern, distributed computing settings.
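The abstract does not say which aggregation rule the framework uses. As a rough illustration of "decentralized model training without moving raw data", the sketch below assumes FedAvg-style weighted averaging on a toy linear model. All names here (`local_update`, `federated_average`, `run_rounds`) are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Dataset = Sequence[Tuple[float, float]]  # (x, y) pairs held privately by one client


def local_update(weights: Sequence[float], data: Dataset,
                 lr: float = 0.1, epochs: int = 5) -> Vector:
    """Client-side step: a few epochs of gradient descent on y ~ w0 + w1*x (MSE).

    Only the updated weights leave the client; the raw data stays local.
    """
    w0, w1 = weights
    n = len(data)
    for _ in range(epochs):
        g0 = sum(2 * (w0 + w1 * x - y) for x, y in data) / n
        g1 = sum(2 * (w0 + w1 * x - y) * x for x, y in data) / n
        w0 -= lr * g0
        w1 -= lr * g1
    return [w0, w1]


def federated_average(updates: Sequence[Vector], sizes: Sequence[int]) -> Vector:
    """Server-side step: average client weights, weighted by local sample count."""
    total = sum(sizes)
    dim = len(updates[0])
    return [sum(u[i] * s for u, s in zip(updates, sizes)) / total
            for i in range(dim)]


def run_rounds(clients: Sequence[Dataset], rounds: int = 50) -> Vector:
    """Alternate local training and server aggregation for a fixed number of rounds."""
    global_w: Vector = [0.0, 0.0]
    for _ in range(rounds):
        updates = [local_update(global_w, d) for d in clients]
        global_w = federated_average(updates, [len(d) for d in clients])
    return global_w


if __name__ == "__main__":
    # Two clients with disjoint x ranges and unequal sizes: a toy stand-in
    # for non-IID data. Both are drawn from the same line y = 1 + 2x.
    client_a = [(x / 10, 1 + 2 * x / 10) for x in range(0, 5)]
    client_b = [(x / 10, 1 + 2 * x / 10) for x in range(5, 15)]
    print(run_rounds([client_a, client_b]))
```

Weighting by sample count keeps a small client from pulling the global model as hard as a large one. Real deployments on mixed HPC and cloud nodes would also need to handle stragglers and dropped clients, which this sketch ignores.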
Related papers
- A Survey on Cloud-Edge-Terminal Collaborative Intelligence in AIoT Networks [49.90474228895655]
Cloud-edge-terminal collaborative intelligence (CETCI) is a fundamental paradigm within the artificial intelligence of things (AIoT) community.
CETCI has made significant progress with emerging AIoT applications, moving beyond isolated layer optimization to deployable collaborative intelligence systems.
This survey describes foundational architectures, enabling technologies, and scenarios of CETCI paradigms, offering a tutorial-style review for CISAIOT beginners.
arXiv Detail & Related papers (2025-08-26T08:38:01Z)
- Edge-Cloud Collaborative Computing on Distributed Intelligence and Model Optimization: A Survey [58.50944604905037]
Edge-cloud collaborative computing (ECCC) has emerged as a pivotal paradigm for addressing the computational demands of modern intelligent applications.
Recent advancements in AI, particularly deep learning and large language models (LLMs), have dramatically enhanced the capabilities of these distributed systems.
This survey provides a structured tutorial on fundamental architectures, enabling technologies, and emerging applications.
arXiv Detail & Related papers (2025-05-03T13:55:38Z)
- Model Agnostic Hybrid Sharding For Heterogeneous Distributed Inference [11.39873199479642]
Nesa introduces a model-agnostic sharding framework designed for decentralized AI inference.
Our framework uses blockchain-based deep neural network sharding to distribute computational tasks across a diverse network of nodes.
Our results highlight the potential to democratize access to cutting-edge AI technologies.
arXiv Detail & Related papers (2024-07-29T08:18:48Z)
- Generative AI like ChatGPT in Blockchain Federated Learning: use cases, opportunities and future [4.497001527881303]
This research explores potential integrations of generative AI in federated learning.
generative adversarial networks (GANs) and variational autoencoders (VAEs)
Generating synthetic data helps federated learning address challenges related to limited data availability.
arXiv Detail & Related papers (2024-07-25T19:43:49Z)
- FedSR: A Semi-Decentralized Federated Learning Algorithm for Non-IIDness in IoT System [2.040586739710704]
In the Industrial Internet of Things (IoT), a large amount of data will be generated every day.
Due to privacy and security issues, it is difficult to collect all these data together to train deep learning models.
In this paper, we combine centralized federated learning with decentralized federated learning to design a semi-decentralized cloud-edge-device hierarchical federated learning framework.
arXiv Detail & Related papers (2024-03-19T09:34:01Z)
- Effective Intrusion Detection in Heterogeneous Internet-of-Things Networks via Ensemble Knowledge Distillation-based Federated Learning [52.6706505729803]
We introduce Federated Learning (FL) to collaboratively train a decentralized shared model of Intrusion Detection Systems (IDS).
FLEKD enables a more flexible aggregation method than conventional model fusion techniques.
Experiment results show that the proposed approach outperforms local training and traditional FL in terms of both speed and performance.
arXiv Detail & Related papers (2024-01-22T14:16:37Z)
- Coordination-free Decentralised Federated Learning on Complex Networks: Overcoming Heterogeneity [2.6849848612544]
Federated Learning (FL) is a framework for performing a learning task in an edge computing scenario.
We propose a communication-efficient Decentralised Federated Learning (DFL) algorithm able to cope with them.
Our solution allows devices communicating only with their direct neighbours to train an accurate model.
arXiv Detail & Related papers (2023-12-07T18:24:19Z)
- Federated Learning-Empowered AI-Generated Content in Wireless Networks [58.48381827268331]
Federated learning (FL) can be leveraged to improve learning efficiency and achieve privacy protection for AIGC.
We present FL-based techniques for empowering AIGC, and aim to enable users to generate diverse, personalized, and high-quality content.
arXiv Detail & Related papers (2023-07-14T04:13:11Z)
- FedILC: Weighted Geometric Mean and Invariant Gradient Covariance for Federated Learning on Non-IID Data [69.0785021613868]
Federated learning is a distributed machine learning approach which enables a shared server model to learn by aggregating the locally-computed parameter updates with the training data from spatially-distributed client silos.
We propose the Federated Invariant Learning Consistency (FedILC) approach, which leverages the gradient covariance and the geometric mean of Hessians to capture both inter-silo and intra-silo consistencies.
This is relevant to various fields such as medical healthcare, computer vision, and the Internet of Things (IoT).
arXiv Detail & Related papers (2022-05-19T03:32:03Z)
- Federated Stochastic Gradient Descent Begets Self-Induced Momentum [151.4322255230084]
Federated learning (FL) is an emerging machine learning method that can be applied in mobile edge systems.
We show that running stochastic gradient descent (SGD) in such a setting can be viewed as adding a momentum-like term to the global aggregation process.
arXiv Detail & Related papers (2022-02-17T02:01:37Z)
- Integrating Deep Learning in Domain Sciences at Exascale [2.241545093375334]
We evaluate existing packages for their ability to run deep learning models and applications on large-scale HPC systems efficiently.
We propose new asynchronous parallelization and optimization techniques for current large-scale heterogeneous systems.
We present illustrations and potential solutions for enhancing traditional compute- and data-intensive applications with AI.
arXiv Detail & Related papers (2020-11-23T03:09:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.