Related papers: Chat AI: A Seamless Slurm-Native Solution for HPC-Based Services

Chat AI: A Seamless Slurm-Native Solution for HPC-Based Services

URL: http://arxiv.org/abs/2407.00110v2
Date: Fri, 2 Aug 2024 15:34:22 GMT
Title: Chat AI: A Seamless Slurm-Native Solution for HPC-Based Services
Authors: Ali Doosthosseini, Jonathan Decker, Hendrik Nolte, Julian M. Kunkel,
Abstract summary: Large language models (LLMs) allow researchers to run open source or custom fine-tuned LLMs and ensure users that their data remains private and is not stored without their consent. We propose an implementation consisting of a web service that runs on a cloud VM with secure access to a scalable backend running a multitude of LLM models on HPC systems. Our solution integrates with the HPC batch scheduler Slurm, enabling seamless deployment on HPC clusters, and is able to run side by side with regular Slurm workloads.
Score: 0.3124884279860061
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The widespread adoption of large language models (LLMs) has created a pressing need for an efficient, secure and private serving infrastructure, which allows researchers to run open source or custom fine-tuned LLMs and ensures users that their data remains private and is not stored without their consent. While high-performance computing (HPC) systems equipped with state-of-the-art GPUs are well-suited for training LLMs, their batch scheduling paradigm is not designed to support real-time serving of AI applications. Cloud systems, on the other hand, are well suited for web services but commonly lack access to the computational power of HPC clusters, especially expensive and scarce high-end GPUs, which are required for optimal inference speed. We propose an architecture with an implementation consisting of a web service that runs on a cloud VM with secure access to a scalable backend running a multitude of LLM models on HPC systems. By offering a web service using our HPC infrastructure to host LLMs, we leverage the trusted environment of local universities and research centers to offer a private and secure alternative to commercial LLM services. Our solution natively integrates with the HPC batch scheduler Slurm, enabling seamless deployment on HPC clusters, and is able to run side by side with regular Slurm workloads, while utilizing gaps in the schedule created by Slurm. In order to ensure the security of the HPC system, we use the SSH ForceCommand directive to construct a robust circuit breaker, which prevents successful attacks on the web-facing server from affecting the cluster. We have successfully deployed our system as a production service, and made the source code available at \url{https://github.com/gwdg/chat-ai}

Related papers

ULTHO: Ultra-Lightweight yet Efficient Hyperparameter Optimization in Deep Reinforcement Learning [50.53705050673944]
We propose ULTHO, an ultra-lightweight yet powerful framework for fast HPO in deep RL within single runs. Specifically, we formulate the HPO process as a multi-armed bandit with clustered arms (MABC) and link it directly to long-term return optimization. We test ULTHO on benchmarks including ALE, Procgen, MiniGrid, and PyBullet.
arXiv Detail & Related papers (2025-03-08T07:03:43Z)
LLM as HPC Expert: Extending RAG Architecture for HPC Data [0.058520770038704165]
This paper introduces Hypothetical Command Embeddings (HyCE), a novel method that extends Retrieval-Augmented Generation (RAG) HyCE enriches large language models (LLM) with real-time, user-specific HPC information, addressing the limitations of fine-tuned models on such data. We tackle essential security concerns, including data privacy and command execution risks, associated with deploying LLMs in HPC environments.
arXiv Detail & Related papers (2024-12-09T02:55:30Z)
Lightweight, Secure and Stateful Serverless Computing with PSL [43.025002382616066]
We present Function-as-a-Serivce (F) framework for Trusted Execution Environments (TEEs) The framework provides rich programming language support on heterogeneous TEE hardware for statically compiled binaries and/or WebAssembly (WASM) bytecodes. It achieves near-native execution speeds by utilizing the dynamic memory mapping capabilities of Intel SGX2.
arXiv Detail & Related papers (2024-10-25T23:17:56Z)
Safely Learning with Private Data: A Federated Learning Framework for Large Language Model [3.1077263218029105]
Federated learning (FL) is an ideal solution for training models with distributed private data. Traditional frameworks like FedAvg are unsuitable for large language models (LLM) We propose FL-GLM, which prevents data leakage caused by both server-side and peer-client attacks.
arXiv Detail & Related papers (2024-06-21T06:43:15Z)
FSD-Inference: Fully Serverless Distributed Inference with Scalable Cloud Communication [2.1301190271783317]
We present FSD-Inference, the first fully serverless and highly scalable system for distributed ML inference. We introduce novel fully serverless communication schemes for ML inference workloads, leveraging both cloud-based publish-subscribe/queueing and object storage offerings.
arXiv Detail & Related papers (2024-03-22T13:31:24Z)
HasTEE+ : Confidential Cloud Computing and Analytics with Haskell [50.994023665559496]
Confidential computing enables the protection of confidential code and data in a co-tenanted cloud deployment using specialized hardware isolation units called Trusted Execution Environments (TEEs) TEEs offer low-level C/C++-based toolchains that are susceptible to inherent memory safety vulnerabilities and lack language constructs to monitor explicit and implicit information-flow leaks. We address the above with HasTEE+, a domain-specific language (cla) embedded in Haskell that enables programming TEEs in a high-level language with strong type-safety.
arXiv Detail & Related papers (2024-01-17T00:56:23Z)
Distributed Inference and Fine-tuning of Large Language Models Over The Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size. These models require high-end hardware, making them inaccessible to most researchers. We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
arXiv Detail & Related papers (2023-12-13T18:52:49Z)
Putting a Padlock on Lambda -- Integrating vTPMs into AWS Firecracker [49.1574468325115]
Software services place implicit trust in the cloud provider, without an explicit trust relationship. There is currently no cloud provider that exposes Trusted Platform Module capabilities. We improve trust by integrating a virtual TPM device into the Firecracker, originally developed by Amazon Web Services.
arXiv Detail & Related papers (2023-10-05T13:13:55Z)
Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly [62.473245910234304]
This paper takes a hardware-centric approach to explore how Large Language Models can be brought to modern edge computing systems. We provide a micro-level hardware benchmark, compare the model FLOP utilization to a state-of-the-art data center GPU, and study the network utilization in realistic conditions.
arXiv Detail & Related papers (2023-10-04T20:27:20Z)
FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential vast untapped consumer-level GPU. This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, the variability of peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
MLProxy: SLA-Aware Reverse Proxy for Machine Learning Inference Serving on Serverless Computing Platforms [5.089110111757978]
Serving machine learning inference workloads on the cloud is still a challenging task on the production level. Serverless computing has emerged in recent years to automate most infrastructure management tasks. We present ML Proxy, a reverse proxy to support efficient machine learning serving workloads on serverless computing systems.
arXiv Detail & Related papers (2022-02-23T00:27:49Z)
Secure Platform for Processing Sensitive Data on Shared HPC Systems [0.0]
High performance computing clusters pose challenges for processing sensitive data. In this work we present a novel method for creating secure computing environments on traditional multi-tenant high-performance computing clusters.
arXiv Detail & Related papers (2021-03-26T18:30:33Z)
A Privacy-Preserving Distributed Architecture for Deep-Learning-as-a-Service [68.84245063902908]
This paper introduces a novel distributed architecture for deep-learning-as-a-service. It is able to preserve the user sensitive data while providing Cloud-based machine and deep learning services.
arXiv Detail & Related papers (2020-03-30T15:12:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.