Plinius: Secure and Persistent Machine Learning Model Training
- URL: http://arxiv.org/abs/2104.02987v2
- Date: Thu, 8 Apr 2021 06:03:57 GMT
- Title: Plinius: Secure and Persistent Machine Learning Model Training
- Authors: Peterson Yuhala, Pascal Felber, Valerio Schiavoni, Alain Tchana
- Abstract summary: Persistent memory (PM) is resilient to power loss (unlike DRAM)
We present PLINIUS, a framework using Intel SGX enclaves for secure training of ML models and PM for fault tolerance guarantees.
- Score: 2.1375296464337086
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the increasing popularity of cloud based machine learning (ML)
techniques there comes a need for privacy and integrity guarantees for ML data.
In addition, the significant scalability challenges faced by DRAM coupled with
the high access times of secondary storage represent a huge performance
bottleneck for ML systems. While solutions exist to tackle the security aspect,
performance remains an issue. Persistent memory (PM) is resilient to power loss
(unlike DRAM), provides fast and fine-granular access to memory (unlike disk
storage) and has latency and bandwidth close to DRAM (in the order of ns and
GB/s, respectively). We present PLINIUS, an ML framework using Intel SGX
enclaves for secure training of ML models and PM for fault tolerance
guarantees. PLINIUS uses a novel mirroring mechanism to create and maintain
(i) encrypted mirror copies of ML models on PM, and (ii) encrypted training
data in byte-addressable PM, for near-instantaneous data recovery after a
system failure. Compared to disk-based checkpointing systems, PLINIUS is 3.2x
and 3.7x faster respectively for saving and restoring models on real PM
hardware, achieving robust and secure ML model training in SGX enclaves.
Related papers
- DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR demonstrates significant reductions in the computational cost of the LLM by 5.2-6.5x and in its GPU memory usage by 2-6x without compromising performance.
arXiv Detail & Related papers (2024-11-04T18:26:08Z) - Mixture of Attentions For Speculative Decoding [17.344416130742232]
Speculative decoding (SD) leverages smaller models to efficiently propose future tokens, which are then verified by the large language model in parallel.
We identify several limitations of SD models including the lack of on-policyness during training and partial observability.
We propose a more grounded architecture for small models by introducing a Mixture of Attentions for SD.
arXiv Detail & Related papers (2024-10-04T10:25:52Z) - MiniCPM-V: A GPT-4V Level MLLM on Your Phone [83.10007643273521]
MiniCPM-V is a series of efficient MLLMs deployable on end-side devices.
By integrating the latest MLLM techniques in architecture, pretraining and alignment, MiniCPM-V 2.5 has several notable features.
MiniCPM-V can be viewed as a representative example of a promising trend.
arXiv Detail & Related papers (2024-08-03T15:02:21Z) - SLIP: Securing LLMs IP Using Weights Decomposition [0.0]
Large language models (LLMs) have recently seen widespread adoption, in both academia and industry.
As these models grow, they become valuable intellectual property (IP), reflecting enormous investments by their owners.
Current methods to protect models' IP on the edge have limitations in terms of practicality, loss in accuracy, or suitability to requirements.
We introduce a novel hybrid inference algorithm, named SLIP, designed to protect edge-deployed models from theft.
arXiv Detail & Related papers (2024-07-15T16:37:55Z) - PermLLM: Private Inference of Large Language Models within 3 Seconds under WAN [19.014325509263536]
ChatGPT marks the arrival of the large language model (LLM) era.
PermLLM achieves two-party private inference of the ChatGLM-6B model at the speed of around 3s/token.
arXiv Detail & Related papers (2024-05-29T04:06:50Z) - MemLLM: Finetuning LLMs to Use An Explicit Read-Write Memory [49.96019697955383]
We introduce MemLLM, a novel method of enhancing knowledge capabilities by integrating a structured and explicit read-and-write memory module.
Our experiments indicate that MemLLM enhances performance and interpretability, in language modeling in general and in knowledge-intensive tasks in particular.
We see MemLLM as an important step towards making LLMs more grounded and factual through memory augmentation.
arXiv Detail & Related papers (2024-04-17T18:13:16Z) - AI and Memory Wall [81.06494558184049]
We show how memory bandwidth can become the dominant bottleneck for decoder models.
We argue for a redesign in model architecture, training, and deployment strategies to overcome this memory limitation.
arXiv Detail & Related papers (2024-03-21T04:31:59Z) - Online Adaptation of Language Models with a Memory of Amortized Contexts [82.02369596879817]
Memory of Amortized Contexts (MAC) is an efficient and effective online adaptation framework for large language models.
We show how MAC can be combined with and improve the performance of popular alternatives such as retrieval-augmented generation.
arXiv Detail & Related papers (2024-03-07T08:34:57Z) - LLM in a flash: Efficient Large Language Model Inference with Limited Memory [19.668719251238176]
Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks.
This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity.
Our method involves constructing an inference cost model that takes into account the characteristics of flash memory.
arXiv Detail & Related papers (2023-12-12T18:57:08Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the vast untapped potential of consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability and heterogeneity of peers and devices.
arXiv Detail & Related papers (2023-09-03T13:27:56Z) - S3ML: A Secure Serving System for Machine Learning Inference [15.994551402176189]
We present S3ML, a secure serving system for machine learning inference.
S3ML runs machine learning models in Intel SGX enclaves to protect users' privacy.
arXiv Detail & Related papers (2020-10-13T07:41:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.