SecureInfer: Heterogeneous TEE-GPU Architecture for Privacy-Critical Tensors for Large Language Model Deployment
- URL: http://arxiv.org/abs/2510.19979v1
- Date: Wed, 22 Oct 2025 19:17:31 GMT
- Title: SecureInfer: Heterogeneous TEE-GPU Architecture for Privacy-Critical Tensors for Large Language Model Deployment
- Authors: Tushar Nayan, Ziqi Zhang, Ruimin Sun
- Abstract summary: SecureInfer is a framework that offloads compute-intensive operations to untrusted accelerators. We implement a prototype of SecureInfer using the LLaMA-2 model and evaluate it across performance and security metrics.
- Score: 9.666696979829359
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the increasing deployment of Large Language Models (LLMs) on mobile and edge platforms, securing them against model extraction attacks has become a pressing concern. However, protecting model privacy without sacrificing the performance benefits of untrusted AI accelerators, such as GPUs, presents a challenging trade-off. In this paper, we study high-performance secure execution of LLMs and present SecureInfer, a hybrid framework that leverages a heterogeneous Trusted Execution Environment (TEE)-GPU architecture to isolate privacy-critical components while offloading compute-intensive operations to untrusted accelerators. Building upon an outsourcing scheme, SecureInfer adopts an information-theoretic and threat-informed partitioning strategy: security-sensitive components, including non-linear layers, attention-head projections, FFN transformations, and LoRA adapters, are executed inside an SGX enclave, while the remaining linear operations (matrix multiplications) are performed on the GPU after encryption and are securely restored within the enclave. We implement a prototype of SecureInfer using the LLaMA-2 model and evaluate it across performance and security metrics. Our results show that SecureInfer offers strong security guarantees with reasonable performance, making it a practical solution for secure on-device model inference.
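The abstract describes offloading encrypted linear operations to the GPU and restoring the result inside the enclave. A common instantiation of this kind of outsourcing scheme (in the spirit of Slalom-style additive blinding, not necessarily SecureInfer's exact protocol) masks the privacy-sensitive activation with a random one-time pad before the untrusted matmul; all dimensions and variable names below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for one linear layer (illustrative only).
d_in, d_out = 64, 32
W = rng.standard_normal((d_in, d_out))   # weight matrix offloaded to the GPU
x = rng.standard_normal(d_in)            # privacy-sensitive activation (enclave-side)

# --- Inside the enclave: blind the activation with a fresh random mask r ---
r = rng.standard_normal(d_in)
blinded = x + r                          # the only value the untrusted GPU sees

# --- On the untrusted GPU: a plain matmul over blinded data ---
gpu_out = blinded @ W                    # computes (x + r) @ W

# --- Back inside the enclave: subtract the precomputable correction r @ W ---
correction = r @ W                       # can be prepared offline, off the critical path
y = gpu_out - correction                 # recovers x @ W (up to floating-point error)

assert np.allclose(y, x @ W)
```

The linearity of the matmul is what makes this work: the GPU never observes `x`, yet the enclave only performs a cheap vector subtraction per layer, leaving the O(d_in * d_out) work on the accelerator.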
Related papers
- CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents [60.98294016925157]
AI agents are vulnerable to prompt injection attacks, where malicious content hijacks agent behavior to steal credentials or cause financial loss. We introduce Single-Shot Planning for CUAs, where a trusted planner generates a complete execution graph with conditional branches before any observation of potentially malicious content. Although this architectural isolation successfully prevents instruction injections, we show that additional measures are needed to prevent Branch Steering attacks.
arXiv Detail & Related papers (2026-01-14T23:06:35Z) - OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows [77.95511352806261]
Computer-using agents powered by Vision-Language Models (VLMs) have demonstrated human-like capabilities in operating digital environments like mobile platforms. We propose OS-Sentinel, a novel hybrid safety detection framework that combines a Formal Verifier for detecting explicit system-level violations with a Contextual Judge for assessing contextual risks and agent actions.
arXiv Detail & Related papers (2025-10-28T13:22:39Z) - Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs). SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO). We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z) - Securing Transformer-based AI Execution via Unified TEEs and Crypto-protected Accelerators [19.93096649006403]
Machine learning runs on untrusted cloud infrastructure, exposing data and models to potential breaches. Running model inference entirely within trusted execution environments (TEEs) is subject to non-trivial slowdown. We propose TwinShield, a framework enabling secure Transformer inference in heterogeneous TEE and accelerator systems.
arXiv Detail & Related papers (2025-07-04T03:52:53Z) - An Early Experience with Confidential Computing Architecture for On-Device Model Protection [6.024889136631505]
Arm Confidential Computing Architecture (CCA) is a new Arm extension for on-device machine learning (ML). In this paper, we evaluate the performance-privacy trade-offs of deploying models within CCA. Our framework can successfully protect the model against membership inference attacks, reducing the adversary's success rate by 8.3%.
arXiv Detail & Related papers (2025-04-11T13:21:33Z) - TEESlice: Protecting Sensitive Neural Network Models in Trusted Execution Environments When Attackers have Pre-Trained Models [12.253529209143197]
TSDP is a method that protects privacy-sensitive weights within TEEs and offloads insensitive weights to GPUs.
We introduce a novel partition before training strategy, which effectively separates privacy-sensitive weights from other components of the model.
Our evaluation demonstrates that our approach can offer full model protection with a computational cost reduced by a factor of 10.
arXiv Detail & Related papers (2024-11-15T04:52:11Z) - CoreGuard: Safeguarding Foundational Capabilities of LLMs Against Model Stealing in Edge Deployment [66.72332011814183]
CoreGuard is a computation- and communication-efficient protection method for proprietary large language models (LLMs) deployed on edge devices. CoreGuard employs an efficient protection protocol to reduce computational overhead and minimize communication overhead via a propagation protocol.
arXiv Detail & Related papers (2024-10-16T08:14:24Z) - ShieldGemma: Generative AI Content Moderation Based on Gemma [49.91147965876678]
ShieldGemma is a suite of safety content moderation models built upon Gemma2.
Models provide robust, state-of-the-art predictions of safety risks across key harm types.
arXiv Detail & Related papers (2024-07-31T17:48:14Z) - SLIP: Securing LLMs IP Using Weights Decomposition [0.0]
Large language models (LLMs) have recently seen widespread adoption, in both academia and industry.
As these models grow, they become valuable intellectual property (IP), reflecting enormous investments by their owners.
Current methods to protect models' IP on the edge have limitations in terms of practicality, loss in accuracy, or suitability to requirements.
We introduce a novel hybrid inference algorithm, named SLIP, designed to protect edge-deployed models from theft.
arXiv Detail & Related papers (2024-07-15T16:37:55Z) - Privacy preserving layer partitioning for Deep Neural Network models [0.21470800327528838]
Trusted Execution Environments (TEEs) can introduce significant performance overhead due to additional layers of encryption, decryption, security and integrity checks.
We introduce a layer partitioning technique that offloads computations to the GPU.
We conduct experiments demonstrating the effectiveness of our approach in protecting against input reconstruction attacks developed using a trained conditional Generative Adversarial Network (c-GAN).
arXiv Detail & Related papers (2024-04-11T02:39:48Z) - RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content [62.685566387625975]
Current mitigation strategies, while effective, are not resilient under adversarial attacks.
This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently moderate harmful and unsafe inputs.
arXiv Detail & Related papers (2024-03-19T07:25:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.