TZ-LLM: Protecting On-Device Large Language Models with Arm TrustZone
- URL: http://arxiv.org/abs/2511.13717v1
- Date: Mon, 17 Nov 2025 18:59:20 GMT
- Title: TZ-LLM: Protecting On-Device Large Language Models with Arm TrustZone
- Authors: Xunjie Wang, Jiacheng Shi, Zihan Zhao, Yang Yu, Zhichao Hua, Jinyu Gu,
- Abstract summary: Large Language Models (LLMs) deployed on mobile devices offer benefits like user privacy and reduced network latency, but introduce a significant security risk.<n>We propose a system design for protecting on-device LLMs using Arm Trusted Execution Environment (TEE), TrustZone.
- Score: 8.538298365840877
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) deployed on mobile devices offer benefits like user privacy and reduced network latency, but introduce a significant security risk: the leakage of proprietary models to end users. To mitigate this risk, we propose a system design for protecting on-device LLMs using Arm Trusted Execution Environment (TEE), TrustZone. Our system addresses two primary challenges: (1) The dilemma between memory efficiency and fast inference (caching model parameters within TEE memory). (2) The lack of efficient and secure Neural Processing Unit (NPU) time-sharing between Rich Execution Environment (REE) and TEE. Our approach incorporates two key innovations. First, we employ pipelined restoration, leveraging the deterministic memory access patterns of LLM inference to prefetch parameters on demand, hiding memory allocation, I/O and decryption latency under computation time. Second, we introduce a co-driver design, creating a minimal data plane NPU driver in the TEE that collaborates with the full-fledged REE driver. This reduces the TEE TCB size and eliminates control plane reinitialization overhead during NPU world switches. We implemented our system on the emerging OpenHarmony OS and the llama.cpp inference framework, and evaluated it with various LLMs on an Arm Rockchip device. Compared to a strawman TEE baseline lacking our optimizations, our system reduces TTFT by up to 90.9% and increases decoding speed by up to 23.2%.
Related papers
- Blockchain-Enabled Routing for Zero-Trust Low-Altitude Intelligent Networks [77.17664010626726]
We focus on the routing with multiple UAV clusters in low-altitude intelligent networks (LAINs)<n>To minimize the damage caused by potential threats, we present the zero-trust architecture with the software-defined perimeter and blockchain techniques.<n>We show that the proposed framework reduces the average E2E delay by 59% and improves the TSR by 29% on average compared to benchmarks.
arXiv Detail & Related papers (2026-02-27T04:30:35Z) - HALO: Semantic-Aware Distributed LLM Inference in Lossy Edge Network [50.33808558714122]
Large language models' (LLMs) inference at the edge can facilitate prompt service responsiveness while protecting user privacy.<n>We propose HALO, a novel framework that can boost the distributed LLM inference in lossy edge network.<n> Experimental results from a Raspberry Pi cluster demonstrate that HALO achieves a 3.41x end-to-end speedup for LLaMA-series LLMs under unreliable network conditions.
arXiv Detail & Related papers (2026-01-16T07:37:23Z) - Amulet: Fast TEE-Shielded Inference for On-Device Model Protection [15.936694312917512]
On-device machine learning (ML) introduces new security concerns about model privacy.<n> Storing valuable trained ML models on user devices exposes them to potential extraction by adversaries.<n>We propose Amulet, a fast TEE-shielded on-device inference framework for ML model protection.
arXiv Detail & Related papers (2025-12-08T12:22:51Z) - Silentflow: Leveraging Trusted Execution for Resource-Limited MPC via Hardware-Algorithm Co-design [6.998260344481881]
We introduce Silentflow, a protocol that eliminates communication in COT generation.<n>We balance end-to-end latency and resource demands, achieving up to 39.51x speedup over state-of-the-art protocols.
arXiv Detail & Related papers (2025-08-18T21:00:10Z) - Digital Twin-Assisted Federated Learning with Blockchain in Multi-tier Computing Systems [67.14406100332671]
In Industry 4.0 systems, resource-constrained edge devices engage in frequent data interactions.
This paper proposes a digital twin (DT) and federated digital twin (FL) scheme.
The efficacy of our proposed cooperative interference-based FL process has been verified through numerical analysis.
arXiv Detail & Related papers (2024-11-04T17:48:02Z) - FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs)
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z) - Memory-Efficient and Secure DNN Inference on TrustZone-enabled Consumer IoT Devices [9.928745904761358]
Edge intelligence enables resource-demanding Deep Neural Network (DNN) inference without transferring original data.
For privacy-sensitive applications, deploying models in hardware-isolated trusted execution environments (TEEs) becomes essential.
We present a novel approach for advanced model deployment in TrustZone that ensures comprehensive privacy preservation during model inference.
arXiv Detail & Related papers (2024-03-19T09:22:50Z) - Practical Conformer: Optimizing size, speed and flops of Conformer for
on-Device and cloud ASR [67.63332492134332]
We design an optimized conformer that is small enough to meet on-device restrictions and has fast inference on TPUs.
Our proposed encoder can double as a strong standalone encoder in on device, and as the first part of a high-performance ASR pipeline.
arXiv Detail & Related papers (2023-03-31T23:30:48Z) - RRNet: Towards ReLU-Reduced Neural Network for Two-party Computation
Based Private Inference [17.299835585861747]
We introduce RRNet, a framework that aims to jointly reduce the overhead of MPC comparison protocols and accelerate computation through hardware acceleration.
Our approach integrates the hardware latency of cryptographic building blocks into the DNN loss function, resulting in improved energy efficiency, accuracy, and security guarantees.
arXiv Detail & Related papers (2023-02-05T04:02:13Z) - PolyMPCNet: Towards ReLU-free Neural Architecture Search in Two-party
Computation Based Private Inference [23.795457990555878]
Secure multi-party computation (MPC) has been discussed, to enable the privacy-preserving deep learning (DL) computation.
MPCs often come at very high computation overhead, and potentially prohibit their popularity in large scale systems.
In this work, we develop a systematic framework, PolyMPCNet, of joint overhead reduction of MPC comparison protocol and hardware acceleration.
arXiv Detail & Related papers (2022-09-20T02:47:37Z) - Federated Learning for Energy-limited Wireless Networks: A Partial Model
Aggregation Approach [79.59560136273917]
limited communication resources, bandwidth and energy, and data heterogeneity across devices are main bottlenecks for federated learning (FL)
We first devise a novel FL framework with partial model aggregation (PMA)
The proposed PMA-FL improves 2.72% and 11.6% accuracy on two typical heterogeneous datasets.
arXiv Detail & Related papers (2022-04-20T19:09:52Z) - An Adaptive Device-Edge Co-Inference Framework Based on Soft
Actor-Critic [72.35307086274912]
High-dimension parameter model and large-scale mathematical calculation restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL)-Soft Actor Critic for discrete (SAC-d), which generates the emphexit point, emphexit point, and emphcompressing bits by soft policy iterations.
Based on the latency and accuracy aware reward design, such an computation can well adapt to the complex environment like dynamic wireless channel and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z) - Adaptive Subcarrier, Parameter, and Power Allocation for Partitioned
Edge Learning Over Broadband Channels [69.18343801164741]
partitioned edge learning (PARTEL) implements parameter-server training, a well known distributed learning method, in wireless network.
We consider the case of deep neural network (DNN) models which can be trained using PARTEL by introducing some auxiliary variables.
arXiv Detail & Related papers (2020-10-08T15:27:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.