CoreGuard: Safeguarding Foundational Capabilities of LLMs Against Model Stealing in Edge Deployment
- URL: http://arxiv.org/abs/2410.13903v1
- Date: Wed, 16 Oct 2024 08:14:24 GMT
- Title: CoreGuard: Safeguarding Foundational Capabilities of LLMs Against Model Stealing in Edge Deployment
- Authors: Qinfeng Li, Yangfan Xie, Tianyu Du, Zhiqiang Shen, Zhenghan Qin, Hao Peng, Xinkui Zhao, Xianwei Zhu, Jianwei Yin, Xuhong Zhang
- Abstract summary: CoreGuard is a computation- and communication-efficient model protection approach against model stealing on edge devices.
We show that CoreGuard matches black-box security guarantees with negligible overhead.
- Score: 43.53211005936295
- License:
- Abstract: Proprietary large language models (LLMs) demonstrate exceptional generalization ability across various tasks. Additionally, deploying LLMs on edge devices is increasingly popular for efficiency and privacy reasons. However, edge deployment of proprietary LLMs introduces new security threats: attackers who obtain an edge-deployed LLM can easily use it as a base model for various tasks due to its high generalization ability, which we call foundational capability stealing. Unfortunately, existing model protection mechanisms are often task-specific and fail to protect general-purpose LLMs, as they mainly focus on protecting task-related parameters using trusted execution environments (TEEs). Although some recent TEE-based methods can protect the overall model parameters in a computation-efficient way, they still suffer from prohibitive communication costs between the TEE and CPU/GPU, making them impractical for edge LLM deployment. To protect the foundational capabilities of edge LLMs, we propose CoreGuard, a computation- and communication-efficient model protection approach against model stealing on edge devices. The core component of CoreGuard is a lightweight and propagative authorization module residing in the TEE. Extensive experiments show that CoreGuard matches black-box security guarantees with negligible overhead.
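The abstract describes the mechanism only at a high level. As a rough illustration of the general pattern behind TEE-resident authorization modules, a minimal sketch follows; the permutation-based weight locking, the single-layer setting, and all names and dimensions below are illustrative assumptions, not CoreGuard's actual design.

```python
import numpy as np

# Illustrative sketch only, not CoreGuard's actual mechanism. The general pattern shared
# by TEE-based locking schemes: ship "locked" weights (transformed with a secret
# invertible matrix) to the untrusted CPU/GPU, and keep a lightweight module inside the
# TEE that undoes the transform on activations, so the plaintext weights never leave the
# TEE's control and the locked model is useless without authorization.

rng = np.random.default_rng(0)
d_in, d_out = 16, 16

W = rng.normal(size=(d_out, d_in))        # proprietary layer weights (plaintext)
P = np.eye(d_in)[rng.permutation(d_in)]   # secret invertible transform, held in the TEE
W_locked = W @ P                          # what actually ships to the edge device

def untrusted_layer(x):
    """Runs outside the TEE, on the locked weights only."""
    return W_locked @ x

def tee_authorize(x):
    """Lightweight TEE-resident step: pre-apply the inverse transform to the activation."""
    return P.T @ x                        # P is a permutation matrix, so P^-1 = P^T

x = rng.normal(size=d_in)
authorized = untrusted_layer(tee_authorize(x))   # with the TEE: matches the original model
unauthorized = untrusted_layer(x)                # attacker holding only the locked weights

print(np.allclose(authorized, W @ x))    # True
print(np.allclose(unauthorized, W @ x))  # False: stolen locked weights alone are useless
```

In this toy setting the TEE only handles a d-dimensional activation per layer rather than the full weight matrix, which is the kind of computation and communication saving the abstract emphasizes.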
Related papers
- Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks [88.84977282952602]
A high volume of recent ML security literature focuses on attacks against aligned large language models (LLMs).
In this paper, we analyze security and privacy vulnerabilities that are unique to LLM agents.
We conduct a series of illustrative attacks on popular open-source and commercial agents, demonstrating the immediate practical implications of their vulnerabilities.
arXiv Detail & Related papers (2025-02-12T17:19:36Z)
- Lightweight Safety Guardrails Using Fine-tuned BERT Embeddings [12.80474396835751]
We develop a lightweight guardrail architecture built on fine-tuned BERT embeddings rather than a full LLM.
This method reduces the model size from LlamaGuard's 7 billion parameters to approximately 67 million.
It maintains comparable performance on the AEGIS safety benchmark.
arXiv Detail & Related papers (2024-11-21T18:27:25Z)
- Position: On-Premises LLM Deployment Demands a Middle Path: Preserving Privacy Without Sacrificing Model Confidentiality [18.575663556525864]
We argue that deploying closed-source LLMs within user-controlled infrastructure enhances data privacy and mitigates misuse risks.
A well-designed on-premises deployment must ensure model confidentiality -- by preventing model theft -- and offer privacy-preserving customization.
Our findings demonstrate that privacy and confidentiality can coexist, paving the way for secure on-premises AI deployment.
arXiv Detail & Related papers (2024-10-15T02:00:36Z)
- AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents [84.96249955105777]
LLM agents may pose a greater risk if misused, but their robustness remains underexplored.
We propose a new benchmark called AgentHarm to facilitate research on LLM agent misuse.
We find leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking.
arXiv Detail & Related papers (2024-10-11T17:39:22Z)
- LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models [15.900125475191958]
Guardrails have emerged as an alternative to safety alignment for content moderation of large language models (LLMs).
We introduce LoRA-Guard, a parameter-efficient guardrail adaptation method that relies on knowledge sharing between LLMs and guardrail models.
We show that LoRA-Guard outperforms existing approaches with 100-1000x lower parameter overhead while maintaining accuracy, enabling on-device content moderation.
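As a rough illustration of the parameter-efficiency claim, the sketch below shows a generic LoRA-style low-rank adapter attached to a frozen projection; the dimensions, initialization, and names are illustrative assumptions and do not reproduce LoRA-Guard's actual architecture.

```python
import numpy as np

# Generic LoRA-style adapter sketch (illustrative only): the base projection stays frozen
# and shared with the LLM, while only the low-rank factors A and B would be trained for
# the guardrail task, giving the large reduction in trainable parameters the summary
# refers to.

rng = np.random.default_rng(0)
d, r = 64, 4                          # hidden size and low-rank bottleneck (assumed values)

W_frozen = rng.normal(size=(d, d))    # frozen projection shared with the base LLM
A = 0.01 * rng.normal(size=(r, d))    # trainable low-rank factor
B = np.zeros((d, r))                  # zero-initialized so the adapter starts as a no-op

def forward(x):
    # Base path reuses the frozen weights; the adapter adds a rank-r correction.
    return W_frozen @ x + B @ (A @ x)

x = rng.normal(size=d)
print(forward(x).shape)                                        # (64,)
print(2 * d * r, "trainable vs.", d * d, "frozen parameters")  # 512 vs. 4096
```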
arXiv Detail & Related papers (2024-07-03T10:38:40Z)
- GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning [79.07152553060601]
Existing methods for enhancing the safety of large language models (LLMs) are not directly transferable to LLM-powered agents.
We propose GuardAgent, the first LLM agent as a guardrail to other LLM agents.
GuardAgent comprises two steps: 1) creating a task plan by analyzing the provided guard requests, and 2) generating guardrail code based on the task plan and executing the code by calling APIs or using external engines.
arXiv Detail & Related papers (2024-06-13T14:49:26Z)
- Chain-of-Scrutiny: Detecting Backdoor Attacks for Large Language Models [35.77228114378362]
Backdoored Large Language Models (LLMs) generate malicious outputs when inputs contain specific "triggers" set by attackers.
Traditional defense strategies are impractical for API-accessible LLMs due to limited model access, high computational costs, and data requirements.
We propose Chain-of-Scrutiny (CoS), which leverages LLMs' unique reasoning abilities to mitigate backdoor attacks.
arXiv Detail & Related papers (2024-06-10T00:53:25Z)
- Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs).
Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs.
Empirical results on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z)
- TransLinkGuard: Safeguarding Transformer Models Against Model Stealing in Edge Deployment [34.8682729537795]
We propose TransLinkGuard, a plug-and-play model protection approach against model stealing on edge devices.
The core part of TransLinkGuard is a lightweight authorization module residing in a secure environment.
Extensive experiments show that TransLinkGuard matches black-box security guarantees with negligible overhead.
arXiv Detail & Related papers (2024-04-17T07:08:45Z)
- Baseline Defenses for Adversarial Attacks Against Aligned Language Models [109.75753454188705]
Recent work shows that text optimizers can produce jailbreaking prompts that bypass moderation and alignment.
We look at three types of defenses: detection (perplexity based; a minimal sketch appears after this entry), input preprocessing (paraphrase and retokenization), and adversarial training.
We find that the weakness of existing discrete optimizers for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs.
arXiv Detail & Related papers (2023-09-01T17:59:44Z)
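As a rough illustration of the perplexity-based detection baseline mentioned in the entry above, here is a minimal sketch; `token_logprobs` is a hypothetical placeholder for a real language-model scoring call, faked below with a toy unigram model, and the threshold is an arbitrary illustrative value.

```python
import math

# Minimal perplexity-filter sketch (illustrative only). Adversarial suffixes found by
# discrete optimization tend to look like high-perplexity gibberish, so prompts whose
# perplexity exceeds a threshold are flagged for rejection or review.

TOY_COUNTS = {"the": 50, "cat": 10, "sat": 8, "on": 30, "mat": 5}
TOTAL = sum(TOY_COUNTS.values()) + 1  # +1 smoothing mass for unknown tokens

def token_logprobs(tokens):
    # Hypothetical stand-in: a real deployment would query an LM for per-token log-probs.
    return [math.log(TOY_COUNTS.get(t, 1) / TOTAL) for t in tokens]

def perplexity(tokens):
    lps = token_logprobs(tokens)
    return math.exp(-sum(lps) / len(lps))

def is_suspicious(prompt, threshold=50.0):
    return perplexity(prompt.lower().split()) > threshold

print(is_suspicious("the cat sat on the mat"))   # False: ordinary text
print(is_suspicious("zx!!qq vv@@k wq##p"))       # True: gibberish-like adversarial suffix
```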