Lightweight Safety Guardrails Using Fine-tuned BERT Embeddings
- URL: http://arxiv.org/abs/2411.14398v1
- Date: Thu, 21 Nov 2024 18:27:25 GMT
- Title: Lightweight Safety Guardrails Using Fine-tuned BERT Embeddings
- Authors: Aaron Zheng, Mansi Rana, Andreas Stolcke,
- Abstract summary: We develop a lightweight architecture for fine-tuning language models.
This method reduces the model size from LlamaGuard's 7 billion parameters to approximately 67 million.
It maintains comparable performance on the AEGIS safety benchmark.
- Score: 12.80474396835751
- License:
- Abstract: With the recent proliferation of large language models (LLMs), enterprises have been able to rapidly develop proof-of-concepts and prototypes. As a result, there is a growing need to implement robust guardrails that monitor, quantize and control an LLM's behavior, ensuring that the use is reliable, safe, accurate and also aligned with the users' expectations. Previous approaches for filtering out inappropriate user prompts or system outputs, such as LlamaGuard and OpenAI's MOD API, have achieved significant success by fine-tuning existing LLMs. However, using fine-tuned LLMs as guardrails introduces increased latency and higher maintenance costs, which may not be practical or scalable for cost-efficient deployments. We take a different approach, focusing on fine-tuning a lightweight architecture: Sentence-BERT. This method reduces the model size from LlamaGuard's 7 billion parameters to approximately 67 million, while maintaining comparable performance on the AEGIS safety benchmark.
Related papers
- CoreGuard: Safeguarding Foundational Capabilities of LLMs Against Model Stealing in Edge Deployment [43.53211005936295]
CoreGuard is a computation- and communication-efficient model protection approach against model stealing on edge devices.
We show that CoreGuard achieves the same security protection as the black-box security guarantees with negligible overhead.
arXiv Detail & Related papers (2024-10-16T08:14:24Z) - Tamper-Resistant Safeguards for Open-Weight LLMs [57.90526233549399]
We develop a method for building tamper-resistant safeguards into open-weight LLMs.
We find that our method greatly improves tamper-resistance while preserving benign capabilities.
Our results demonstrate that tamper-resistance is a tractable problem.
arXiv Detail & Related papers (2024-08-01T17:59:12Z) - LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models [15.900125475191958]
Guardrails have emerged as an alternative to safety alignment for content moderation of large language models (LLMs)
We introduce LoRA-Guard, a parameter-efficient guardrail adaptation method that relies on knowledge sharing between LLMs and guardrail models.
We show that LoRA-Guard outperforms existing approaches with 100-1000x lower parameter overhead while maintaining accuracy, enabling on-device content moderation.
arXiv Detail & Related papers (2024-07-03T10:38:40Z) - Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs)
Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs.
Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z) - A Framework for Real-time Safeguarding the Text Generation of Large Language Model [12.683042228674694]
Large Language Models (LLMs) have significantly advanced natural language processing (NLP) tasks.
They pose ethical and societal risks due to their propensity to generate harmful content.
We propose LLMSafeGuard, a lightweight framework to safeguard LLM text generation in real-time.
arXiv Detail & Related papers (2024-04-29T18:40:01Z) - RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content [62.685566387625975]
Current mitigation strategies, while effective, are not resilient under adversarial attacks.
This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently moderate harmful and unsafe inputs.
arXiv Detail & Related papers (2024-03-19T07:25:02Z) - Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation [98.02846901473697]
We propose ECSO (Eyes Closed, Safety On), a training-free protecting approach that exploits the inherent safety awareness of MLLMs.
ECSO generates safer responses via adaptively transforming unsafe images into texts to activate the intrinsic safety mechanism of pre-aligned LLMs.
arXiv Detail & Related papers (2024-03-14T17:03:04Z) - MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT [87.4910758026772]
"Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development.
This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource constrained devices.
arXiv Detail & Related papers (2024-02-26T18:59:03Z) - Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs [67.38165028487242]
We introduce Dynamic Sparse No Training (DSnoT), a training-free fine-tuning approach to fine-tune large language models (LLMs)
Inspired by the Dynamic Sparse Training, DSnoT minimizes the reconstruction error between the dense and sparse LLMs.
Our paper offers fresh insights into how to fine-tune sparse LLMs in an efficient training-free manner and open new venues to scale the great potential of sparsity to LLMs.
arXiv Detail & Related papers (2023-10-13T07:38:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.