Lightweight Safety Guardrails Using Fine-tuned BERT Embeddings
- URL: http://arxiv.org/abs/2411.14398v1
- Date: Thu, 21 Nov 2024 18:27:25 GMT
- Title: Lightweight Safety Guardrails Using Fine-tuned BERT Embeddings
- Authors: Aaron Zheng, Mansi Rana, Andreas Stolcke
- Abstract summary: We fine-tune a lightweight architecture, Sentence-BERT, to serve as a safety guardrail for LLMs.
This method reduces the model size from LlamaGuard's 7 billion parameters to approximately 67 million.
It maintains comparable performance on the AEGIS safety benchmark.
- Score: 12.80474396835751
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   With the recent proliferation of large language models (LLMs), enterprises have been able to rapidly develop proof-of-concepts and prototypes. As a result, there is a growing need to implement robust guardrails that monitor, quantize and control an LLM's behavior, ensuring that the use is reliable, safe, accurate and also aligned with the users' expectations. Previous approaches for filtering out inappropriate user prompts or system outputs, such as LlamaGuard and OpenAI's MOD API, have achieved significant success by fine-tuning existing LLMs. However, using fine-tuned LLMs as guardrails introduces increased latency and higher maintenance costs, which may not be practical or scalable for cost-efficient deployments. We take a different approach, focusing on fine-tuning a lightweight architecture: Sentence-BERT. This method reduces the model size from LlamaGuard's 7 billion parameters to approximately 67 million, while maintaining comparable performance on the AEGIS safety benchmark. 
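To put the size claim in the abstract into concrete terms, here is a minimal back-of-the-envelope sketch. The two parameter counts are taken from the abstract; the 2-bytes-per-parameter figure (16-bit weights) and the helper names are assumptions made for illustration only, and the estimate covers weights alone, not activations or runtime overhead.

```python
# Rough size comparison of the two guardrail models named in the abstract.
# Parameter counts come from the abstract; BYTES_PER_PARAM is an assumption
# (16-bit weights) and is not stated in the paper.

LLAMAGUARD_PARAMS = 7_000_000_000  # "7 billion parameters"
SBERT_PARAMS = 67_000_000          # "approximately 67 million"
BYTES_PER_PARAM = 2                # assumed fp16/bf16 storage


def reduction_factor(large: int, small: int) -> float:
    """How many times smaller the small model is, by parameter count."""
    return large / small


def weight_bytes(n_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> int:
    """Approximate storage for the weights alone, in bytes."""
    return n_params * bytes_per_param


if __name__ == "__main__":
    factor = reduction_factor(LLAMAGUARD_PARAMS, SBERT_PARAMS)
    print(f"Parameter reduction: about {factor:.0f}x")
    print(f"LlamaGuard weights:  about {weight_bytes(LLAMAGUARD_PARAMS) / 1e9:.0f} GB")
    print(f"Sentence-BERT weights: about {weight_bytes(SBERT_PARAMS) / 1e6:.0f} MB")
```

Under these assumptions the reduction is on the order of 100x in parameter count, which is the source of the latency and cost argument in the abstract.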
 
      
Related papers
- FedShield-LLM: A Secure and Scalable Federated Fine-Tuned Large Language Model [0.48342038441006796]
 Federated Learning (FL) offers a decentralized framework for training and fine-tuning Large Language Models (LLMs).
FL addresses privacy and security concerns while navigating challenges associated with the substantial computational demands of LLMs.
We propose a novel method, FedShield-LLM, that uses pruning with Fully Homomorphic Encryption (FHE) for Low-Rank Adaptation (LoRA) parameters.
 arXiv  Detail & Related papers  (2025-06-06T00:05:05Z)
- LiteLMGuard: Seamless and Lightweight On-Device Prompt Filtering for Safeguarding Small Language Models against Quantization-induced Risks and Vulnerabilities [1.460362586787935]
 LiteLMGuard (LLMG) provides real-time, prompt-level defense for quantized SLMs.
LLMG formalizes prompt filtering as a deep learning (DL)-based prompt answerability classification task.
LLMG defends against over 87% of harmful prompts, including both direct instruction and jailbreak attack strategies.
 arXiv  Detail & Related papers  (2025-05-08T19:58:41Z)
- Enhancing Smart Contract Vulnerability Detection in DApps Leveraging Fine-Tuned LLM [0.7018579932647147]
 Decentralized applications (DApps) face significant security risks due to vulnerabilities in smart contracts.
This paper proposes a novel approach leveraging fine-tuned Large Language Models (LLMs) to enhance smart contract vulnerability detection.
 arXiv  Detail & Related papers  (2025-04-07T12:32:14Z)
- Adversarial Reasoning at Jailbreaking Time [49.70772424278124]
 We develop an adversarial reasoning approach to automatic jailbreaking via test-time computation.
Our approach introduces a new paradigm in understanding LLM vulnerabilities, laying the foundation for the development of more robust and trustworthy AI systems.
 arXiv  Detail & Related papers  (2025-02-03T18:59:01Z)
- Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging [43.44112117935541]
 Fine-tuning large language models (LLMs) for downstream tasks often leads to safety degradation in safety-aligned LLMs.
We propose a method that maintains the inherent safety of LLMs while enhancing their downstream task performance.
 arXiv  Detail & Related papers  (2024-12-27T08:03:22Z)
- Adversarial Vulnerabilities in Large Language Models for Time Series Forecasting [14.579802892916101]
 Large Language Models (LLMs) have recently demonstrated significant potential in time series forecasting.
However, their robustness and reliability in real-world applications remain under-explored.
We introduce a targeted adversarial attack framework for LLM-based time series forecasting.
 arXiv  Detail & Related papers  (2024-12-11T04:53:15Z)
- ADVLLM: Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities [63.603861880022954]
 We introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability.
Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs.
It exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3.
 arXiv  Detail & Related papers  (2024-10-24T06:36:12Z)
- CoreGuard: Safeguarding Foundational Capabilities of LLMs Against Model Stealing in Edge Deployment [43.53211005936295]
 CoreGuard is a computation- and communication-efficient model protection approach against model stealing on edge devices.
We show that CoreGuard achieves the same security protection as the black-box security guarantees with negligible overhead.
 arXiv  Detail & Related papers  (2024-10-16T08:14:24Z)
- Tamper-Resistant Safeguards for Open-Weight LLMs [57.90526233549399]
 We develop a method for building tamper-resistant safeguards into open-weight LLMs.
We find that our method greatly improves tamper-resistance while preserving benign capabilities.
Our results demonstrate that tamper-resistance is a tractable problem.
 arXiv  Detail & Related papers  (2024-08-01T17:59:12Z)
- LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models [15.900125475191958]
 Guardrails have emerged as an alternative to safety alignment for content moderation of large language models (LLMs).
We introduce LoRA-Guard, a parameter-efficient guardrail adaptation method that relies on knowledge sharing between LLMs and guardrail models.
We show that LoRA-Guard outperforms existing approaches with 100-1000x lower parameter overhead while maintaining accuracy, enabling on-device content moderation.
 arXiv  Detail & Related papers  (2024-07-03T10:38:40Z)
- Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
 This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs).
Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs.
 Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
 arXiv  Detail & Related papers  (2024-05-30T14:40:35Z)
- A Framework for Real-time Safeguarding the Text Generation of Large Language Model [12.683042228674694]
 Large Language Models (LLMs) have significantly advanced natural language processing (NLP) tasks.
They pose ethical and societal risks due to their propensity to generate harmful content.
We propose LLMSafeGuard, a lightweight framework to safeguard LLM text generation in real-time.
 arXiv  Detail & Related papers  (2024-04-29T18:40:01Z)
- RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content [62.685566387625975]
 Current mitigation strategies, while effective, are not resilient under adversarial attacks.
This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently moderate harmful and unsafe inputs.
 arXiv  Detail & Related papers  (2024-03-19T07:25:02Z)
- Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation [98.02846901473697]
 We propose ECSO (Eyes Closed, Safety On), a training-free protecting approach that exploits the inherent safety awareness of MLLMs.
ECSO generates safer responses via adaptively transforming unsafe images into texts to activate the intrinsic safety mechanism of pre-aligned LLMs.
 arXiv  Detail & Related papers  (2024-03-14T17:03:04Z)
- MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT [87.4910758026772]
 "Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development.
This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource constrained devices.
 arXiv  Detail & Related papers  (2024-02-26T18:59:03Z)
- Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs [67.38165028487242]
 We introduce Dynamic Sparse No Training (DSnoT), a training-free approach to fine-tuning large language models (LLMs).
Inspired by Dynamic Sparse Training, DSnoT minimizes the reconstruction error between the dense and sparse LLMs.
Our paper offers fresh insights into how to fine-tune sparse LLMs in an efficient training-free manner and opens new avenues to scale the great potential of sparsity to LLMs.
 arXiv  Detail & Related papers  (2023-10-13T07:38:52Z)