Related papers: Lightweight Safety Guardrails Using Fine-tuned BERT Embeddings

Related papers

Taxonomy-Adaptive Moderation Model with Robust Guardrails for Large Language Models [3.710103086278309]
Large Language Models (LLMs) are typically aligned for safety during the post-training phase.<n>They may still generate inappropriate outputs that could potentially pose risks to users.<n>This challenge underscores the need for robust safeguards that operate across both model inputs and outputs.
arXiv Detail & Related papers (2025-12-05T00:43:55Z)
PSM: Prompt Sensitivity Minimization via LLM-Guided Black-Box Optimization [0.0]
This paper introduces a novel framework for hardening system prompts through shield appending.<n>We leverage an LLM-as-optimizer to search the space of possible SHIELDs, seeking to minimize a leakage metric derived from a suite of adversarial attacks.<n>We demonstrate empirically that our optimized SHIELDs significantly reduce prompt leakage against a comprehensive set of extraction attacks.
arXiv Detail & Related papers (2025-11-20T10:25:45Z)
Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads [104.9566359759396]
We propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores.<n>Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification.
arXiv Detail & Related papers (2025-11-09T03:38:29Z)
SBASH: a Framework for Designing and Evaluating RAG vs. Prompt-Tuned LLM Honeypots [0.0]
Honeypots are decoy systems used for gathering valuable threat intelligence or diverting attackers away from production systems.<n>We propose the System-Based Attention Shell Honeypot framework which manages data-protection issues through the use of lightweight local LLMs.
arXiv Detail & Related papers (2025-10-24T13:41:52Z)
FedShield-LLM: A Secure and Scalable Federated Fine-Tuned Large Language Model [0.48342038441006796]
Federated Learning (FL) offers a decentralized framework for training and fine-tuning Large Language Models (LLMs)<n>FL addresses privacy and security concerns while navigating challenges associated with the substantial computational demands of LLMs.<n>We propose a novel method, FedShield-LLM, that uses pruning with Fully Homomorphic Encryption (FHE) for Low-Rank Adaptation (LoRA) parameters.
arXiv Detail & Related papers (2025-06-06T00:05:05Z)
LiteLMGuard: Seamless and Lightweight On-Device Prompt Filtering for Safeguarding Small Language Models against Quantization-induced Risks and Vulnerabilities [1.460362586787935]
LiteLMGuard (LLMG) provides real-time, prompt-level defense for quantized SLMs.<n>LLMG formalizes prompt filtering as a deep learning (DL)-based prompt answerability classification task.<n>LLMG defends against over 87% of harmful prompts, including both direct instruction and jailbreak attack strategies.
arXiv Detail & Related papers (2025-05-08T19:58:41Z)
Enhancing Smart Contract Vulnerability Detection in DApps Leveraging Fine-Tuned LLM [0.7018579932647147]
Decentralized applications (DApps) face significant security risks due to vulnerabilities in smart contracts. This paper proposes a novel approach leveraging fine-tuned Large Language Models (LLMs) to enhance smart contract vulnerability detection.
arXiv Detail & Related papers (2025-04-07T12:32:14Z)
Adversarial Reasoning at Jailbreaking Time [49.70772424278124]
We develop an adversarial reasoning approach to automatic jailbreaking via test-time computation. Our approach introduces a new paradigm in understanding LLM vulnerabilities, laying the foundation for the development of more robust and trustworthy AI systems.
arXiv Detail & Related papers (2025-02-03T18:59:01Z)
Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging [43.44112117935541]
Fine-tuning large language models (LLMs) for downstream tasks often leads to safety degradation in safety-aligned LLMs. We propose a method that maintains the inherent safety of LLMs while enhancing their downstream task performance.
arXiv Detail & Related papers (2024-12-27T08:03:22Z)
Adversarial Vulnerabilities in Large Language Models for Time Series Forecasting [14.579802892916101]
Large Language Models (LLMs) have recently demonstrated significant potential in time series forecasting. However, their robustness and reliability in real-world applications remain under-explored. We introduce a targeted adversarial attack framework for LLM-based time series forecasting.
arXiv Detail & Related papers (2024-12-11T04:53:15Z)
ADVLLM: Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities [63.603861880022954]
We introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability.<n>Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs.<n>It exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3.
arXiv Detail & Related papers (2024-10-24T06:36:12Z)
CoreGuard: Safeguarding Foundational Capabilities of LLMs Against Model Stealing in Edge Deployment [43.53211005936295]
CoreGuard is a computation- and communication-efficient model protection approach against model stealing on edge devices. We show that CoreGuard achieves the same security protection as the black-box security guarantees with negligible overhead.
arXiv Detail & Related papers (2024-10-16T08:14:24Z)
Tamper-Resistant Safeguards for Open-Weight LLMs [57.90526233549399]
We develop a method for building tamper-resistant safeguards into open-weight LLMs. We find that our method greatly improves tamper-resistance while preserving benign capabilities. Our results demonstrate that tamper-resistance is a tractable problem.
arXiv Detail & Related papers (2024-08-01T17:59:12Z)
LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models [15.900125475191958]
Guardrails have emerged as an alternative to safety alignment for content moderation of large language models (LLMs) We introduce LoRA-Guard, a parameter-efficient guardrail adaptation method that relies on knowledge sharing between LLMs and guardrail models. We show that LoRA-Guard outperforms existing approaches with 100-1000x lower parameter overhead while maintaining accuracy, enabling on-device content moderation.
arXiv Detail & Related papers (2024-07-03T10:38:40Z)
Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs) Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs. Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z)
A Framework for Real-time Safeguarding the Text Generation of Large Language Model [12.683042228674694]
Large Language Models (LLMs) have significantly advanced natural language processing (NLP) tasks. They pose ethical and societal risks due to their propensity to generate harmful content. We propose LLMSafeGuard, a lightweight framework to safeguard LLM text generation in real-time.
arXiv Detail & Related papers (2024-04-29T18:40:01Z)
RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content [62.685566387625975]
Current mitigation strategies, while effective, are not resilient under adversarial attacks. This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently moderate harmful and unsafe inputs.
arXiv Detail & Related papers (2024-03-19T07:25:02Z)
Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation [98.02846901473697]
We propose ECSO (Eyes Closed, Safety On), a training-free protecting approach that exploits the inherent safety awareness of MLLMs. ECSO generates safer responses via adaptively transforming unsafe images into texts to activate the intrinsic safety mechanism of pre-aligned LLMs.
arXiv Detail & Related papers (2024-03-14T17:03:04Z)
MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT [87.4910758026772]
"Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development. This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource constrained devices.
arXiv Detail & Related papers (2024-02-26T18:59:03Z)
Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs [67.38165028487242]
We introduce Dynamic Sparse No Training (DSnoT), a training-free fine-tuning approach to fine-tune large language models (LLMs) Inspired by the Dynamic Sparse Training, DSnoT minimizes the reconstruction error between the dense and sparse LLMs. Our paper offers fresh insights into how to fine-tune sparse LLMs in an efficient training-free manner and open new venues to scale the great potential of sparsity to LLMs.
arXiv Detail & Related papers (2023-10-13T07:38:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.