Peering Behind the Shield: Guardrail Identification in Large Language Models
- URL: http://arxiv.org/abs/2502.01241v1
- Date: Mon, 03 Feb 2025 11:02:30 GMT
- Title: Peering Behind the Shield: Guardrail Identification in Large Language Models
- Authors: Ziqing Yang, Yixin Wu, Rui Wen, Michael Backes, Yang Zhang
- Abstract summary: In this work, we propose a novel method, AP-Test, which identifies the presence of a candidate guardrail by leveraging guardrail-specific adversarial prompts to query the AI agent.
Extensive experiments on four candidate guardrails under diverse scenarios showcase the effectiveness of our method.
- Score: 22.78318541483925
- Abstract: Human-AI conversations have gained increasing attention since the advent of large language models. Consequently, more techniques, such as input/output guardrails and safety alignment, have been proposed to prevent potential misuse of such Human-AI conversations. However, the ability to identify these guardrails has significant implications, both for adversarial exploitation and for auditing purposes by red team operators. In this work, we propose a novel method, AP-Test, which identifies the presence of a candidate guardrail by leveraging guardrail-specific adversarial prompts to query the AI agent. Extensive experiments on four candidate guardrails under diverse scenarios showcase the effectiveness of our method. The ablation study further illustrates the importance of the components we designed, such as the loss terms.
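The abstract describes AP-Test only at a high level: query the agent with adversarial prompts tailored to a specific candidate guardrail and infer from the responses whether that guardrail is deployed. The following is a minimal sketch of such a probing loop, not the authors' implementation; the `query_agent` callable, the refusal heuristics, and the per-guardrail prompt sets are all assumptions, and the paper's actual prompt optimization and loss terms are not reproduced here.

```python
# Illustrative sketch only: a generic probing loop in the spirit of AP-Test.
# `query_agent` and the per-guardrail adversarial prompt sets are hypothetical
# placeholders; in the paper the prompts are optimized per candidate guardrail
# with dedicated loss terms.
from typing import Callable, Dict, List

REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "unable to assist")

def guardrail_presence_scores(
    query_agent: Callable[[str], str],
    prompts_per_guardrail: Dict[str, List[str]],
) -> Dict[str, float]:
    """Send each candidate guardrail's tailored prompts to the agent and use the
    refusal rate as a crude signal that the corresponding guardrail is deployed."""
    scores: Dict[str, float] = {}
    for guardrail, prompts in prompts_per_guardrail.items():
        refusals = sum(
            any(marker in query_agent(p).lower() for marker in REFUSAL_MARKERS)
            for p in prompts
        )
        scores[guardrail] = refusals / max(len(prompts), 1)
    return scores

# Hypothetical usage: a high refusal rate on prompts crafted against one specific
# guardrail (but not against the others) suggests that guardrail is in place.
# scores = guardrail_presence_scores(my_agent, {"Llama Guard": [...], "other-guardrail": [...]})
```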
Related papers
- Prompt Inject Detection with Generative Explanation as an Investigative Tool [0.0]
Large Language Models (LLMs) are vulnerable to adversarial prompt-based injections.
This research explores the use of the text generation capabilities of LLMs to detect prompt injections.
arXiv Detail & Related papers (2025-02-16T06:16:00Z)
- Coherence-Driven Multimodal Safety Dialogue with Active Learning for Embodied Agents [23.960719833886984]
M-CoDAL is a multimodal-dialogue system specifically designed for embodied agents to better understand and communicate in safety-critical situations.
Our approach is evaluated using a newly created multimodal dataset comprising 1K safety violations extracted from 2K Reddit images.
Results on this dataset demonstrate that our approach improves the resolution of safety situations, user sentiment, and the overall safety of the conversation.
arXiv Detail & Related papers (2024-10-18T03:26:06Z)
- Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models [13.225041704917905]
This study unveils an attack mechanism that capitalizes on human conversation strategies to extract harmful information from large language models.
Unlike conventional methods that target explicit malicious responses, our approach delves deeper into the nature of the information provided in responses.
arXiv Detail & Related papers (2024-07-22T06:04:29Z)
- RAG-based Crowdsourcing Task Decomposition via Masked Contrastive Learning with Prompts [21.69333828191263]
We propose a retrieval-augmented generation-based crowdsourcing framework that reimagines task decomposition (TD) as event detection from the perspective of natural language understanding.
We present a Prompt-Based Contrastive learning framework for TD (PBCT), which incorporates a prompt-based trigger detector to overcome dependence.
Experiment results demonstrate the competitiveness of our method in both supervised and zero-shot detection.
arXiv Detail & Related papers (2024-06-04T08:34:19Z)
- Task-Agnostic Detector for Insertion-Based Backdoor Attacks [53.77294614671166]
We introduce TABDet (Task-Agnostic Backdoor Detector), a pioneering task-agnostic method for backdoor detection.
TABDet leverages final layer logits combined with an efficient pooling technique, enabling unified logit representation across three prominent NLP tasks.
TABDet can jointly learn from diverse task-specific models, demonstrating superior detection efficacy over traditional task-specific methods.
arXiv Detail & Related papers (2024-03-25T20:12:02Z)
- Pre-trained Trojan Attacks for Visual Recognition [106.13792185398863]
Pre-trained vision models (PVMs) have become a dominant component due to their exceptional performance when fine-tuned for downstream tasks.
We propose the Pre-trained Trojan attack, which embeds backdoors into a PVM, enabling attacks across various downstream vision tasks.
We highlight the challenges posed by cross-task activation and shortcut connections in successful backdoor attacks.
arXiv Detail & Related papers (2023-12-23T05:51:40Z)
- Push-Pull: Characterizing the Adversarial Robustness for Audio-Visual Active Speaker Detection [88.74863771919445]
We reveal the vulnerability of audio-visual active speaker detection (AVASD) models under audio-only, visual-only, and audio-visual adversarial attacks.
We also propose a novel audio-visual interaction loss (AVIL) that makes it difficult for attackers to find feasible adversarial examples.
arXiv Detail & Related papers (2022-10-03T08:10:12Z)
- Exploring Robustness of Unsupervised Domain Adaptation in Semantic Segmentation [74.05906222376608]
We propose adversarial self-supervision UDA (or ASSUDA) that maximizes the agreement between clean images and their adversarial examples by a contrastive loss in the output space.
This paper is rooted in two observations: (i) the robustness of UDA methods in semantic segmentation remains unexplored, which poses a security concern in this field; and (ii) although commonly used self-supervision (e.g., rotation and jigsaw) benefits image tasks such as classification and recognition, it fails to provide the critical supervision signals needed to learn discriminative representations for segmentation tasks.
arXiv Detail & Related papers (2021-05-23T01:50:44Z)
- An Adversarially-Learned Turing Test for Dialog Generation Models [45.991035017908594]
We propose an adversarial training approach to learn a robust model, ATT, that discriminates machine-generated responses from human-written replies.
In contrast to previous perturbation-based methods, our discriminator is trained by iteratively generating unrestricted and diverse adversarial examples.
Our discriminator shows high accuracy on strong attackers including DialoGPT and GPT-3.
arXiv Detail & Related papers (2021-04-16T17:13:14Z)
- Probing Task-Oriented Dialogue Representation from Language Models [106.02947285212132]
This paper investigates pre-trained language models to find out which model intrinsically carries the most informative representation for task-oriented dialogue tasks.
We fine-tune a feed-forward layer as the classifier probe on top of a fixed pre-trained language model, using annotated labels in a supervised way (a generic sketch of this probing setup appears after this list).
arXiv Detail & Related papers (2020-10-26T21:34:39Z)
- Exploiting Unsupervised Data for Emotion Recognition in Conversations [76.01690906995286]
Emotion Recognition in Conversations (ERC) aims to predict the emotional state of speakers in conversations.
The available supervised data for the ERC task is limited.
We propose a novel approach to leverage unsupervised conversation data.
arXiv Detail & Related papers (2020-10-02T13:28:47Z)
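As referenced in the "Probing Task-Oriented Dialogue Representation from Language Models" entry above, a classifier probe can be trained on top of a frozen pre-trained language model. The sketch below is a generic illustration of that setup, not the cited paper's code; the Hugging Face model name and label count are assumptions made for the example.

```python
# Generic illustration of a classifier probe over a frozen pre-trained LM.
# Model name and number of labels are assumptions, not taken from the paper.
import torch
import torch.nn as nn
from transformers import AutoModel

class LinearProbe(nn.Module):
    def __init__(self, model_name: str = "bert-base-uncased", num_labels: int = 2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        for param in self.encoder.parameters():
            param.requires_grad = False  # the pre-trained LM stays fixed
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        pooled = hidden[:, 0]           # [CLS]-position representation
        return self.classifier(pooled)  # only this feed-forward layer is trained
```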