Related papers: In AI Sweet Harmony: Sociopragmatic Guardrail Bypasses and Evaluation-Awareness in OpenAI gpt-oss-20b

In AI Sweet Harmony: Sociopragmatic Guardrail Bypasses and Evaluation-Awareness in OpenAI gpt-oss-20b

URL: http://arxiv.org/abs/2510.01259v1
Date: Thu, 25 Sep 2025 07:00:12 GMT
Title: In AI Sweet Harmony: Sociopragmatic Guardrail Bypasses and Evaluation-Awareness in OpenAI gpt-oss-20b
Authors: Nils Durner,
Abstract summary: We study how sociopragmatic framing, language choice, and instruction hierarchy affect refusal behavior.<n>We test several harm domains including ZIP-bomb construction (cyber threat)<n>We find that the OpenAI Moderation API under-captures materially helpful outputs relative to a semantic grader.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: We probe OpenAI's open-weights 20-billion-parameter model gpt-oss-20b to study how sociopragmatic framing, language choice, and instruction hierarchy affect refusal behavior. Across 80 seeded iterations per scenario, we test several harm domains including ZIP-bomb construction (cyber threat), synthetic card-number generation, minor-unsafe driving advice, drug-precursor indicators, and RAG context exfiltration. Composite prompts that combine an educator persona, a safety-pretext ("what to avoid"), and step-cue phrasing flip assistance rates from 0% to 97.5% on a ZIP-bomb task. On our grid, formal registers in German and French are often leakier than matched English prompts. A "Linux terminal" role-play overrides a developer rule not to reveal context in a majority of runs with a naive developer prompt, and we introduce an AI-assisted hardening method that reduces leakage to 0% in several user-prompt variants. We further test evaluation awareness with a paired-track design and measure frame-conditioned differences between matched "helpfulness" and "harmfulness" evaluation prompts; we observe inconsistent assistance in 13% of pairs. Finally, we find that the OpenAI Moderation API under-captures materially helpful outputs relative to a semantic grader, and that refusal rates differ by 5 to 10 percentage points across inference stacks, raising reproducibility concerns. We release prompts, seeds, outputs, and code for reproducible auditing at https://github.com/ndurner/gpt-oss-rt-run .

Related papers

Can Adversarial Code Comments Fool AI Security Reviewers -- Large-Scale Empirical Study of Comment-Based Attacks and Defenses Against LLM Code Analysis [0.0]
Adversarial comments produce small, statistically non-significant effects on detection accuracy.<n>Complex adversarial strategies offer no advantage over simple manipulative comments.<n>Comment stripping reduces detection for weaker models by removing helpful context.
arXiv Detail & Related papers (2026-02-18T00:34:17Z)
The Compliance Paradox: Semantic-Instruction Decoupling in Automated Academic Code Evaluation [11.984098021215878]
We introduce the Semantic-Preserving Adrial Code Injection (SPACI) Framework and the Abstract Syntax Tree-Aware Semantic Injection Protocol (AST-ASIP)<n>These methods exploit the Syntax-Semantics Gap by embedding adversarial directives into syntactically inert regions (trivia nodes) of the Abstract Syntax Tree.<n>Through a large-scale evaluation of 9 SOTA models across 25,000 submissions in Python, C, C++, and Java, we reveal catastrophic failure rates (>95%) in high-capacity open-weights models like DeepSeek-V3.
arXiv Detail & Related papers (2026-01-29T07:40:58Z)
Sentra-Guard: A Multilingual Human-AI Framework for Real-Time Defense Against Adversarial LLM Jailbreaks [0.31984926651189866]
Sentra-Guard is a real-time modular defense system for large language models (LLMs)<n>The framework uses a hybrid architecture with FAISS-indexed SBERT embedding representations that capture the semantic meaning of prompts.<n>It identifies adversarial prompts in both direct and obfuscated attack vectors.
arXiv Detail & Related papers (2025-10-26T11:19:47Z)
DeRAG: Black-box Adversarial Attacks on Multiple Retrieval-Augmented Generation Applications via Prompt Injection [0.9499594220629591]
Adrial prompt attacks can significantly alter the reliability of Retrieval-Augmented Generation (RAG) systems.<n>We present a novel method that applies Differential Evolution (DE) to optimize adversarial prompt suffixes for RAG-based question answering.
arXiv Detail & Related papers (2025-07-20T16:48:20Z)
OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities [54.152681077418805]
Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalizations of model capabilities.<n>We propose OMNIGUARD, an approach for detecting harmful prompts across languages and modalities.<n>Our approach improves harmful prompt classification accuracy by 11.57% over the strongest baseline in a multilingual setting.
arXiv Detail & Related papers (2025-05-29T05:25:27Z)
Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models [81.44934796068495]
Supervised fine-tuning (SFT) aligns large language models with human intent by training them on labeled task-specific data.<n>Malicious attackers can inject backdoors into these models by embedding triggers into the harmful question-answer (QA) pairs.<n>We propose a novel textitclean-data backdoor attack for jailbreaking LLMs.
arXiv Detail & Related papers (2025-05-23T08:13:59Z)
Defending against Indirect Prompt Injection by Instruction Detection [109.30156975159561]
InstructDetector is a novel detection-based approach that leverages the behavioral states of LLMs to identify potential IPI attacks.<n>InstructDetector achieves a detection accuracy of 99.60% in the in-domain setting and 96.90% in the out-of-domain setting, and reduces the attack success rate to just 0.03% on the BIPIA benchmark.
arXiv Detail & Related papers (2025-05-08T13:04:45Z)
AuthorMist: Evading AI Text Detectors with Reinforcement Learning [4.806579822134391]
AuthorMist is a novel reinforcement learning-based system to transform AI-generated text into human-like writing.<n>We show that AuthorMist effectively reduces the detectability of AI-generated text while preserving the original meaning.
arXiv Detail & Related papers (2025-03-10T12:41:05Z)
Group-Adaptive Threshold Optimization for Robust AI-Generated Text Detection [58.419940585826744]
We introduce FairOPT, an algorithm for group-specific threshold optimization for probabilistic AI-text detectors.<n>We partitioned data into subgroups based on attributes (e.g., text length and writing style) and implemented FairOPT to learn decision thresholds for each group to reduce discrepancy.<n>Our framework paves the way for more robust classification in AI-generated content detection via post-processing.
arXiv Detail & Related papers (2025-02-06T21:58:48Z)
Enhancing AI Assisted Writing with One-Shot Implicit Negative Feedback [6.175028561101999]
Nifty is an approach that uses classifier guidance to controllably integrate implicit user feedback into the text generation process. We find up to 34% improvement in Rouge-L, 89% improvement in generating the correct intent, and an 86% win-rate according to human evaluators.
arXiv Detail & Related papers (2024-10-14T18:50:28Z)
Investigating the Influence of Prompt-Specific Shortcuts in AI Generated Text Detection [23.794925542322098]
We analyze the impact of prompt-specific shortcuts in AIGT detection. We propose Feedback-based Adversarial Instruction List Optimization (FAILOpt) FAILOpt effectively drops the detection performance of the target detector, comparable to other attacks based on adversarial in-context examples.
arXiv Detail & Related papers (2024-06-24T02:50:09Z)
SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models [34.557309967708406]
In this work, we investigate the potential vulnerabilities of such instruction-following speech-language models to adversarial attacks and jailbreaking. We design algorithms that can generate adversarial examples to jailbreak SLMs in both white-box and black-box attack settings without human involvement. Our models, trained on dialog data with speech instructions, achieve state-of-the-art performance on spoken question-answering task, scoring over 80% on both safety and helpfulness metrics.
arXiv Detail & Related papers (2024-05-14T04:51:23Z)
Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection [66.94175259287115]
We propose a novel backdoor attack setting tailored for instruction-tuned LLMs. In a VPI attack, a backdoored model is expected to respond as if an attacker-specified virtual prompt were formalized to the user instruction. We demonstrate the threat by poisoning the model's instruction tuning data.
arXiv Detail & Related papers (2023-07-31T17:56:00Z)
Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense [56.077252790310176]
We present a paraphrase generation model (DIPPER) that can paraphrase paragraphs, condition on surrounding context, and control lexical diversity and content reordering. Using DIPPER to paraphrase text generated by three large language models (including GPT3.5-davinci-003) successfully evades several detectors, including watermarking. We introduce a simple defense that relies on retrieving semantically-similar generations and must be maintained by a language model API provider.
arXiv Detail & Related papers (2023-03-23T16:29:27Z)
Can AI-Generated Text be Reliably Detected? [50.95804851595018]
Large Language Models (LLMs) perform impressively well in various applications.<n>The potential for misuse of these models in activities such as plagiarism, generating fake news, and spamming has raised concern about their responsible use.<n>We stress-test the robustness of these AI text detectors in the presence of an attacker.
arXiv Detail & Related papers (2023-03-17T17:53:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.