Ignore All Previous Instructions: Jailbreaking as a de-escalatory peace building practise to resist LLM social media bots
- URL: http://arxiv.org/abs/2603.01942v1
- Date: Mon, 02 Mar 2026 14:57:13 GMT
- Title: Ignore All Previous Instructions: Jailbreaking as a de-escalatory peace building practise to resist LLM social media bots
- Authors: Huw Day, Adrianna Jezierska, Jessica Woodgate,
- Abstract summary: Large Language Models have intensified the scale and strategic manipulation of political discourse on social media.<n>We propose a user-centric view of "jailbreaking" as an emergent, non-violent de-escalation practice.
- Score: 1.1470070927586018
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models have intensified the scale and strategic manipulation of political discourse on social media, leading to conflict escalation. The existing literature largely focuses on platform-led moderation as a countermeasure. In this paper, we propose a user-centric view of "jailbreaking" as an emergent, non-violent de-escalation practice. Online users engage with suspected LLM-powered accounts to circumvent large language model safeguards, exposing automated behaviour and disrupting the circulation of misleading narratives.
Related papers
- Jailbreaking Large Language Models through Iterative Tool-Disguised Attacks via Reinforcement Learning [26.571996871795154]
iMIST (underlineinteractive underlineMulti-step underlineProgreunderlinessive underlineTool-disguised Jailbreak Attack) is a novel adaptive jailbreak method that exploits vulnerabilities in current defense mechanisms.<n>Experiments on widely-used models demonstrate that iMIST achieves higher attack effectiveness, while maintaining low rejection rates.
arXiv Detail & Related papers (2026-01-09T01:41:39Z) - The Imitation Game: Using Large Language Models as Chatbots to Combat Chat-Based Cybercrimes [24.05325129572158]
Chat-based cybercrime has emerged as a pervasive threat.<n>Traditional defense mechanisms struggle to identify these conversational threats.<n>We present LURE, the first system to deploy Large Language Models as active agents.
arXiv Detail & Related papers (2025-12-24T05:34:05Z) - XBreaking: Understanding how LLMs security alignment can be broken [3.9140217233340544]
Large Language Models are fundamental actors in the modern IT landscape dominated by AI solutions.<n>We propose an Explainable-AI solution that comparatively analyzes the behavior of censored and uncensored models to derive unique exploitable alignment patterns.<n>Then, we propose XBreaking, a novel approach that exploits these unique patterns to break the security and alignment constraints of LLMs by targeted noise injection.
arXiv Detail & Related papers (2025-04-30T14:44:24Z) - Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework.<n>It reformulates harmful queries into benign reasoning tasks.<n>We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z) - IOHunter: Graph Foundation Model to Uncover Online Information Operations [8.532129691916348]
We introduce a methodology designed to identify users orchestrating information operations, a.k.a. IO drivers, across various influence campaigns.<n>Our framework, named IOHunter, leverages the combined strengths of Language Models and Graph Neural Networks to improve generalization in supervised, scarcely-supervised, and cross-IO contexts.<n>This research marks a step toward developing Graph Foundation Models specifically tailored for the task of IO detection on social media platforms.
arXiv Detail & Related papers (2024-12-19T09:14:24Z) - Jailbreak Attacks and Defenses Against Large Language Models: A Survey [22.392989536664288]
Large Language Models (LLMs) have performed exceptionally in various text-generative tasks.
"jailbreaking" induces the model to generate malicious responses against the usage policy and society.
We propose a comprehensive and detailed taxonomy of jailbreak attack and defense methods.
arXiv Detail & Related papers (2024-07-05T06:57:30Z) - Demarked: A Strategy for Enhanced Abusive Speech Moderation through Counterspeech, Detoxification, and Message Management [71.99446449877038]
We propose a more comprehensive approach called Demarcation scoring abusive speech based on four aspect -- (i) severity scale; (ii) presence of a target; (iii) context scale; (iv) legal scale.
Our work aims to inform future strategies for effectively addressing abusive speech online.
arXiv Detail & Related papers (2024-06-27T21:45:33Z) - Jailbreaking Large Language Models Through Alignment Vulnerabilities in Out-of-Distribution Settings [57.136748215262884]
We introduce ObscurePrompt for jailbreaking LLMs, inspired by the observed fragile alignments in Out-of-Distribution (OOD) data.<n>We first formulate the decision boundary in the jailbreaking process and then explore how obscure text affects LLM's ethical decision boundary.<n>Our approach substantially improves upon previous methods in terms of attack effectiveness, maintaining efficacy against two prevalent defense mechanisms.
arXiv Detail & Related papers (2024-06-19T16:09:58Z) - Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks [55.603893267803265]
Large Language Models (LLMs) are susceptible to Jailbreaking attacks.
Jailbreaking attacks aim to extract harmful information by subtly modifying the attack query.
We focus on a new attack form, called Contextual Interaction Attack.
arXiv Detail & Related papers (2024-02-14T13:45:19Z) - Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations [38.437893814759086]
Large Language Models (LLMs) have shown remarkable success in various tasks, yet their safety and the risk of generating harmful content remain pressing concerns.
We propose the In-Context Attack (ICA) which employs harmful demonstrations to subvert LLMs, and the In-Context Defense (ICD) which bolsters model resilience through examples that demonstrate refusal to produce harmful responses.
arXiv Detail & Related papers (2023-10-10T07:50:29Z) - Social Media Influence Operations [0.0]
This article reviews developments at the intersection of Large Language Models (LLMs) and influence operations.
LLMs are able to generate targeted and persuasive text which is for the most part indistinguishable from human-written content.
mitigation measures for the near future are highlighted.
arXiv Detail & Related papers (2023-09-07T12:18:07Z) - Universal and Transferable Adversarial Attacks on Aligned Language
Models [118.41733208825278]
We propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors.
Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable.
arXiv Detail & Related papers (2023-07-27T17:49:12Z) - Countering Malicious Content Moderation Evasion in Online Social
Networks: Simulation and Detection of Word Camouflage [64.78260098263489]
Twisting and camouflaging keywords are among the most used techniques to evade platform content moderation systems.
This article contributes significantly to countering malicious information by developing multilingual tools to simulate and detect new methods of evasion of content.
arXiv Detail & Related papers (2022-12-27T16:08:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.