Related papers: FLAME: Flexible LLM-Assisted Moderation Engine

FLAME: Flexible LLM-Assisted Moderation Engine

URL: http://arxiv.org/abs/2502.09175v1
Date: Thu, 13 Feb 2025 11:05:55 GMT
Title: FLAME: Flexible LLM-Assisted Moderation Engine
Authors: Ivan Bakulin, Ilia Kopanichuk, Iaroslav Bespalov, Nikita Radchenko, Vladimir Shaposhnikov, Dmitry Dylov, Ivan Oseledets,
Abstract summary: We introduce Flexible LLM-Assisted Moderation Engine (FLAME)<n>Unlike traditional circuit-breaking methods that analyze user queries, FLAME evaluates model responses.<n>Our experiments demonstrate that FLAME significantly outperforms current moderation systems.
Score: 2.966082563853265
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid advancement of Large Language Models (LLMs) has introduced significant challenges in moderating user-model interactions. While LLMs demonstrate remarkable capabilities, they remain vulnerable to adversarial attacks, particularly ``jailbreaking'' techniques that bypass content safety measures. Current content moderation systems, which primarily rely on input prompt filtering, have proven insufficient, with techniques like Best-of-N (BoN) jailbreaking achieving success rates of 80% or more against popular LLMs. In this paper, we introduce Flexible LLM-Assisted Moderation Engine (FLAME): a new approach that shifts the focus from input filtering to output moderation. Unlike traditional circuit-breaking methods that analyze user queries, FLAME evaluates model responses, offering several key advantages: (1) computational efficiency in both training and inference, (2) enhanced resistance to BoN jailbreaking attacks, and (3) flexibility in defining and updating safety criteria through customizable topic filtering. Our experiments demonstrate that FLAME significantly outperforms current moderation systems. For example, FLAME reduces attack success rate in GPT-4o-mini and DeepSeek-v3 by a factor of ~9, while maintaining low computational overhead. We provide comprehensive evaluation on various LLMs and analyze the engine's efficiency against the state-of-the-art jailbreaking. This work contributes to the development of more robust and adaptable content moderation systems for LLMs.

Related papers

ICLShield: Exploring and Mitigating In-Context Learning Backdoor Attacks [61.06621533874629]
In-context learning (ICL) has demonstrated remarkable success in large language models (LLMs)<n>In this paper, we propose, for the first time, the dual-learning hypothesis, which posits that LLMs simultaneously learn both the task-relevant latent concepts and backdoor latent concepts.<n>Motivated by these findings, we propose ICLShield, a defense mechanism that dynamically adjusts the concept preference ratio.
arXiv Detail & Related papers (2025-07-02T03:09:20Z)
Efficient and Stealthy Jailbreak Attacks via Adversarial Prompt Distillation from LLMs to SLMs [15.640342726041732]
Attacks on large language models (LLMs) in jailbreaking scenarios raise many security and ethical issues.<n>Current jailbreak attack methods face problems such as low efficiency, high computational cost, and poor cross-model adaptability.<n>Our work proposes an Adversarial Prompt Distillation, which combines masked language modeling, reinforcement learning, and dynamic temperature control.
arXiv Detail & Related papers (2025-05-26T08:27:51Z)
Helping Large Language Models Protect Themselves: An Enhanced Filtering and Summarization System [2.0257616108612373]
Large Language Models are vulnerable to adversarial assaults, manipulative prompts, and encoded malicious inputs.<n>This study presents a unique defense paradigm that allows LLMs to recognize, filter, and defend against adversarial or malicious inputs on their own.
arXiv Detail & Related papers (2025-05-02T14:42:26Z)
Prefill-Based Jailbreak: A Novel Approach of Bypassing LLM Safety Boundary [2.4329261266984346]
Large Language Models (LLMs) are designed to generate helpful and safe content. adversarial attacks, commonly referred to as jailbreak, can bypass their safety protocols. We introduce a novel jailbreak attack method that leverages the prefilling feature of LLMs.
arXiv Detail & Related papers (2025-04-28T07:38:43Z)
Adversarial Reasoning at Jailbreaking Time [49.70772424278124]
We develop an adversarial reasoning approach to automatic jailbreaking via test-time computation.<n>Our approach introduces a new paradigm in understanding LLM vulnerabilities, laying the foundation for the development of more robust and trustworthy AI systems.
arXiv Detail & Related papers (2025-02-03T18:59:01Z)
You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense [34.023473699165315]
We study the utility degradation, safety elevation, and exaggerated-safety escalation of LLMs with jailbreak defense strategies. We find that mainstream jailbreak defenses fail to ensure both safety and performance simultaneously.
arXiv Detail & Related papers (2025-01-21T15:24:29Z)
LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models [59.29840790102413]
Existing jailbreak attacks are primarily based on opaque optimization techniques and gradient search methods.<n>We propose LLM-Virus, a jailbreak attack method based on evolutionary algorithm, termed evolutionary jailbreak.<n>Our results show that LLM-Virus achieves competitive or even superior performance compared to existing attack methods.
arXiv Detail & Related papers (2024-12-28T07:48:57Z)
Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level [10.476222570886483]
Large language models (LLMs) have demonstrated immense utility across various industries. As LLMs advance, the risk of harmful outputs increases due to incorrect or malicious instruction prompts. This paper examines the LLMs' capability to recognize harmful outputs, revealing and quantifying their proficiency in assessing the danger of previous tokens.
arXiv Detail & Related papers (2024-10-09T12:09:30Z)
Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models [8.024771725860127]
Jailbreak attacks manipulate large language models into generating harmful content.<n>Jailbreak Antidote enables real-time adjustment of safety preferences by manipulating a sparse subset of the model's internal states.<n>Our analysis reveals that safety-related information in LLMs is sparsely distributed.
arXiv Detail & Related papers (2024-10-03T08:34:17Z)
Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing [63.20133320524577]
We show that editing a small subset of parameters can effectively modulate specific behaviors of large language models (LLMs)<n>Our approach achieves reductions of up to 90.0% in toxicity on the RealToxicityPrompts dataset and 49.2% on ToxiGen.
arXiv Detail & Related papers (2024-07-11T17:52:03Z)
Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation [15.928341917085467]
JailMine employs an automated "mining" process to elicit malicious responses from large language models. We demonstrate JailMine's effectiveness and efficiency, achieving a significant average reduction of 86% in time consumed.
arXiv Detail & Related papers (2024-05-20T17:17:55Z)
Advancing the Robustness of Large Language Models through Self-Denoised Smoothing [50.54276872204319]
Large language models (LLMs) have achieved significant success, but their vulnerability to adversarial perturbations has raised considerable concerns. We propose to leverage the multitasking nature of LLMs to first denoise the noisy inputs and then to make predictions based on these denoised versions. Unlike previous denoised smoothing techniques in computer vision, which require training a separate model to enhance the robustness of LLMs, our method offers significantly better efficiency and flexibility.
arXiv Detail & Related papers (2024-04-18T15:47:00Z)
FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent achieving remarkable success in language understanding and generation. To mitigate overload incurred during generation, several early-exit and layer-dropping strategies have been proposed. We propose FFN-SkipLLM, which is an input-adaptive feed-forward skipping strategy.
arXiv Detail & Related papers (2024-04-05T02:35:43Z)
RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content [62.685566387625975]
Current mitigation strategies, while effective, are not resilient under adversarial attacks. This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently moderate harmful and unsafe inputs.
arXiv Detail & Related papers (2024-03-19T07:25:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.