Applying Pre-trained Multilingual BERT in Embeddings for Improved Malicious Prompt Injection Attacks Detection
- URL: http://arxiv.org/abs/2409.13331v1
- Date: Fri, 20 Sep 2024 08:48:51 GMT
- Title: Applying Pre-trained Multilingual BERT in Embeddings for Improved Malicious Prompt Injection Attacks Detection
- Authors: Md Abdur Rahman, Hossain Shahriar, Fan Wu, Alfredo Cuzzocrea,
- Abstract summary: Large language models (LLMs) are renowned for their exceptional capabilities, and applying to a wide range of applications.
This work focuses the impact of malicious prompt injection attacks which is one of most dangerous vulnerability on real LLMs applications.
It examines to apply various BERT (Bidirectional Representations from Transformers) like multilingual BERT, DistilBert for classifying malicious prompts from legitimate prompts.
- Score: 5.78117257526028
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are renowned for their exceptional capabilities, and applying to a wide range of applications. However, this widespread use brings significant vulnerabilities. Also, it is well observed that there are huge gap which lies in the need for effective detection and mitigation strategies against malicious prompt injection attacks in large language models, as current approaches may not adequately address the complexity and evolving nature of these vulnerabilities in real-world applications. Therefore, this work focuses the impact of malicious prompt injection attacks which is one of most dangerous vulnerability on real LLMs applications. It examines to apply various BERT (Bidirectional Encoder Representations from Transformers) like multilingual BERT, DistilBert for classifying malicious prompts from legitimate prompts. Also, we observed how tokenizing the prompt texts and generating embeddings using multilingual BERT contributes to improve the performance of various machine learning methods: Gaussian Naive Bayes, Random Forest, Support Vector Machine, and Logistic Regression. The performance of each model is rigorously analyzed with various parameters to improve the binary classification to discover malicious prompts. Multilingual BERT approach to embed the prompts significantly improved and outperformed the existing works and achieves an outstanding accuracy of 96.55% by Logistic regression. Additionally, we investigated the incorrect predictions of the model to gain insights into its limitations. The findings can guide researchers in tuning various BERT for finding the most suitable model for diverse LLMs vulnerabilities.
Related papers
- Embedding-based classifiers can detect prompt injection attacks [5.820776057182452]
Large Language Models (LLMs) are vulnerable to adversarial attacks, particularly prompt injection attacks.
We propose a novel approach based on embedding-based Machine Learning (ML) classifiers to protect LLM-based applications against this severe threat.
arXiv Detail & Related papers (2024-10-29T17:36:59Z) - Palisade -- Prompt Injection Detection Framework [0.9620910657090188]
Large Language Models are vulnerable to malicious prompt injection attacks.
This paper proposes a novel NLP based approach for prompt injection detection.
It emphasizes accuracy and optimization through a layered input screening process.
arXiv Detail & Related papers (2024-10-28T15:47:03Z) - Fine-tuned Large Language Models (LLMs): Improved Prompt Injection Attacks Detection [6.269725911814401]
Large language models (LLMs) are becoming a popular tool as they have significantly advanced in their capability to tackle a wide range of language-based tasks.
However, LLMs applications are highly vulnerable to prompt injection attacks, which poses a critical problem.
This project explores the security vulnerabilities in relation to prompt injection attacks.
arXiv Detail & Related papers (2024-10-28T00:36:21Z) - On the Worst Prompt Performance of Large Language Models [93.13542053835542]
Performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts.
We introduce RobustAlpacaEval, a new benchmark that consists of semantically equivalent case-level queries.
Experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover substantial variability in model performance.
arXiv Detail & Related papers (2024-06-08T13:40:38Z) - DLAP: A Deep Learning Augmented Large Language Model Prompting Framework for Software Vulnerability Detection [12.686480870065827]
This paper contributes textbfDLAP, a framework that combines the best of both deep learning (DL) models and Large Language Models (LLMs) to achieve exceptional vulnerability detection performance.
Experiment results confirm that DLAP outperforms state-of-the-art prompting frameworks, including role-based prompts, auxiliary information prompts, chain-of-thought prompts, and in-context learning prompts.
arXiv Detail & Related papers (2024-05-02T11:44:52Z) - Token-Level Adversarial Prompt Detection Based on Perplexity Measures
and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z) - Defending Pre-trained Language Models as Few-shot Learners against
Backdoor Attacks [72.03945355787776]
We advocate MDP, a lightweight, pluggable, and effective defense for PLMs as few-shot learners.
We show analytically that MDP creates an interesting dilemma for the attacker to choose between attack effectiveness and detection evasiveness.
arXiv Detail & Related papers (2023-09-23T04:41:55Z) - COVER: A Heuristic Greedy Adversarial Attack on Prompt-based Learning in
Language Models [4.776465250559034]
We propose a prompt-based adversarial attack on manual templates in black box scenarios.
First of all, we design character-level and word-level approaches to break manual templates separately.
And we present a greedy algorithm for the attack based on the above destructive approaches.
arXiv Detail & Related papers (2023-06-09T03:53:42Z) - RLPrompt: Optimizing Discrete Text Prompts With Reinforcement Learning [84.75064077323098]
This paper proposes RLPrompt, an efficient discrete prompt optimization approach with reinforcement learning (RL)
RLPrompt is flexibly applicable to different types of LMs, such as masked gibberish (e.g., grammaBERT) and left-to-right models (e.g., GPTs)
Experiments on few-shot classification and unsupervised text style transfer show superior performance over a wide range of existing finetuning or prompting methods.
arXiv Detail & Related papers (2022-05-25T07:50:31Z) - Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of
Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z) - BERT-ATTACK: Adversarial Attack Against BERT Using BERT [77.82947768158132]
Adrial attacks for discrete data (such as texts) are more challenging than continuous data (such as images)
We propose textbfBERT-Attack, a high-quality and effective method to generate adversarial samples.
Our method outperforms state-of-the-art attack strategies in both success rate and perturb percentage.
arXiv Detail & Related papers (2020-04-21T13:30:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.