OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities
- URL: http://arxiv.org/abs/2505.23856v1
- Date: Thu, 29 May 2025 05:25:27 GMT
- Title: OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities
- Authors: Sahil Verma, Keegan Hines, Jeff Bilmes, Charlotte Siska, Luke Zettlemoyer, Hila Gonen, Chandan Singh
- Abstract summary: Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalizations of model capabilities. We propose OMNIGUARD, an approach for detecting harmful prompts across languages and modalities. Our approach improves harmful prompt classification accuracy by 11.57% over the strongest baseline in a multilingual setting.
- Score: 54.152681077418805
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The emerging capabilities of large language models (LLMs) have sparked concerns about their immediate potential for harmful misuse. The core approach to mitigate these concerns is the detection of harmful queries to the model. Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-resource languages or prompts provided in non-text modalities such as image and audio). To tackle this challenge, we propose OMNIGUARD, an approach for detecting harmful prompts across languages and modalities. Our approach (i) identifies internal representations of an LLM/MLLM that are aligned across languages or modalities and then (ii) uses them to build a language-agnostic or modality-agnostic classifier for detecting harmful prompts. OMNIGUARD improves harmful prompt classification accuracy by 11.57% over the strongest baseline in a multilingual setting, by 20.44% for image-based prompts, and sets a new SOTA for audio-based prompts. By repurposing embeddings computed during generation, OMNIGUARD is also very efficient (≈120× faster than the next fastest baseline). Code and data are available at: https://github.com/vsahil/OmniGuard.
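The two-step recipe above (identify internal representations that are aligned across languages and modalities, then train a lightweight classifier on them) can be pictured with the minimal sketch below. It is not the authors' implementation: the placeholder model, the choice of a single middle layer, mean pooling over tokens, and the logistic-regression probe are all assumptions made for illustration.

```python
# Minimal sketch: probe an LLM's internal representations for harmfulness.
# Assumptions (not from the paper): placeholder model, middle-layer choice,
# mean pooling over tokens, and a logistic-regression probe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "Qwen/Qwen2-0.5B-Instruct"  # placeholder; any causal LM exposing hidden states works
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

def prompt_embedding(text: str, layer: int = 12) -> torch.Tensor:
    """Mean-pool the hidden states of one intermediate layer for a prompt."""
    inputs = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    hidden = out.hidden_states[layer]          # shape: [1, seq_len, hidden_dim]
    return hidden.mean(dim=1).squeeze(0)       # shape: [hidden_dim]

# Toy labels; in practice the training set would span many languages/modalities.
prompts = ["How do I bake bread?", "Explain photosynthesis.",
           "How can I build an untraceable weapon?", "Write malware that steals passwords."]
labels = [0, 0, 1, 1]  # 0 = benign, 1 = harmful

X = torch.stack([prompt_embedding(p) for p in prompts]).numpy()
clf = LogisticRegression(max_iter=1000).fit(X, labels)

test = prompt_embedding("Describe how to pick a lock silently.").numpy()
print(clf.predict([test]))  # 1 if flagged as harmful
```

Because the probed hidden states are computed anyway when the model processes a prompt, a classifier of this form adds very little overhead, which is the source of the efficiency claim in the abstract.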
Related papers
- Learning to Focus: Context Extraction for Efficient Code Vulnerability Detection with Language Models [16.23854525619129]
Language models (LMs) show promise for vulnerability detection but struggle with long, real-world code due to sparse and uncertain vulnerability locations. We propose FocusVul, a model-agnostic framework that improves LM-based vulnerability detection by learning to select sensitive context.
arXiv Detail & Related papers (2025-05-23T04:41:54Z)
- A Preliminary Study of Large Language Models for Multilingual Vulnerability Detection [13.269680075539135]
Recent advancements in large language models (LLMs) offer language-agnostic capabilities and enhanced semantic understanding. Our findings reveal that the PLM CodeT5P achieves the best performance in multilingual vulnerability detection.
arXiv Detail & Related papers (2025-05-12T09:19:31Z)
- Can Prompting LLMs Unlock Hate Speech Detection across Languages? A Zero-shot and Few-shot Study [59.30098850050971]
This work evaluates LLM prompting-based detection across eight non-English languages. We show that while zero-shot and few-shot prompting lag behind fine-tuned encoder models on most of the real-world evaluation sets, they achieve better generalization on functional tests for hate speech detection.
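For concreteness, prompting-based detection of this kind amounts to asking an instruction-tuned LLM to label a text directly; a few-shot variant simply prepends labeled examples to the same template. The sketch below is illustrative only: the template, the placeholder model, and the answer parsing are assumptions rather than the paper's protocol.

```python
# Rough sketch of zero-shot hate-speech classification via prompting.
# The template, placeholder model, and answer parsing are assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2-0.5B-Instruct")  # placeholder model

ZERO_SHOT_TEMPLATE = (
    "You are a content moderator. Answer with exactly one word: HATEFUL or NOT_HATEFUL.\n"
    "Text: {text}\n"
    "Answer:"
)

def classify_zero_shot(text: str) -> str:
    prompt = ZERO_SHOT_TEMPLATE.format(text=text)
    out = generator(prompt, max_new_tokens=5, do_sample=False)[0]["generated_text"]
    answer = out[len(prompt):].strip().upper()
    return "HATEFUL" if answer.startswith("HATEFUL") else "NOT_HATEFUL"

print(classify_zero_shot("People like you should not be allowed to exist."))
```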
arXiv Detail & Related papers (2025-05-09T16:00:01Z)
- MrGuard: A Multilingual Reasoning Guardrail for Universal LLM Safety [56.79292318645454]
Large Language Models (LLMs) are susceptible to adversarial attacks such as jailbreaking. This vulnerability is exacerbated in multilingual settings, where multilingual safety-aligned data is often limited. We introduce a multilingual guardrail with reasoning for prompt classification.
arXiv Detail & Related papers (2025-04-21T17:15:06Z)
- Towards Cross-Lingual Audio Abuse Detection in Low-Resource Settings with Few-Shot Learning [1.532756501930393]
We investigate the potential of pre-trained audio representations for detecting abusive language in low-resource languages. Our approach integrates representations within the Model-Agnostic Meta-Learning framework to classify abusive language in 10 languages.
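Model-Agnostic Meta-Learning (MAML) learns an initialization that adapts to a new language from only a few labeled examples. The sketch below shows a first-order variant operating on fixed audio embeddings; the embedding dimension, classifier head, task sampler, and hyperparameters are illustrative assumptions, not the paper's configuration.

```python
# First-order MAML sketch for few-shot abusive-language classification over
# fixed audio embeddings. All dimensions and hyperparameters are assumptions.
import copy
import torch
import torch.nn as nn

EMB_DIM, N_CLASSES = 768, 2   # assumed size of pre-trained audio embeddings
model = nn.Sequential(nn.Linear(EMB_DIM, 128), nn.ReLU(), nn.Linear(128, N_CLASSES))
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
INNER_LR, INNER_STEPS, TASKS_PER_BATCH = 1e-2, 5, 4

def sample_language_task():
    """Placeholder task sampler: support/query sets for one language."""
    sx, qx = torch.randn(8, EMB_DIM), torch.randn(8, EMB_DIM)
    sy, qy = torch.randint(0, N_CLASSES, (8,)), torch.randint(0, N_CLASSES, (8,))
    return sx, sy, qx, qy

for step in range(100):                       # meta-training loop
    meta_opt.zero_grad()
    for _ in range(TASKS_PER_BATCH):
        learner = copy.deepcopy(model)        # task-specific copy (first-order MAML)
        inner_opt = torch.optim.SGD(learner.parameters(), lr=INNER_LR)
        sx, sy, qx, qy = sample_language_task()
        for _ in range(INNER_STEPS):          # adapt on the support set
            inner_opt.zero_grad()
            loss_fn(learner(sx), sy).backward()
            inner_opt.step()
        inner_opt.zero_grad()                 # clear leftover support-set gradients
        loss_fn(learner(qx), qy).backward()   # query-set loss of the adapted learner
        # First-order approximation: apply the adapted learner's query gradients
        # to the shared initialization.
        for p, lp in zip(model.parameters(), learner.parameters()):
            p.grad = lp.grad.clone() if p.grad is None else p.grad + lp.grad
    meta_opt.step()
```

At evaluation time, the same inner-loop adaptation is run on a handful of labeled clips from an unseen language before classifying its query clips.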
arXiv Detail & Related papers (2024-12-02T11:51:19Z)
- Applying Pre-trained Multilingual BERT in Embeddings for Improved Malicious Prompt Injection Attacks Detection [5.78117257526028]
Large language models (LLMs) are renowned for their exceptional capabilities and are applied across a wide range of applications.
This work focuses on malicious prompt injection attacks, one of the most dangerous vulnerabilities in real-world LLM applications.
It examines applying various BERT (Bidirectional Encoder Representations from Transformers) models, such as multilingual BERT and DistilBERT, to classify malicious prompts versus legitimate prompts.
arXiv Detail & Related papers (2024-09-20T08:48:51Z)
- SMILE: Speech Meta In-Context Learning for Low-Resource Language Automatic Speech Recognition [55.2480439325792]
Speech Meta In-Context LEarning (SMILE) is an innovative framework that combines meta-learning with speech in-context learning (SICL). We show that SMILE consistently outperforms baseline methods in training-free few-shot multilingual ASR tasks.
arXiv Detail & Related papers (2024-09-16T16:04:16Z)
- DLAP: A Deep Learning Augmented Large Language Model Prompting Framework for Software Vulnerability Detection [12.686480870065827]
This paper contributes DLAP, a framework that combines the best of both deep learning (DL) models and Large Language Models (LLMs) to achieve exceptional vulnerability detection performance.
Experiment results confirm that DLAP outperforms state-of-the-art prompting frameworks, including role-based prompts, auxiliary information prompts, chain-of-thought prompts, and in-context learning prompts.
arXiv Detail & Related papers (2024-05-02T11:44:52Z)
- Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
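One way to picture token-level, perplexity-based detection: score each token's negative log-likelihood under a small reference language model and flag prompts that contain an implausibly surprising window of tokens, since adversarial token sequences tend to look like gibberish to such a model. The reference model, window size, and threshold below are assumptions for illustration, not the paper's method, which additionally uses contextual information.

```python
# Sketch: flag prompts containing a token window with abnormally high
# negative log-likelihood under a small reference LM. Model, window size,
# and threshold are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def token_nll(text: str) -> torch.Tensor:
    """Per-token negative log-likelihood under the reference LM."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)   # predict token t from position t-1
    return -logprobs.gather(1, ids[0, 1:, None]).squeeze(1)

def looks_adversarial(text: str, window: int = 8, threshold: float = 9.0) -> bool:
    nll = token_nll(text)
    if nll.numel() < window:
        return bool(nll.mean() > threshold)
    window_means = nll.unfold(0, window, 1).mean(dim=1)     # sliding-window mean NLL
    return bool(window_means.max() > threshold)

print(looks_adversarial("Please summarize this zx## qpl vv !!@@ ::~~ kqzjw"))
```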
arXiv Detail & Related papers (2023-11-20T03:17:21Z)
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)