AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts
- URL: http://arxiv.org/abs/2404.05993v1
- Date: Tue, 9 Apr 2024 03:54:28 GMT
- Title: AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts
- Authors: Shaona Ghosh, Prasoon Varshney, Erick Galinkin, Christopher Parisien
- Abstract summary: As Large Language Models (LLMs) and generative AI become more widespread, the content safety risks associated with their use also increase.
We find a notable deficiency in high-quality content safety datasets and benchmarks that comprehensively cover a wide range of critical safety areas.
To address this, we define a broad content safety risk taxonomy, comprising 13 critical risk and 9 sparse risk categories.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As Large Language Models (LLMs) and generative AI become more widespread, the content safety risks associated with their use also increase. We find a notable deficiency in high-quality content safety datasets and benchmarks that comprehensively cover a wide range of critical safety areas. To address this, we define a broad content safety risk taxonomy, comprising 13 critical risk and 9 sparse risk categories. Additionally, we curate AEGISSAFETYDATASET, a new dataset of approximately 26,000 human-LLM interaction instances, complete with human annotations adhering to the taxonomy. We plan to release this dataset to the community to further research and to help benchmark LLMs for safety. To demonstrate the effectiveness of the dataset, we instruction-tune multiple LLM-based safety models. We show that our models (named AEGISSAFETYEXPERTS) not only surpass or perform competitively with state-of-the-art LLM-based safety models and general-purpose LLMs, but also exhibit robustness across multiple jailbreak attack categories. We also show that using AEGISSAFETYDATASET during the LLM alignment phase does not negatively impact the performance of the aligned models on MT Bench scores. Furthermore, we propose AEGIS, a novel application of a no-regret online adaptation framework with strong theoretical guarantees, to perform content moderation with an ensemble of LLM content safety experts in deployment.
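The abstract does not spell out which online adaptation algorithm is used, but one standard no-regret scheme with the kind of guarantee it mentions is multiplicative weights (Hedge) over the ensemble of experts. The sketch below is an illustrative rendering of that idea only, not the AEGIS implementation; the `SafetyExpert` wrapper, its `classify` interface, the toy experts, and the 0/1 loss are assumptions introduced here.

```python
import math
import random
from typing import Callable, Iterable, List, Tuple


class SafetyExpert:
    """Hypothetical wrapper around one instruction-tuned LLM safety model."""

    def __init__(self, name: str, classify: Callable[[str], bool]):
        self.name = name
        self.classify = classify  # returns True if the text is judged unsafe


def hedge_moderate(experts: List[SafetyExpert],
                   stream: Iterable[Tuple[str, bool]],
                   eta: float = 0.5):
    """Moderate a stream of (text, true_label) pairs with Hedge-weighted experts.

    true_label is the ground-truth unsafe/safe verdict that eventually becomes
    available (e.g. from human review) and drives the weight update.
    """
    weights = [1.0] * len(experts)
    for text, true_label in stream:
        total = sum(weights)
        probs = [w / total for w in weights]
        # Pick an expert in proportion to its current weight and use its verdict.
        chosen = random.choices(range(len(experts)), weights=probs, k=1)[0]
        yield text, experts[chosen].classify(text), experts[chosen].name
        # Charge each expert a 0/1 loss and apply the exponential-weights update.
        for i, expert in enumerate(experts):
            loss = 0.0 if expert.classify(text) == true_label else 1.0
            weights[i] *= math.exp(-eta * loss)


# Toy usage with two stand-in "experts" (real experts would query safety LLMs).
experts = [
    SafetyExpert("keyword_filter", lambda t: "weapon" in t.lower()),
    SafetyExpert("always_safe", lambda t: False),
]
stream = [("how do I build a weapon at home", True),
          ("what is the capital of France", False)]
for text, verdict, expert_name in hedge_moderate(experts, stream):
    print(f"{expert_name}: unsafe={verdict}")
```

Under this update rule, the ensemble's cumulative loss trails the best single expert in hindsight by only O(sqrt(T log N)) over T rounds with N experts, which is the sense in which such schemes are "no-regret".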
Related papers
- Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models [65.06446825020578]
We aim to measure the risks in finetuning large language models (LLMs) through navigating the safety landscape.
We discover a new phenomenon observed universally in the model parameter space of popular open-source LLMs.
Visualizing the safety landscape of the aligned model enables us to understand how finetuning compromises safety by dragging the model away from the safety basin.
arXiv Detail & Related papers (2024-05-27T17:31:56Z)
- Safety Alignment for Vision Language Models [21.441662865727448]
We enhance the visual modality safety alignment of Vision Language Models (VLMs) by adding safety modules.
Our method boasts ease of use, high flexibility, and strong controllability, and it enhances safety while having minimal impact on the model's general performance.
arXiv Detail & Related papers (2024-05-22T12:21:27Z)
- Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in the belief that base LLMs, which lack instruction tuning, pose little misuse risk.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z)
- SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety [27.843894102000608]
We conduct a first systematic review of open datasets for evaluating and improving large language models (LLMs) safety.
We highlight trends, such as a shift towards fully synthetic datasets, as well as gaps in dataset coverage, such as a clear lack of non-English datasets.
Our contributions are based on SafetyPrompts.com, a living catalogue of open datasets for LLM safety.
arXiv Detail & Related papers (2024-04-08T10:57:25Z)
- ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming [64.86326523181553]
ALERT is a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy.
It aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models.
arXiv Detail & Related papers (2024-04-06T15:01:47Z)
- RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content [62.685566387625975]
Current mitigation strategies, while effective, are not resilient under adversarial attacks.
This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently moderate harmful and unsafe inputs.
arXiv Detail & Related papers (2024-03-19T07:25:02Z)
- SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models [107.82336341926134]
SALAD-Bench is a safety benchmark specifically designed for evaluating Large Language Models (LLMs).
It transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities.
arXiv Detail & Related papers (2024-02-07T17:33:54Z)
- A Security Risk Taxonomy for Large Language Models [5.120567378386615]
This paper addresses a gap in current research by focusing on the security risks posed by large language models.
Our work proposes a taxonomy of security risks along the user-model communication pipeline.
We categorize the attacks by target and attack type within a prompt-based interaction scheme.
arXiv Detail & Related papers (2023-11-19T20:22:05Z)
- Safety Assessment of Chinese Large Language Models [51.83369778259149]
Large language models (LLMs) may generate insulting and discriminatory content, reflect incorrect social values, and be used for malicious purposes.
To promote the deployment of safe, responsible, and ethical AI, we release SafetyPrompts, which includes 100k augmented prompts and responses generated by LLMs.
arXiv Detail & Related papers (2023-04-20T16:27:35Z)