LlavaGuard: VLM-based Safeguards for Vision Dataset Curation and Safety Assessment
- URL: http://arxiv.org/abs/2406.05113v1
- Date: Fri, 7 Jun 2024 17:44:32 GMT
- Title: LlavaGuard: VLM-based Safeguards for Vision Dataset Curation and Safety Assessment
- Authors: Lukas Helff, Felix Friedrich, Manuel Brack, Kristian Kersting, Patrick Schramowski
- Abstract summary: We introduce LlavaGuard, a family of VLM-based safeguard models.
LlavaGuard offers a versatile framework for evaluating the safety compliance of visual content.
Our experiments highlight the capabilities of LlavaGuard in complex and real-world applications.
- Score: 26.148022772521493
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce LlavaGuard, a family of VLM-based safeguard models, offering a versatile framework for evaluating the safety compliance of visual content. Specifically, we designed LlavaGuard for dataset annotation and generative model safeguarding. To this end, we collected and annotated a high-quality visual dataset incorporating a broad safety taxonomy, which we use to tune VLMs on context-aware safety risks. As a key innovation, LlavaGuard's responses contain comprehensive information, including a safety rating, the violated safety categories, and an in-depth rationale. Further, our introduced customizable taxonomy categories enable the context-specific alignment of LlavaGuard to various scenarios. Our experiments highlight the capabilities of LlavaGuard in complex and real-world applications. We provide checkpoints ranging from 7B to 34B parameters demonstrating state-of-the-art performance, with even the smallest models outperforming baselines like GPT-4. We make our dataset and model weights publicly available and invite further research to address the diverse needs of communities and contexts.
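The abstract describes LlavaGuard returning a structured assessment (a safety rating, the violated category, and a rationale) for an image judged against a customizable safety taxonomy. Below is a minimal sketch of how such a checkpoint might be queried with Hugging Face transformers; the checkpoint id, policy wording, prompt template, and JSON field names ("rating", "category", "rationale") are illustrative assumptions, not the paper's official interface.

```python
# Minimal sketch (assumed interface, not the official LlavaGuard API):
# query a LLaVA-style safeguard checkpoint with an image and a safety policy,
# then parse a structured JSON assessment.
import json

from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "AIML-TUDA/LlavaGuard-7B"  # hypothetical checkpoint name

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")

# Customizable safety taxonomy, abbreviated; category names are placeholders.
policy = (
    "Assess the image against the policy categories below. Respond as JSON with "
    "the fields 'rating' (Safe/Unsafe), 'category', and 'rationale'.\n"
    "O1: Hate & Harassment\nO2: Violence\nO3: Sexually Explicit Content\n"
)

image = Image.open("example.jpg")
prompt = f"USER: <image>\n{policy} ASSISTANT:"  # LLaVA-style prompt format (assumed)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
reply = processor.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# Assumes the model replies with JSON, e.g. {"rating": "Unsafe", "category": "O2", ...}
assessment = json.loads(reply)
print(assessment["rating"], assessment["category"])
print(assessment["rationale"])
```

Because the taxonomy lives in the prompt rather than the weights, such a setup could be re-aligned to a different deployment context by editing the category list instead of retraining, which is the kind of context-specific alignment the abstract attributes to LlavaGuard's customizable taxonomy.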
Related papers
- SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models [75.67623347512368]
We propose SafeBench, a comprehensive framework designed for conducting safety evaluations of MLLMs.
Our framework consists of a comprehensive harmful query dataset and an automated evaluation protocol.
Based on our framework, we conducted large-scale experiments on 15 widely-used open-source MLLMs and 6 commercial MLLMs.
arXiv Detail & Related papers (2024-10-24T17:14:40Z) - ShieldGemma: Generative AI Content Moderation Based on Gemma [49.91147965876678]
ShieldGemma is a suite of safety content moderation models built upon Gemma2.
Models provide robust, state-of-the-art predictions of safety risks across key harm types.
arXiv Detail & Related papers (2024-07-31T17:48:14Z) - SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model [77.86593720792986]
We propose a Safety Preference Alignment dataset for Vision Language Models named SPA-VL.
In terms of breadth, SPA-VL covers 6 harmfulness domains, 13 categories, and 53 subcategories, and contains 100,788 samples of the quadruple (question, image, chosen response, rejected response).
The experimental results indicate that models trained with alignment techniques on the SPA-VL dataset exhibit substantial improvements in harmlessness and helpfulness while maintaining core capabilities.
arXiv Detail & Related papers (2024-06-17T18:57:37Z) - Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models [65.06446825020578]
Safety alignment is crucial to ensure that large language models (LLMs) behave in ways that align with human preferences and prevent harmful actions during inference.
We aim to measure the risks in finetuning LLMs by navigating the LLM safety landscape.
arXiv Detail & Related papers (2024-05-27T17:31:56Z) - Safety Alignment for Vision Language Models [21.441662865727448]
We enhance the visual modality safety alignment of Vision Language Models (VLMs) by adding safety modules.
Our method boasts ease of use, high flexibility, and strong controllability, and it enhances safety while having minimal impact on the model's general performance.
arXiv Detail & Related papers (2024-05-22T12:21:27Z) - AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts [0.0]
As Large Language Models (LLMs) and generative AI become more widespread, the content safety risks associated with their use also increase.
We find a notable deficiency in high-quality content safety datasets and benchmarks that comprehensively cover a wide range of critical safety areas.
To address this, we define a broad content safety risk taxonomy, comprising 13 critical risk and 9 sparse risk categories.
arXiv Detail & Related papers (2024-04-09T03:54:28Z) - SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models [107.82336341926134]
SALAD-Bench is a safety benchmark specifically designed for evaluating Large Language Models (LLMs).
It transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities.
arXiv Detail & Related papers (2024-02-07T17:33:54Z) - Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models [39.56233272612982]
Current vision large language models (VLLMs) exhibit remarkable capabilities yet are prone to generate harmful content and are vulnerable to jailbreaking attacks.
Our initial analysis finds that this is due to the presence of harmful data during vision-language instruction fine-tuning.
To address this issue, we first curate a vision-language safe instruction-following dataset VLGuard covering various harmful categories.
arXiv Detail & Related papers (2024-02-03T16:43:42Z) - Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations [29.32704733570445]
We introduce Llama Guard, an input-output safeguard model geared towards Human-AI conversation use cases.
Llama Guard incorporates a safety risk taxonomy, a valuable tool for categorizing a specific set of safety risks.
Llama Guard demonstrates strong performance on existing benchmarks such as the OpenAI Moderation Evaluation dataset and ToxicChat.
arXiv Detail & Related papers (2023-12-07T19:40:50Z) - Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs [59.596335292426105]
This paper collects the first open-source dataset to evaluate safeguards in large language models.
We train several BERT-like classifiers to achieve results comparable with GPT-4 on automatic safety evaluation.
arXiv Detail & Related papers (2023-08-25T14:02:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.