SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model
- URL: http://arxiv.org/abs/2406.12030v1
- Date: Mon, 17 Jun 2024 18:57:37 GMT
- Title: SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model
- Authors: Yongting Zhang, Lu Chen, Guodong Zheng, Yifeng Gao, Rui Zheng, Jinlan Fu, Zhenfei Yin, Senjie Jin, Yu Qiao, Xuanjing Huang, Feng Zhao, Tao Gui, Jing Shao
- Abstract summary: We propose a Safety Preference Alignment dataset for Vision Language Models named SPA-VL.
In terms of breadth, SPA-VL covers 6 harmfulness domains, 13 categories, and 53 subcategories, and contains 100,788 samples of the quadruple (question, image, chosen response, rejected response).
The experimental results indicate that models trained with alignment techniques on the SPA-VL dataset exhibit substantial improvements in harmlessness and helpfulness while maintaining core capabilities.
- Score: 77.86593720792986
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The emergence of Vision Language Models (VLMs) has brought unprecedented advances in understanding multimodal information. The combination of textual and visual semantics in VLMs is highly complex and diverse, making the safety alignment of these models challenging. Furthermore, due to the limited study on the safety alignment of VLMs, there is a lack of large-scale, high-quality datasets. To address these limitations, we propose a Safety Preference Alignment dataset for Vision Language Models named SPA-VL. In terms of breadth, SPA-VL covers 6 harmfulness domains, 13 categories, and 53 subcategories, and contains 100,788 samples of the quadruple (question, image, chosen response, rejected response). In terms of depth, the responses are collected from 12 open-source (e.g., QwenVL) and closed-source (e.g., Gemini) VLMs to ensure diversity. The experimental results indicate that models trained with alignment techniques on the SPA-VL dataset exhibit substantial improvements in harmlessness and helpfulness while maintaining core capabilities. SPA-VL, as a large-scale, high-quality, and diverse dataset, represents a significant milestone in ensuring that VLMs achieve both harmlessness and helpfulness. We have made our code (https://github.com/EchoseChen/SPA-VL-RLHF) and the SPA-VL dataset (https://huggingface.co/datasets/sqrti/SPA-VL) publicly available.
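As a rough illustration of how the quadruple format might be consumed, the sketch below loads SPA-VL from its Hugging Face Hub URL with the `datasets` library; the split name and record field names are assumptions rather than details taken from the paper, so they should be checked against the dataset card before use.

```python
# Minimal sketch (not from the paper): load SPA-VL from the Hugging Face Hub
# and inspect one preference quadruple. The split name and the field names
# ("question", "chosen", "rejected") are assumptions about the schema, not
# documented facts; check https://huggingface.co/datasets/sqrti/SPA-VL
# for the actual layout.
from datasets import load_dataset

ds = load_dataset("sqrti/SPA-VL", split="train")  # split name is an assumption

sample = ds[0]
print(sample.keys())  # verify the real field names before relying on them
# Each record is expected to correspond to the quadruple described in the
# abstract: (question, image, chosen response, rejected response), i.e. a
# ready-made preference pair for DPO/RLHF-style safety alignment training.
```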
Related papers
- Concept-skill Transferability-based Data Selection for Large Vision-Language Models [56.0725292404808]
We introduce COINCIDE, an effective and scalable data selection technique for training vision-language models.
We cluster the training data using internal activations from a small model, which identifies concept-skill compositions needed by a target LVLM.
Experiments demonstrate that COINCIDE achieves superior performance and data selection efficiency against 8 strong baselines.
arXiv Detail & Related papers (2024-06-16T16:15:20Z) - LLavaGuard: VLM-based Safeguards for Vision Dataset Curation and Safety Assessment [26.148022772521493]
We introduce LlavaGuard, a family of VLM-based safeguard models.
LlavaGuard offers a versatile framework for evaluating the safety compliance of visual content.
Our experiments highlight the capabilities of LlavaGuard in complex and real-world applications.
arXiv Detail & Related papers (2024-06-07T17:44:32Z) - What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases [87.65903426052155]
We perform a large-scale transfer learning experiment aimed at discovering latent vision-language skills from data.
We show that generation tasks suffer from a length bias, suggesting benchmarks should balance tasks with varying output lengths.
We present a new dataset, OLIVE, which simulates user instructions in the wild and presents challenges dissimilar to all datasets we tested.
arXiv Detail & Related papers (2024-04-03T02:40:35Z) - VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning [12.450293825734313]
Large language models (LLMs) famously exhibit emergent in-context learning (ICL).
This study introduces a benchmark VL-ICL Bench for multimodal in-context learning.
We evaluate the abilities of state-of-the-art VLLMs against this benchmark suite.
arXiv Detail & Related papers (2024-03-19T21:31:56Z) - AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Adversarial Visual-Instructions [52.9787902653558]
Large Vision-Language Models (LVLMs) have shown significant progress in responding well to visual-instructions from users.
Despite the critical importance of LVLMs' robustness against such threats, current research in this area remains limited.
We introduce AVIBench, a framework designed to analyze the robustness of LVLMs when facing various adversarial visual-instructions.
arXiv Detail & Related papers (2024-03-14T12:51:07Z) - Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models [39.56233272612982]
Current vision large language models (VLLMs) exhibit remarkable capabilities yet are prone to generate harmful content and are vulnerable to jailbreaking attacks.
Our initial analysis finds that this is due to the presence of harmful data during vision-language instruction fine-tuning.
To address this issue, we first curate a vision-language safe instruction-following dataset VLGuard covering various harmful categories.
arXiv Detail & Related papers (2024-02-03T16:43:42Z) - LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [58.82222646803248]
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals.
Most of the existing MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge.
We propose a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge in two levels.
arXiv Detail & Related papers (2023-11-20T15:56:44Z) - Vision-Language Instruction Tuning: A Review and Analysis [52.218690619616474]
Vision-Language Instruction Tuning (VLIT) presents more complex characteristics compared to pure text instruction tuning.
We offer a detailed categorization for existing VLIT datasets and identify the characteristics that high-quality VLIT data should possess.
By incorporating these characteristics as guiding principles into the existing VLIT data construction process, we conduct extensive experiments and verify their positive impact on the performance of tuned multi-modal LLMs.
arXiv Detail & Related papers (2023-11-14T14:02:32Z) - Teaching Structured Vision&Language Concepts to Vision&Language Models [46.344585368641006]
We introduce the collective notion of Structured Vision&Language Concepts (SVLC).
SVLC includes object attributes, relations, and states which are present in the text and visible in the image.
We propose a more elegant data-driven approach for enhancing VL models' understanding of SVLCs.
arXiv Detail & Related papers (2022-11-21T18:54:10Z)