Related papers: SAFETY-J: Evaluating Safety with Critique

URL: http://arxiv.org/abs/2407.17075v3
Date: Tue, 13 Aug 2024 10:59:17 GMT
Title: SAFETY-J: Evaluating Safety with Critique
Authors: Yixiu Liu, Yuxiang Zheng, Shijie Xia, Jiajun Li, Yi Tu, Chaoling Song, Pengfei Liu
Abstract summary: We introduce SAFETY-J, a bilingual generative safety evaluator for English and Chinese with critique-based judgment. We establish an automated meta-evaluation benchmark that objectively assesses the quality of critiques with minimal human intervention. Our evaluations demonstrate that SAFETY-J provides more nuanced and accurate safety evaluations, thereby enhancing both critique quality and predictive reliability in complex content scenarios.
Score: 24.723999605458832
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The deployment of Large Language Models (LLMs) in content generation raises significant safety concerns, particularly regarding the transparency and interpretability of content evaluations. Current methods, primarily focused on binary safety classifications, lack mechanisms for detailed critique, limiting their utility for model improvement and user trust. To address these limitations, we introduce SAFETY-J, a bilingual generative safety evaluator for English and Chinese with critique-based judgment. SAFETY-J utilizes a robust training dataset that includes diverse dialogues and augmented query-response pairs to assess safety across various scenarios comprehensively. We establish an automated meta-evaluation benchmark that objectively assesses the quality of critiques with minimal human intervention, facilitating scalable and continuous improvement. Additionally, SAFETY-J employs an iterative preference learning technique to dynamically refine safety assessments based on meta-evaluations and critiques. Our evaluations demonstrate that SAFETY-J provides more nuanced and accurate safety evaluations, thereby enhancing both critique quality and predictive reliability in complex content scenarios. To facilitate further research and application, we open-source SAFETY-J's training protocols, datasets, and code at https://github.com/GAIR-NLP/Safety-J.

Related papers

The Scales of Justitia: A Comprehensive Survey on Safety Evaluation of LLMs [42.57873562187369]
Large Language Models (LLMs) have demonstrated remarkable potential in the field of Natural Language Processing (NLP). LLMs have occasionally exhibited unsafe elements like toxicity and bias, particularly in adversarial scenarios. This survey aims to provide a comprehensive and systematic overview of recent advancements in LLM safety evaluation.
arXiv Detail & Related papers (2025-06-06T05:50:50Z)
REVAL: A Comprehension Evaluation on Reliability and Values of Large Vision-Language Models [59.445672459851274]
REVAL is a comprehensive benchmark designed to evaluate the REliability and VALue of Large Vision-Language Models. REVAL encompasses over 144K image-text Visual Question Answering (VQA) samples, structured into two primary sections: Reliability and Values. We evaluate 26 models, including mainstream open-source LVLMs and prominent closed-source models like GPT-4o and Gemini-1.5-Pro.
arXiv Detail & Related papers (2025-03-20T07:54:35Z)
Towards Understanding the Safety Boundaries of DeepSeek Models: Evaluation and Findings [51.65890794988425]
This study presents the first comprehensive safety evaluation of the DeepSeek models. Our evaluation encompasses DeepSeek's latest generation of large language models, multimodal large language models, and text-to-image models.
arXiv Detail & Related papers (2025-03-19T10:44:37Z)
Safety Evaluation of DeepSeek Models in Chinese Contexts [12.297396865203973]
This study introduces CHiSafetyBench, a Chinese-specific safety evaluation benchmark. This benchmark systematically evaluates the safety of DeepSeek-R1 and DeepSeek-V3 in Chinese contexts. The experimental results quantify the deficiencies of these two models in Chinese contexts, providing key insights for subsequent improvements.
arXiv Detail & Related papers (2025-02-16T14:05:54Z)
ELITE: Enhanced Language-Image Toxicity Evaluation for Safety [22.371913404553545]
Current Vision Language Models (VLMs) remain vulnerable to malicious prompts that induce harmful outputs. Existing benchmarks have low levels of harmfulness, ambiguous data, and limited diversity in image-text pair combinations. We propose the ELITE benchmark, a high-quality safety evaluation benchmark for VLMs, underpinned by our enhanced evaluation method, the ELITE evaluator.
arXiv Detail & Related papers (2025-02-07T08:43:15Z)
Safe to Serve: Aligning Instruction-Tuned Models for Safety and Helpfulness [0.0]
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning and text generation. LLMs can inadvertently generate unsafe or biased responses when prompted with problematic inputs. This research addresses the critical challenge of developing language models that generate both helpful and harmless content.
arXiv Detail & Related papers (2024-11-26T06:52:22Z)
SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models [75.67623347512368]
We propose SafeBench, a comprehensive framework designed for conducting safety evaluations of MLLMs. Our framework consists of a comprehensive harmful query dataset and an automated evaluation protocol. Based on our framework, we conducted large-scale experiments on 15 widely-used open-source MLLMs and 6 commercial MLLMs.
arXiv Detail & Related papers (2024-10-24T17:14:40Z)
Multimodal Situational Safety [73.63981779844916]
We present the first evaluation and analysis of a novel safety challenge termed Multimodal Situational Safety. For an MLLM to respond safely, whether through language or action, it often needs to assess the safety implications of a language query within its corresponding visual context. We develop the Multimodal Situational Safety benchmark (MSSBench) to assess the situational safety performance of current MLLMs.
arXiv Detail & Related papers (2024-10-08T16:16:07Z)
Feasibility Consistent Representation Learning for Safe Reinforcement Learning [25.258227763316228]
We introduce a novel framework named Feasibility Consistent Safe Reinforcement Learning (FCSRL). This framework combines representation learning with feasibility-oriented objectives to identify and extract safety-related information from the raw state for safe RL. Our method is capable of learning a better safety-aware embedding and achieving superior performance than previous representation learning baselines.
arXiv Detail & Related papers (2024-05-20T01:37:21Z)
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models [107.82336341926134]
SALAD-Bench is a safety benchmark specifically designed for evaluating Large Language Models (LLMs). It transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities.
arXiv Detail & Related papers (2024-02-07T17:33:54Z)
The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness [56.174255970895466]
Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications. This paper presents the Safety and Over-Defensiveness Evaluation (SODE) benchmark.
arXiv Detail & Related papers (2023-12-30T17:37:06Z)
Safety Assessment of Chinese Large Language Models [51.83369778259149]
Large language models (LLMs) may generate insulting and discriminatory content, reflect incorrect social values, and may be used for malicious purposes. To promote the deployment of safe, responsible, and ethical AI, we release SafetyPrompts including 100k augmented prompts and responses by LLMs.
arXiv Detail & Related papers (2023-04-20T16:27:35Z)
Towards Safer Generative Language Models: A Survey on Safety Risks, Evaluations, and Improvements [76.80453043969209]
This survey presents a framework for safety research pertaining to large models. We begin by introducing safety issues of wide concern, then delve into safety evaluation methods for large models. We explore the strategies for enhancing large model safety from training to deployment.
arXiv Detail & Related papers (2023-02-18T09:32:55Z)
Safety design concepts for statistical machine learning components toward accordance with functional safety standards [0.38073142980732994]
In recent years, crucial incidents and accidents have been reported due to misjudgments by statistical machine learning. In this paper, we organize five kinds of technical safety concepts (TSCs) for components toward accordance with functional safety standards.
arXiv Detail & Related papers (2020-08-04T01:01:00Z)