Related papers: RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration

RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration

URL: http://arxiv.org/abs/2509.25271v4
Date: Thu, 23 Oct 2025 03:33:26 GMT
Title: RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration
Authors: Xiuyuan Chen, Jian Zhao, Yuchen Yuan, Tianle Zhang, Huilin Zhou, Zheng Zhu, Ping Hu, Linghe Kong, Chi Zhang, Weiran Huang, Xuelong Li,
Abstract summary: Existing safety evaluation methods for large language models (LLMs) suffer from inherent limitations.<n>We introduce a theoretical framework that reconstructs the underlying risk concept space.<n>We propose RADAR, a multi-agent collaborative evaluation framework.
Score: 81.38705556267917
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing safety evaluation methods for large language models (LLMs) suffer from inherent limitations, including evaluator bias and detection failures arising from model homogeneity, which collectively undermine the robustness of risk evaluation processes. This paper seeks to re-examine the risk evaluation paradigm by introducing a theoretical framework that reconstructs the underlying risk concept space. Specifically, we decompose the latent risk concept space into three mutually exclusive subspaces: the explicit risk subspace (encompassing direct violations of safety guidelines), the implicit risk subspace (capturing potential malicious content that requires contextual reasoning for identification), and the non-risk subspace. Furthermore, we propose RADAR, a multi-agent collaborative evaluation framework that leverages multi-round debate mechanisms through four specialized complementary roles and employs dynamic update mechanisms to achieve self-evolution of risk concept distributions. This approach enables comprehensive coverage of both explicit and implicit risks while mitigating evaluator bias. To validate the effectiveness of our framework, we construct an evaluation dataset comprising 800 challenging cases. Extensive experiments on our challenging testset and public benchmarks demonstrate that RADAR significantly outperforms baseline evaluation methods across multiple dimensions, including accuracy, stability, and self-evaluation risk sensitivity. Notably, RADAR achieves a 28.87% improvement in risk identification accuracy compared to the strongest baseline evaluation method.

Related papers

SafeRBench: A Comprehensive Benchmark for Safety Assessment in Large Reasoning Models [60.8821834954637]
We present SafeRBench, the first benchmark that assesses LRM safety end-to-end.<n>We pioneer the incorporation of risk categories and levels into input design.<n>We introduce a micro-thought chunking mechanism to segment long reasoning traces into semantically coherent units.
arXiv Detail & Related papers (2025-11-19T06:46:33Z)
Unveiling Trust in Multimodal Large Language Models: Evaluation, Analysis, and Mitigation [51.19622266249408]
MultiTrust-X is a benchmark for evaluating, analyzing, and mitigating the trustworthiness issues of MLLMs.<n>Based on the taxonomy, MultiTrust-X includes 32 tasks and 28 curated datasets.<n>Our experiments reveal significant vulnerabilities in current models.
arXiv Detail & Related papers (2025-08-21T09:00:01Z)
Risk-sensitive Actor-Critic with Static Spectral Risk Measures for Online and Offline Reinforcement Learning [4.8342038441006805]
We propose a novel framework for optimizing static Spectral Risk Measures (SRM)<n>Our algorithms consistently outperform existing risk-sensitive methods in both online and offline environments across diverse domains.
arXiv Detail & Related papers (2025-07-05T04:41:54Z)
Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models [29.569220030102986]
We introduce textbfBeyond Safe Answers (BSA) bench, a novel benchmark comprising 2,000 challenging instances organized into three distinct SSA scenario types.<n> Evaluations of 19 state-of-the-art LRMs demonstrate the difficulty of this benchmark, with top-performing models achieving only 38.0% accuracy in correctly identifying risk rationales.<n>Our work provides a comprehensive assessment tool for evaluating and improving safety reasoning fidelity in LRMs, advancing the development of genuinely risk-aware and reliably safe AI systems.
arXiv Detail & Related papers (2025-05-26T08:49:19Z)
Adapting Probabilistic Risk Assessment for AI [0.0]
General-purpose artificial intelligence (AI) systems present an urgent risk management challenge.<n>Current methods often rely on selective testing and undocumented assumptions about risk priorities.<n>This paper introduces the probabilistic risk assessment (PRA) for AI framework.
arXiv Detail & Related papers (2025-04-25T17:59:14Z)
AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons [62.374792825813394]
This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability.<n>The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior in 12 hazard categories.
arXiv Detail & Related papers (2025-02-19T05:58:52Z)
Beyond CVaR: Leveraging Static Spectral Risk Measures for Enhanced Decision-Making in Distributional Reinforcement Learning [4.8342038441006805]
In domains such as finance, healthcare, and robotics, managing worst-case scenarios is critical.<n> Distributional Reinforcement Learning (DRL) provides a natural framework to incorporate risk sensitivity into decision-making processes.<n>We present a novel DRL algorithm with convergence guarantees that optimize for a broader class of static Spectral Risk Measures (SRM)
arXiv Detail & Related papers (2025-01-03T20:25:41Z)
Risk-Averse Certification of Bayesian Neural Networks [70.44969603471903]
We propose a Risk-Averse Certification framework for Bayesian neural networks called RAC-BNN.<n>Our method leverages sampling and optimisation to compute a sound approximation of the output set of a BNN.<n>We validate RAC-BNN on a range of regression and classification benchmarks and compare its performance with a state-of-the-art method.
arXiv Detail & Related papers (2024-11-29T14:22:51Z)
Extreme Risk Mitigation in Reinforcement Learning using Extreme Value Theory [10.288413564829579]
A critical aspect of risk awareness involves modeling highly rare risk events (rewards) that could potentially lead to catastrophic outcomes. While risk-aware RL techniques do exist, their level of risk aversion heavily relies on the precision of the state-action value function estimation. Our work proposes to enhance the resilience of RL agents when faced with very rare and risky events by focusing on refining the predictions of the extreme values predicted by the state-action value function distribution.
arXiv Detail & Related papers (2023-08-24T18:23:59Z)
Learning Bounds for Risk-sensitive Learning [86.50262971918276]
In risk-sensitive learning, one aims to find a hypothesis that minimizes a risk-averse (or risk-seeking) measure of loss. We study the generalization properties of risk-sensitive learning schemes whose optimand is described via optimized certainty equivalents.
arXiv Detail & Related papers (2020-06-15T05:25:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.