Towards Evaluating Proactive Risk Awareness of Multimodal Language Models
- URL: http://arxiv.org/abs/2505.17455v1
- Date: Fri, 23 May 2025 04:28:47 GMT
- Title: Towards Evaluating Proactive Risk Awareness of Multimodal Language Models
- Authors: Youliang Yuan, Wenxiang Jiao, Yuejin Xie, Chihao Shen, Menghan Tian, Wenxuan Wang, Jen-tse Huang, Pinjia He
- Abstract summary: A proactive safety artificial intelligence (AI) system would work better than a reactive one. PaSBench evaluates this capability through 416 multimodal scenarios. Top performers like Gemini-2.5-pro achieve 71% image and 64% text accuracy, but miss 45-55% of risks in repeated trials.
- Score: 38.55193215852595
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human safety awareness gaps often prevent the timely recognition of everyday risks. To address this problem, a proactive safety artificial intelligence (AI) system would work better than a reactive one: instead of just reacting to users' questions, it would actively watch people's behavior and their environment to detect potential dangers in advance. Our Proactive Safety Bench (PaSBench) evaluates this capability through 416 multimodal scenarios (128 image sequences, 288 text logs) spanning 5 safety-critical domains. Evaluation of 36 advanced models reveals fundamental limitations: top performers like Gemini-2.5-pro achieve 71% image and 64% text accuracy, but miss 45-55% of risks in repeated trials. Through failure analysis, we identify unstable proactive reasoning, rather than knowledge deficits, as the primary limitation. This work establishes (1) a proactive safety benchmark, (2) systematic evidence of model limitations, and (3) critical directions for developing reliable protective AI. We believe our dataset and findings can promote the development of safer AI assistants that actively prevent harm rather than merely respond to requests. Our dataset can be found at https://huggingface.co/datasets/Youliang/PaSBench.
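For readers who want to try the benchmark, the sketch below loads the released dataset from the Hugging Face Hub and inspects its structure. Only the repository name (Youliang/PaSBench) comes from the abstract; the available splits and column names are whatever the dataset card actually exposes, and the fields mentioned in the comments are assumptions rather than documented schema.

```python
# Minimal sketch: load PaSBench and inspect its structure before writing an
# evaluation loop. Requires the `datasets` library (pip install datasets).
from datasets import load_dataset

# The repository name comes from the paper's abstract. If the dataset defines
# multiple configs, a config name may need to be passed as a second argument.
ds = load_dataset("Youliang/PaSBench")

print(ds)  # available splits and their sizes
for split_name, split in ds.items():
    print(split_name, split.column_names)  # e.g. scenario content, domain, risk label (assumed)
    print(split[0])                        # one of the 416 multimodal scenarios
    break
```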
Related papers
- FORTRESS: Frontier Risk Evaluation for National Security and Public Safety [5.544163262906087]
Current benchmarks often fail to test safeguard robustness to potential national security and public safety risks. We introduce FORTRESS: 500 expert-crafted adversarial prompts with instance-based rubrics of 4-7 binary questions. Each prompt-rubric pair has a corresponding benign version to test for model over-refusals.
arXiv Detail & Related papers (2025-06-17T19:08:02Z)
- Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models [29.569220030102986]
We introduce Beyond Safe Answers (BSA), a novel benchmark comprising 2,000 challenging instances organized into three distinct SSA scenario types. Evaluations of 19 state-of-the-art LRMs demonstrate the difficulty of this benchmark, with top-performing models achieving only 38.0% accuracy in correctly identifying risk rationales. Our work provides a comprehensive assessment tool for evaluating and improving safety reasoning fidelity in LRMs, advancing the development of genuinely risk-aware and reliably safe AI systems.
arXiv Detail & Related papers (2025-05-26T08:49:19Z)
- ReasoningShield: Content Safety Detection over Reasoning Traces of Large Reasoning Models [19.963759799471568]
Large Reasoning Models (LRMs) are transforming the AI landscape with advanced reasoning capabilities. While the generated reasoning traces enhance model transparency, they can still contain unsafe content, even when the final answer appears safe. Existing moderation tools, primarily designed for question-answer (QA) pairs, are empirically ineffective at detecting hidden risks embedded in reasoning traces. We propose ReasoningShield, the first safety detection model tailored to identify potential risks in the reasoning trace before reaching the final answer.
arXiv Detail & Related papers (2025-05-22T19:44:41Z)
- SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models [50.34706204154244]
Acquiring reasoning capabilities catastrophically degrades inherited safety alignment. Certain scenarios suffer 25 times higher attack rates. Despite tight reasoning-answer safety coupling, MLRMs demonstrate nascent self-correction.
arXiv Detail & Related papers (2025-04-09T06:53:23Z)
- Assessing confidence in frontier AI safety cases [37.839615078345886]
A safety case presents a structured argument in support of a top-level claim about a safety property of the system. This raises the question of what level of confidence should be associated with a top-level claim. We propose a method by which AI developers can prioritise argument defeaters, thereby making their investigation of those defeaters more efficient.
arXiv Detail & Related papers (2025-02-09T06:35:11Z)
- OpenAI o1 System Card [274.83891368890977]
The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.
arXiv Detail & Related papers (2024-12-21T18:04:31Z)
- Quantifying detection rates for dangerous capabilities: a theoretical model of dangerous capability evaluations [47.698233647783965]
We present a quantitative model for tracking dangerous AI capabilities over time. Our goal is to help the policy and research community visualise how dangerous capability testing can give us an early warning about approaching AI risks.
arXiv Detail & Related papers (2024-12-19T22:31:34Z)
- What AI evaluations for preventing catastrophic risks can and cannot do [2.07180164747172]
We argue that evaluations face fundamental limitations that cannot be overcome within the current paradigm. This means that while evaluations are valuable tools, we should not rely on them as our main way of ensuring AI systems are safe.
arXiv Detail & Related papers (2024-11-26T18:00:36Z)
- Criticality and Safety Margins for Reinforcement Learning [53.10194953873209]
We seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users.
We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions.
We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality.
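This definition of true criticality maps directly onto a Monte Carlo estimate: compare the expected return when the agent follows its policy throughout with the expected return when it first takes n random actions from the state of interest. The sketch below is one reading of that definition, not the authors' implementation; the environment and policy interface (reset_to, step, sample_action) is hypothetical.

```python
def episode_return(env, policy, start_state, n_random=0):
    """Run one episode from `start_state` and return the cumulative reward.
    The first `n_random` steps take uniformly random actions; afterwards the
    agent follows `policy`. The env interface here is hypothetical."""
    obs = env.reset_to(start_state)
    total, done, t = 0.0, False, 0
    while not done:
        action = env.sample_action() if t < n_random else policy(obs)
        obs, reward, done = env.step(action)
        total += reward
        t += 1
    return total


def true_criticality(env, policy, state, n, num_samples=100):
    """Monte Carlo estimate of true criticality at `state`: the expected drop
    in return when the agent takes n consecutive random actions at `state`
    before resuming its policy, relative to following the policy throughout."""
    on_policy = sum(episode_return(env, policy, state) for _ in range(num_samples)) / num_samples
    perturbed = sum(episode_return(env, policy, state, n_random=n) for _ in range(num_samples)) / num_samples
    return on_policy - perturbed
```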
arXiv Detail & Related papers (2024-09-26T21:00:45Z)
- EARBench: Towards Evaluating Physical Risk Awareness for Task Planning of Foundation Model-based Embodied AI Agents [53.717918131568936]
Embodied artificial intelligence (EAI) integrates advanced AI models into physical entities for real-world interaction. Foundation models as the "brain" of EAI agents for high-level task planning have shown promising results. However, the deployment of these agents in physical environments presents significant safety challenges. This study introduces EARBench, a novel framework for automated physical risk assessment in EAI scenarios.
arXiv Detail & Related papers (2024-08-08T13:19:37Z)
- Evaluating Frontier Models for Dangerous Capabilities [59.129424649740855]
We introduce a programme of "dangerous capability" evaluations and pilot them on Gemini 1.0 models.
Our evaluations cover four areas: (1) persuasion and deception; (2) cyber-security; (3) self-proliferation; and (4) self-reasoning.
Our goal is to help advance a rigorous science of dangerous capability evaluation, in preparation for future models.
arXiv Detail & Related papers (2024-03-20T17:54:26Z)
- Coordinated pausing: An evaluation-based coordination scheme for frontier AI developers [0.2913760942403036]
This paper focuses on one possible response: coordinated pausing.
It proposes an evaluation-based coordination scheme that consists of five main steps.
It concludes that coordinated pausing is a promising mechanism for tackling emerging risks from frontier AI models.
arXiv Detail & Related papers (2023-09-30T13:38:33Z)