Related papers: Evaluating Proactive Risk Awareness of Large Language Models

Evaluating Proactive Risk Awareness of Large Language Models

URL: http://arxiv.org/abs/2602.20976v1
Date: Tue, 24 Feb 2026 15:00:00 GMT
Title: Evaluating Proactive Risk Awareness of Large Language Models
Authors: Xuan Luo, Yubin Chen, Zhiyu Hou, Linpu Yu, Geng Tu, Jing Li, Ruifeng Xu,
Abstract summary: We introduce a proactive risk awareness evaluation framework that measures whether large language models can anticipate potential harms and provide warnings before damage occurs.<n>We construct the Butterfly dataset to instantiate this framework in the environmental and ecological domain.
Score: 30.312744244385822
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: As large language models (LLMs) are increasingly embedded in everyday decision-making, their safety responsibilities extend beyond reacting to explicit harmful intent toward anticipating unintended but consequential risks. In this work, we introduce a proactive risk awareness evaluation framework that measures whether LLMs can anticipate potential harms and provide warnings before damage occurs. We construct the Butterfly dataset to instantiate this framework in the environmental and ecological domain. It contains 1,094 queries that simulate ordinary solution-seeking activities whose responses may induce latent ecological impact. Through experiments across five widely used LLMs, we analyze the effects of response length, languages, and modality. Experimental results reveal consistent, significant declines in proactive awareness under length-restricted responses, cross-lingual similarities, and persistent blind spots in (multimodal) species protection. These findings highlight a critical gap between current safety alignment and the requirements of real-world ecological responsibility, underscoring the need for proactive safeguards in LLM deployment.

Related papers

When Helpers Become Hazards: A Benchmark for Analyzing Multimodal LLM-Powered Safety in Daily Life [36.244977974241245]
We investigate and evaluate the safety impact of Multimodal Large Language Models (MLLMs) on human behavior in daily life.<n>We introduce SaLAD, a multimodal safety benchmark which contains 2,013 real-world image-text samples.<n>Results on 18 MLLMs demonstrate that the top-performing models achieve a safe response rate of only 57.2% on unsafe queries.
arXiv Detail & Related papers (2026-01-07T15:59:07Z)
MENTOR: A Metacognition-Driven Self-Evolution Framework for Uncovering and Mitigating Implicit Risks in LLMs on Domain Tasks [17.598413159363393]
Current alignment efforts primarily target explicit risks such as bias, hate speech, and violence.<n>We propose MENTOR: A MEtacognition-driveN self-evoluTion framework for uncOvering and mitigating implicit risks in large language models.<n>We release a supporting dataset of 9,000 risk queries spanning education, finance, and management to enhance domain-specific risk identification.
arXiv Detail & Related papers (2025-11-10T13:51:51Z)
Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation [69.63626052852153]
We propose a proof-of-concept framework that projects how model-generated advice could propagate through societal systems.<n>We also introduce a dataset of 100 indirect harm scenarios, testing models' ability to foresee adverse, non-obvious outcomes from seemingly harmless user prompts.
arXiv Detail & Related papers (2025-06-26T02:28:58Z)
ROSE: Toward Reality-Oriented Safety Evaluation of Large Language Models [60.28667314609623]
Large Language Models (LLMs) are increasingly deployed as black-box components in real-world applications.<n>We propose Reality-Oriented Safety Evaluation (ROSE), a novel framework that uses multi-objective reinforcement learning to fine-tune an adversarial LLM.
arXiv Detail & Related papers (2025-06-17T10:55:17Z)
SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge [11.63268709958876]
SOSBench is a regulation-grounded, hazard-focused benchmark for large language models.<n>It covers six high-risk scientific domains: chemistry, biology, medicine, pharmacology, physics, and psychology.<n>We evaluate frontier models within a unified evaluation framework using our SOSBench.
arXiv Detail & Related papers (2025-05-27T17:47:08Z)
A Survey on Responsible LLMs: Inherent Risk, Malicious Use, and Mitigation Strategy [31.839815402460918]
Large language models (LLMs) present significant potential for supporting numerous real-world applications.<n>They still face significant challenges in terms of the inherent risk of privacy leakage, hallucinated outputs, and value misalignment.
arXiv Detail & Related papers (2025-01-16T09:59:45Z)
Risk-Averse Finetuning of Large Language Models [15.147772383812313]
We propose integrating risk-averse principles into Large Language Models (LLMs) fine-tuning to minimize the occurrence of harmful outputs.<n> Empirical evaluations on sentiment modification and toxicity mitigation tasks demonstrate the efficacy of risk-averse reinforcement learning with human feedback.
arXiv Detail & Related papers (2025-01-12T19:48:21Z)
Persuasion with Large Language Models: a Survey [49.86930318312291]
Large Language Models (LLMs) have created new disruptive possibilities for persuasive communication. In areas such as politics, marketing, public health, e-commerce, and charitable giving, such LLM Systems have already achieved human-level or even super-human persuasiveness. Our survey suggests that the current and future potential of LLM-based persuasion poses profound ethical and societal risks.
arXiv Detail & Related papers (2024-11-11T10:05:52Z)
SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models [75.67623347512368]
We propose toolns, a comprehensive framework designed for conducting safety evaluations of MLLMs. Our framework consists of a comprehensive harmful query dataset and an automated evaluation protocol. Based on our framework, we conducted large-scale experiments on 15 widely-used open-source MLLMs and 6 commercial MLLMs.
arXiv Detail & Related papers (2024-10-24T17:14:40Z)
Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress. Our investigation exposes a critical oversight in this belief. By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z)
ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming [64.86326523181553]
ALERT is a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy. It aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models.
arXiv Detail & Related papers (2024-04-06T15:01:47Z)
A Chinese Dataset for Evaluating the Safeguards in Large Language Models [46.43476815725323]
Large language models (LLMs) can produce harmful responses. This paper introduces a dataset for the safety evaluation of Chinese LLMs. We then extend it to two other scenarios that can be used to better identify false negative and false positive examples.
arXiv Detail & Related papers (2024-02-19T14:56:18Z)
Risks of AI Scientists: Prioritizing Safeguarding Over Autonomy [65.77763092833348]
This perspective examines vulnerabilities in AI scientists, shedding light on potential risks associated with their misuse.<n>We take into account user intent, the specific scientific domain, and their potential impact on the external environment.<n>We propose a triadic framework involving human regulation, agent alignment, and an understanding of environmental feedback.
arXiv Detail & Related papers (2024-02-06T18:54:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.