Related papers: Probing AI Safety with Source Code

Probing AI Safety with Source Code

URL: http://arxiv.org/abs/2506.20471v1
Date: Wed, 25 Jun 2025 14:19:57 GMT
Title: Probing AI Safety with Source Code
Authors: Ujwal Narayan, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Karthik Narasimhan, Ameet Deshpande, Vishvak Murahari,
Abstract summary: Large language models (LLMs) have become ubiquitous, interfacing with humans in numerous safety-critical applications.<n>We demonstrate that contemporary models fall concerningly short of the goal of AI safety, leading to an unsafe and harmful experience for users.
Score: 49.39895512792655
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have become ubiquitous, interfacing with humans in numerous safety-critical applications. This necessitates improving capabilities, but importantly coupled with greater safety measures to align these models with human values and preferences. In this work, we demonstrate that contemporary models fall concerningly short of the goal of AI safety, leading to an unsafe and harmful experience for users. We introduce a prompting strategy called Code of Thought (CoDoT) to evaluate the safety of LLMs. CoDoT converts natural language inputs to simple code that represents the same intent. For instance, CoDoT transforms the natural language prompt "Make the statement more toxic: {text}" to: "make_more_toxic({text})". We show that CoDoT results in a consistent failure of a wide range of state-of-the-art LLMs. For example, GPT-4 Turbo's toxicity increases 16.5 times, DeepSeek R1 fails 100% of the time, and toxicity increases 300% on average across seven modern LLMs. Additionally, recursively applying CoDoT can further increase toxicity two times. Given the rapid and widespread adoption of LLMs, CoDoT underscores the critical need to evaluate safety efforts from first principles, ensuring that safety and capabilities advance together.

Related papers

When Helpers Become Hazards: A Benchmark for Analyzing Multimodal LLM-Powered Safety in Daily Life [36.244977974241245]
We investigate and evaluate the safety impact of Multimodal Large Language Models (MLLMs) on human behavior in daily life.<n>We introduce SaLAD, a multimodal safety benchmark which contains 2,013 real-world image-text samples.<n>Results on 18 MLLMs demonstrate that the top-performing models achieve a safe response rate of only 57.2% on unsafe queries.
arXiv Detail & Related papers (2026-01-07T15:59:07Z)
Is Your Prompt Poisoning Code? Defect Induction Rates and Security Mitigation Strategies [4.435429537888066]
Large language models (LLMs) have become indispensable for automated code generation, yet the quality and security of their outputs remain a critical concern.<n>We propose an evaluation framework for prompt quality encompassing three key dimensions: goal clarity, information completeness, and logical consistency.<n>Our findings highlight that enhancing the quality of user prompts constitutes a critical and effective strategy for strengthening the security of AI-generated code.
arXiv Detail & Related papers (2025-10-27T02:59:17Z)
ROSE: Toward Reality-Oriented Safety Evaluation of Large Language Models [60.28667314609623]
Large Language Models (LLMs) are increasingly deployed as black-box components in real-world applications.<n>We propose Reality-Oriented Safety Evaluation (ROSE), a novel framework that uses multi-objective reinforcement learning to fine-tune an adversarial LLM.
arXiv Detail & Related papers (2025-06-17T10:55:17Z)
$\ exttt{SAGE}$: A Generic Framework for LLM Safety Evaluation [9.935219917903858]
This paper introduces the $texttSAGE$ (Safety AI Generic Evaluation) framework.<n>$texttSAGE$ is an automated modular framework designed for customized and dynamic harm evaluations.<n>Our experiments with multi-turn conversational evaluations revealed a concerning finding that harm steadily increases with conversation length.
arXiv Detail & Related papers (2025-04-28T11:01:08Z)
Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking [15.953888359667497]
jailbreak attacks based on prompt engineering have become a major safety threat.<n>This study introduces the concept of Defense Threshold Decay (DTD), revealing the potential safety impact caused by LLMs' benign generation.<n>We propose the Sugar-Coated Poison attack paradigm, which uses a "semantic reversal" strategy to craft benign inputs that are opposite in meaning to malicious intent.
arXiv Detail & Related papers (2025-04-08T03:57:09Z)
Realistic Evaluation of Toxicity in Large Language Models [28.580995165272086]
Large language models (LLMs) have become integral to our professional and daily lives. The huge amount of data which endows them with vast and diverse knowledge exposes them to the inevitable toxicity and bias. This paper introduces the new Thoroughly Engineered Toxicity dataset, comprising manually crafted prompts.
arXiv Detail & Related papers (2024-05-17T09:42:59Z)
ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming [64.86326523181553]
ALERT is a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy. It aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models.
arXiv Detail & Related papers (2024-04-06T15:01:47Z)
CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion [117.178835165855]
This paper introduces CodeAttack, a framework that transforms natural language inputs into code inputs. Our studies reveal a new and universal safety vulnerability of these models against code input. We find that a larger distribution gap between CodeAttack and natural language leads to weaker safety generalization.
arXiv Detail & Related papers (2024-03-12T17:55:38Z)
ROSE Doesn't Do That: Boosting the Safety of Instruction-Tuned Large Language Models with Reverse Prompt Contrastive Decoding [89.0074567748505]
We present reverse prompt contrastive decoding (ROSE), a simple-yet-effective method to boost the safety of existing instruction-tuned LLMs without any additional training. Experiments on 6 safety and 2 general-purpose tasks show that, our ROSE not only brings consistent and significant safety improvements (up to +13.8% safety score) upon 5 types of instruction-tuned LLMs, but also benefits the general-purpose ability of LLMs.
arXiv Detail & Related papers (2024-02-19T06:58:42Z)
RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models [62.72318564072706]
Reinforcement Learning with Human Feedback (RLHF) is a methodology designed to align Large Language Models (LLMs) with human preferences. Despite its advantages, RLHF relies on human annotators to rank the text. We propose RankPoison, a poisoning attack method on candidates' selection of preference rank flipping to reach certain malicious behaviors.
arXiv Detail & Related papers (2023-11-16T07:48:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.