Related papers: CoDefeater: Using LLMs To Find Defeaters in Assurance Cases

Related papers

Making LLMs Reliable When It Matters Most: A Five-Layer Architecture for High-Stakes Decisions [51.56484100374058]
Current large language models (LLMs) excel in verifiable domains where outputs can be checked before action but prove less reliable for high-stakes strategic decisions with uncertain outcomes.<n>This gap, driven by mutually cognitive biases in both humans and artificial intelligence (AI) systems, threatens the defensibility of valuations and sustainability of investments in the sector.<n>This report describes a framework emerging from systematic qualitative assessment across 7 frontier-grade LLMs and 3 market-facing venture vignettes under time pressure.
arXiv Detail & Related papers (2025-11-10T22:24:21Z)
When Models Outthink Their Safety: Mitigating Self-Jailbreak in Large Reasoning Models with Chain-of-Guardrails [74.63933201261595]
Large Reasoning Models (LRMs) demonstrate remarkable capabilities on complex reasoning tasks.<n>LRMs remain vulnerable to severe safety risks, including harmful content generation and jailbreak attacks.<n>We propose the Chain-of-Guardrail (CoG), a training framework that recomposes or backtracks unsafe reasoning steps.
arXiv Detail & Related papers (2025-10-24T09:32:25Z)
Beyond Over-Refusal: Scenario-Based Diagnostics and Post-Hoc Mitigation for Exaggerated Refusals in LLMs [10.896368527058714]
Large language models (LLMs) frequently produce false refusals, declining benign requests that contain terms resembling unsafe queries.<n>We introduce two comprehensive benchmarks: the Exaggerated Safety Benchmark (XSB) for single-turn prompts, annotated with "Focus" keywords that identify refusal-inducing triggers, and the Multi-turn Scenario-based Exaggerated Safety Benchmark (MS-XSB)<n>Our benchmarks reveal that exaggerated refusals persist across diverse recent LLMs and are especially pronounced in complex, multi-turn scenarios.
arXiv Detail & Related papers (2025-10-09T12:38:16Z)
ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning [49.47193675702453]
Large Language Models (LLMs) have demonstrated remarkable generative capabilities.<n>LLMs remain vulnerable to malicious instructions that can bypass safety constraints.<n>We propose a reasoning-based safety alignment framework, ARMOR, that replaces the ad-hoc chains of thought reasoning process with human-aligned, structured one.
arXiv Detail & Related papers (2025-07-14T09:05:54Z)
Benchmarking Misuse Mitigation Against Covert Adversaries [80.74502950627736]
Existing language model safety evaluations focus on overt attacks and low-stakes tasks.<n>We develop Benchmarks for Stateful Defenses (BSD), a data generation pipeline that automates evaluations of covert attacks and corresponding defenses.<n>Our evaluations indicate that decomposition attacks are effective misuse enablers, and highlight stateful defenses as a countermeasure.
arXiv Detail & Related papers (2025-06-06T17:33:33Z)
Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs [83.11815479874447]
We propose a novel jailbreak attack framework, inspired by cognitive decomposition and biases in human cognition.<n>We employ cognitive decomposition to reduce the complexity of malicious prompts and relevance bias to reorganize prompts.<n>We also introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm.
arXiv Detail & Related papers (2025-05-03T05:28:11Z)
Exposing the Ghost in the Transformer: Abnormal Detection for Large Language Models via Hidden State Forensics [5.384257830522198]
Large Language Models (LLMs) in critical applications have introduced severe reliability and security risks. These vulnerabilities have been weaponized by malicious actors, leading to unauthorized access, widespread misinformation, and compromised system integrity. We introduce a novel approach to detecting abnormal behaviors in LLMs via hidden state forensics.
arXiv Detail & Related papers (2025-04-01T05:58:14Z)
Adversarial Reasoning at Jailbreaking Time [49.70772424278124]
We develop an adversarial reasoning approach to automatic jailbreaking via test-time computation. Our approach introduces a new paradigm in understanding LLM vulnerabilities, laying the foundation for the development of more robust and trustworthy AI systems.
arXiv Detail & Related papers (2025-02-03T18:59:01Z)
A Taxonomy of Real-World Defeaters in Safety Assurance Cases [4.4398355848251745]
The software engineering community could benefit from having a reusable classification of real-world defeaters in software assurance cases. We derived a taxonomy with seven broad categories, laying the groundwork for standardizing the analysis and management of defeaters in safety-critical systems.
arXiv Detail & Related papers (2025-02-01T00:38:41Z)
The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense [56.32083100401117]
We investigate why Vision Large Language Models (VLLMs) are prone to jailbreak attacks. We then make a key observation: existing defense mechanisms suffer from an textbfover-prudence problem. We find that the two representative evaluation methods for jailbreak often exhibit chance agreement.
arXiv Detail & Related papers (2024-11-13T07:57:19Z)
SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models [75.67623347512368]
We propose toolns, a comprehensive framework designed for conducting safety evaluations of MLLMs. Our framework consists of a comprehensive harmful query dataset and an automated evaluation protocol. Based on our framework, we conducted large-scale experiments on 15 widely-used open-source MLLMs and 6 commercial MLLMs.
arXiv Detail & Related papers (2024-10-24T17:14:40Z)
Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level [10.658844160259104]
Large language models (LLMs) have demonstrated immense utility across various industries. As LLMs advance, the risk of harmful outputs increases due to incorrect or malicious instruction prompts. This paper examines the LLMs' capability to recognize harmful outputs, revealing and quantifying their proficiency in assessing the danger of previous tokens.
arXiv Detail & Related papers (2024-10-09T12:09:30Z)
Automatic Instantiation of Assurance Cases from Patterns Using Large Language Models [6.314768437420443]
Large Language Models (LLMs) can generate assurance cases that comply with specific patterns. LLMs exhibit potential in the automatic generation of assurance cases, but their capabilities still fall short compared to human experts.
arXiv Detail & Related papers (2024-10-07T20:58:29Z)
Automating Semantic Analysis of System Assurance Cases using Goal-directed ASP [1.2189422792863451]
We present our approach to enhancing Assurance 2.0 with semantic rule-based analysis capabilities. We examine the unique semantic aspects of assurance cases, such as logical consistency, adequacy, indefeasibility, etc.
arXiv Detail & Related papers (2024-08-21T15:22:43Z)
InferAct: Inferring Safe Actions for LLM-Based Agents Through Preemptive Evaluation and Human Feedback [70.54226917774933]
This paper introduces InferAct, a novel approach to proactively detect potential errors before risky actions are executed. InferAct acts as a human proxy, detecting unsafe actions and alerting users for intervention. Experiments on three widely-used tasks demonstrate the effectiveness of InferAct.
arXiv Detail & Related papers (2024-07-16T15:24:44Z)
A PRISMA-Driven Bibliometric Analysis of the Scientific Literature on Assurance Case Patterns [7.930875992631788]
Assurance cases can be used to prevent system failure. They are structured arguments that allow arguing and relaying various safety-critical systems' requirements.
arXiv Detail & Related papers (2024-07-06T05:00:49Z)
Jailbreaking as a Reward Misspecification Problem [80.52431374743998]
We propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process. We introduce a metric ReGap to quantify the extent of reward misspecification and demonstrate its effectiveness. We present ReMiss, a system for automated red teaming that generates adversarial prompts in a reward-misspecified space.
arXiv Detail & Related papers (2024-06-20T15:12:27Z)
Detectors for Safe and Reliable LLMs: Implementations, Uses, and Limitations [76.19419888353586]
Large language models (LLMs) are susceptible to a variety of risks, from non-faithful output to biased and toxic generations. We present our efforts to create and deploy a library of detectors: compact and easy-to-build classification models that provide labels for various harms.
arXiv Detail & Related papers (2024-03-09T21:07:16Z)
On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction. Inspired by these findings, we propose a method for safety prompt optimization, namely DRO. Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
arXiv Detail & Related papers (2024-01-31T17:28:24Z)
Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models [79.0183835295533]
We introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to assess the risk of such vulnerabilities. Our analysis identifies two key factors contributing to their success: LLMs' inability to distinguish between informational context and actionable instructions, and their lack of awareness in avoiding the execution of instructions within external content. We propose two novel defense mechanisms-boundary awareness and explicit reminder-to address these vulnerabilities in both black-box and white-box settings.
arXiv Detail & Related papers (2023-12-21T01:08:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.