Annotating the Chain-of-Thought: A Behavior-Labeled Dataset for AI Safety
- URL: http://arxiv.org/abs/2510.18154v1
- Date: Mon, 20 Oct 2025 23:12:12 GMT
- Title: Annotating the Chain-of-Thought: A Behavior-Labeled Dataset for AI Safety
- Authors: Antonio-Gabriel Chacón Menke, Phan Xuan Tan, Eiji Kamioka
- Abstract summary: We present a sentence-level labeled dataset that enables activation-based monitoring of safety behaviors. Our dataset contains reasoning sequences with sentence-level annotations of safety behaviors such as expression of safety concerns or speculation on user intent. We demonstrate the dataset's utility by extracting representations that both detect and steer safety behaviors in model activations.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work has highlighted the importance of monitoring chain-of-thought reasoning for AI safety; however, current approaches that analyze textual reasoning steps can miss subtle harmful patterns and may be circumvented by models that hide unsafe reasoning. We present a sentence-level labeled dataset that enables activation-based monitoring of safety behaviors during LLM reasoning. Our dataset contains reasoning sequences with sentence-level annotations of safety behaviors such as expression of safety concerns or speculation on user intent, which we use to extract steering vectors for detecting and influencing these behaviors within model activations. The dataset fills a key gap in safety research: while existing datasets label reasoning holistically, effective application of steering vectors for safety monitoring could be improved by identifying precisely when specific behaviors occur within reasoning chains. We demonstrate the dataset's utility by extracting representations that both detect and steer safety behaviors in model activations, showcasing the potential of activation-level techniques for improving safety oversight on reasoning. Content Warning: This paper discusses AI safety in the context of harmful prompts and may contain references to potentially harmful content.
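The following is a minimal, hypothetical sketch of the activation-level pipeline the abstract describes: extract a difference-of-means direction from sentence-level labels, then use it to detect and steer the behavior. Function names, array shapes, and the mean-pooling choice are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch, not the paper's implementation.
import numpy as np

def extract_steering_vector(acts: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """acts: (num_sentences, hidden_dim) pooled activations from one layer;
    labels: (num_sentences,) 1 if the sentence exhibits the behavior, else 0."""
    pos = acts[labels == 1].mean(axis=0)   # mean activation over labeled sentences
    neg = acts[labels == 0].mean(axis=0)   # mean over the rest
    v = pos - neg                          # difference-of-means direction
    return v / np.linalg.norm(v)

def detect(acts: np.ndarray, v: np.ndarray, threshold: float) -> np.ndarray:
    """Flag sentences whose activations project strongly onto the direction."""
    return acts @ v > threshold

def steer(hidden: np.ndarray, v: np.ndarray, alpha: float) -> np.ndarray:
    """Shift hidden states along the direction; alpha < 0 suppresses the behavior."""
    return hidden + alpha * v
```

Because the labels are per sentence rather than per trace, the projection score `acts @ v` can localize where in a reasoning chain a behavior such as expression of safety concerns occurs, which is exactly the gap the dataset targets.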
Related papers
- Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models [62.16655896700062]
Activation steering is a technique to enhance the utility of Large Language Models (LLMs). We show that it unintentionally introduces critical and under-explored safety risks. Experiments reveal that these interventions act as a force multiplier, creating new vulnerabilities to jailbreaks and increasing attack success rates to over 80% on standard benchmarks. (A sketch of the steering mechanism follows this entry.)
arXiv Detail & Related papers (2026-02-03T12:32:35Z)
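For readers unfamiliar with the mechanism this paper stress-tests, below is a generic, hedged sketch of inference-time activation steering via a forward hook. The Llama-style `model.model.layers` path and the tuple-shaped block output are assumptions about the model wrapper, not code from this paper.

```python
# Generic activation-steering hook (illustrative only).
import torch

def add_steering_hook(model, layer_idx: int, v: torch.Tensor, alpha: float):
    """Add alpha * v to the hidden states leaving one decoder block."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * v.to(device=hidden.device, dtype=hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    # Assumes a Llama-style module layout; adjust the path for other models.
    return model.model.layers[layer_idx].register_forward_hook(hook)

# handle = add_steering_hook(model, layer_idx=15, v=steering_vector, alpha=4.0)
# ... generate as usual, then handle.remove() to restore default behavior.
```

The paper's finding is that even benign choices of `v` and `alpha` in interventions like this can shift the model off its safety-aligned distribution and raise jailbreak success rates.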
- GAVEL: Towards rule-based safety through activation monitoring [2.337566423505956]
Large language models (LLMs) are increasingly paired with activation-based monitoring to detect and prevent harmful behaviors. Existing activation safety approaches, trained on broad misuse datasets, struggle with poor precision, limited flexibility, and lack of interpretability. This paper introduces a new paradigm: rule-based activation safety, inspired by rule-sharing practices in cybersecurity.
arXiv Detail & Related papers (2026-01-27T16:31:39Z)
- Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process [66.38541693477181]
We propose an unsupervised framework for discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors. By segmenting chain-of-thought traces into sentence-level 'steps', we uncover disentangled features corresponding to interpretable behaviors such as reflection and backtracking. We demonstrate the ability to control response confidence by identifying confidence-related vectors in the SAE decoder space. (A sketch of step-to-direction scoring follows this entry.)
arXiv Detail & Related papers (2025-12-30T05:09:11Z)
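A minimal sketch of the scoring step such a framework needs: compare each sentence-level reasoning step against candidate behavior directions (for example, rows of an SAE decoder matrix). All names and shapes below are assumptions for illustration, not the paper's code.

```python
# Illustrative step-to-direction scoring (not the paper's implementation).
import numpy as np

def score_steps(step_acts: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """step_acts: (num_steps, d) activations, one row per CoT sentence;
    directions: (num_features, d) candidate behavior directions.
    Returns (num_steps, num_features) cosine similarities."""
    a = step_acts / np.linalg.norm(step_acts, axis=1, keepdims=True)
    d = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    return a @ d.T

# Steps that load heavily on one direction (e.g., a putative 'backtracking'
# feature) can be read off and inspected to label that direction's behavior.
```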
- When Models Outthink Their Safety: Mitigating Self-Jailbreak in Large Reasoning Models with Chain-of-Guardrails [74.63933201261595]
Large Reasoning Models (LRMs) demonstrate remarkable capabilities on complex reasoning tasks. However, LRMs remain vulnerable to severe safety risks, including harmful content generation and jailbreak attacks. We propose the Chain-of-Guardrail (CoG), a training framework that recomposes or backtracks unsafe reasoning steps.
arXiv Detail & Related papers (2025-10-24T09:32:25Z)
- LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation [4.29885665563186]
LATENTGUARD is a framework that combines behavioral alignment with supervised latent space control for interpretable and precise safety steering. Our results show significant improvements in both safety controllability and response interpretability without compromising utility.
arXiv Detail & Related papers (2025-09-24T07:31:54Z)
- SafeSteer: Interpretable Safety Steering with Refusal-Evasion in LLMs [7.120986296945107]
This paper investigates an approach called SafeSteer for guiding the outputs of large language models (LLMs). We employ a simple, gradient-free unsupervised method to enhance safety steering while preserving text quality and topic relevance, without requiring explicit refusal.
arXiv Detail & Related papers (2025-06-01T01:19:37Z)
- Shape it Up! Restoring LLM Safety during Finetuning [66.46166656543761]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks. We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. We present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families. (A sketch of segment-weighted loss shaping follows this entry.)
arXiv Detail & Related papers (2025-05-22T18:05:16Z)
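A hedged sketch of the fine-grained shaping idea: weight each token's loss by a per-token safety signal so training reinforces safe segments and suppresses unsafe ones. The `safety_weight` input and function name are assumptions; this is not the STAR-DSS implementation.

```python
# Segment-weighted finetuning loss (illustrative only).
import torch
import torch.nn.functional as F

def shaped_loss(logits: torch.Tensor, target_ids: torch.Tensor,
                safety_weight: torch.Tensor) -> torch.Tensor:
    """logits: (T, vocab); target_ids: (T,); safety_weight: (T,) in [0, 1],
    near 1 for tokens in safe segments, near 0 for unsafe ones."""
    per_token = F.cross_entropy(logits, target_ids, reduction="none")  # (T,)
    return (safety_weight * per_token).sum() / safety_weight.sum().clamp(min=1e-8)
```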
- Advancing Neural Network Verification through Hierarchical Safety Abstract Interpretation [52.626086874715284]
We introduce a novel problem formulation called Abstract DNN-Verification, which verifies a hierarchical structure of unsafe outputs. By leveraging abstract interpretation and reasoning about output reachable sets, our approach enables assessing multiple safety levels during the formal verification process. Our contributions include a theoretical exploration of the relationship between our novel abstract safety formulation and existing approaches. (A sketch of interval bound propagation follows this entry.)
arXiv Detail & Related papers (2025-05-08T13:29:46Z)
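To make the reachable-set idea concrete, here is a tiny interval bound propagation sketch: push an input box through an affine layer and a ReLU, then check whether the output box can intersect an unsafe region. This is a standard textbook abstraction, deliberately simpler than the paper's hierarchical formulation.

```python
# Interval bound propagation through W x + b followed by ReLU (illustrative).
import numpy as np

def linear_bounds(lo, hi, W, b):
    """Propagate elementwise bounds lo <= x <= hi through W x + b."""
    center, radius = (lo + hi) / 2, (hi - lo) / 2
    c = W @ center + b
    r = np.abs(W) @ radius
    return c - r, c + r

def relu_bounds(lo, hi):
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)

# If the final output box cannot reach an unsafe set, that safety property
# holds for every input in the box; a hierarchy of unsafe sets then yields
# the multiple safety levels the paper assesses.
```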
- The dark deep side of DeepSeek: Fine-tuning attacks against the safety alignment of CoT-enabled models [10.524960491460945]
Fine-tuning attacks can exploit large language models to reveal potentially harmful behaviours. This paper investigates the performance of the Chain of Thought based reasoning model, DeepSeek, when subjected to fine-tuning attacks. We aim to shed light on the vulnerability of Chain of Thought enabled models to fine-tuning attacks and the implications for their safety and ethical deployment.
arXiv Detail & Related papers (2025-02-03T10:28:26Z)
- SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior [56.10557932893919]
We present SafetyAnalyst, a novel AI safety moderation framework. Given an AI behavior, SafetyAnalyst uses chain-of-thought reasoning to analyze its potential consequences. It aggregates effects into a harmfulness score using 28 fully interpretable weight parameters. (A sketch of such a weighted score follows this entry.)
arXiv Detail & Related papers (2024-10-22T03:38:37Z)
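A minimal sketch of what an interpretable aggregate can look like: a weighted sum over analyzed effect features. The feature layout and weights here are placeholders, not SafetyAnalyst's released parameters.

```python
# Interpretable harmfulness aggregation (illustrative placeholder).
import numpy as np

def harmfulness_score(effects: np.ndarray, weights: np.ndarray) -> float:
    """effects: (28,) per-effect severity/likelihood features from CoT analysis;
    weights: (28,) human-inspectable weights, one per effect category."""
    assert effects.shape == weights.shape == (28,)
    return float(weights @ effects)
```

Because every weight maps to a named effect category, a moderator can audit or adjust the score's behavior directly, which is the steerability the framework claims.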
- SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering [56.92068213969036]
Safety alignment is indispensable for Large Language Models (LLMs) to defend against threats from malicious instructions. Recent research reveals that safety-aligned LLMs are prone to rejecting benign queries due to exaggerated safety. We propose a Safety-Conscious Activation Steering (SCANS) method to mitigate these exaggerated safety concerns.
arXiv Detail & Related papers (2024-08-21T10:01:34Z)
- A Flow-based Credibility Metric for Safety-critical Pedestrian Detection [16.663568842153065]
Safety is of utmost importance for perception in automated driving (AD).
Standard evaluation schemes utilize safety-agnostic metrics to argue sufficient detection performance.
This paper introduces a novel credibility metric, called c-flow, for pedestrian bounding boxes.
arXiv Detail & Related papers (2024-02-12T13:30:34Z)