CBMAS: Cognitive Behavioral Modeling via Activation Steering
- URL: http://arxiv.org/abs/2601.06109v1
- Date: Sat, 03 Jan 2026 13:04:14 GMT
- Title: CBMAS: Cognitive Behavioral Modeling via Activation Steering
- Authors: Ahmed H. Ismail, Anthony Kuang, Ayo Akinkugbe, Kevin Zhu, Sean O'Brien
- Abstract summary: Large language models (LLMs) often encode cognitive behaviors unpredictably across prompts, layers, and contexts. We present CBMAS, a diagnostic framework for continuous activation steering.
- Score: 5.131778762865578
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) often encode cognitive behaviors unpredictably across prompts, layers, and contexts, making them difficult to diagnose and control. We present CBMAS, a diagnostic framework for continuous activation steering, which extends cognitive bias analysis from discrete before/after interventions to interpretable trajectories. By combining steering vector construction with dense α-sweeps, logit lens-based bias curves, and layer-site sensitivity analysis, our approach can reveal tipping points where small intervention strengths flip model behavior and show how steering effects evolve across layer depth. We argue that these continuous diagnostics offer a bridge between high-level behavioral evaluation and low-level representational dynamics, contributing to the cognitive interpretability of LLMs. Lastly, we provide a CLI and datasets for various cognitive behaviors at the project repository, https://github.com/shimamooo/CBMAS.
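The diagnostic loop the abstract describes — build a steering vector, sweep its strength α densely, trace a bias curve, and locate the tipping point where behavior flips — can be sketched in NumPy. This is a hedged toy reconstruction, not the paper's implementation: the difference-of-means vector, the probe direction, and the synthetic activations are all illustrative assumptions.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Difference-of-means steering vector between two behavior conditions."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def bias_curve(h, v, probe, alphas):
    """Bias score (projection onto a probe direction) across steering strengths."""
    return np.array([float((h + a * v) @ probe) for a in alphas])

def tipping_point(alphas, scores):
    """First alpha at which the bias score crosses zero, or None."""
    flips = np.where(np.sign(scores[:-1]) != np.sign(scores[1:]))[0]
    return float(alphas[flips[0] + 1]) if len(flips) else None

rng = np.random.default_rng(0)
d = 16
pos = rng.normal(0.5, 1.0, size=(32, d))    # activations under the target behavior
neg = rng.normal(-0.5, 1.0, size=(32, d))   # activations under the opposite behavior
v = steering_vector(pos, neg)
h = neg.mean(axis=0)                        # a hidden state exhibiting the negative behavior
probe = v / np.linalg.norm(v)               # probe along the behavior axis
alphas = np.linspace(0.0, 2.0, 101)         # dense alpha-sweep
scores = bias_curve(h, v, probe, alphas)
tip = tipping_point(alphas, scores)         # small strength at which behavior flips
```

Because the probe here is aligned with the steering vector, the bias curve is monotone in α and the tipping point is simply the zero crossing; a real logit-lens bias curve would be noisier and layer-dependent.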
Related papers
- PromptCD: Test-Time Behavior Enhancement via Polarity-Prompt Contrastive Decoding [85.22047087898311]
We introduce Polarity-Prompt Contrastive Decoding (PromptCD), a test-time behavior control method that generalizes contrastive decoding to broader enhancement settings. PromptCD constructs paired positive and negative guiding prompts for a target behavior and contrasts model responses to reinforce desirable outcomes. Experiments on the "3H" alignment objectives demonstrate consistent and substantial improvements, indicating that post-trained models can achieve meaningful self-enhancement purely at test time.
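Contrastive decoding of this flavor can be illustrated on a toy next-token distribution. The combination rule below (amplifying the gap between logits obtained under the positive and negative guiding prompts) and the β parameter are a plausible sketch of the general technique, not PromptCD's actual formulation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def contrastive_logits(logits_pos, logits_neg, beta=1.0):
    """Amplify what the positive guiding prompt favors relative to the negative one."""
    return logits_pos + beta * (logits_pos - logits_neg)

# toy vocabulary of 4 tokens
logits_pos = np.array([2.0, 2.2, 0.5, 0.0])   # under the positive guiding prompt
logits_neg = np.array([2.0, 3.0, 0.5, 0.0])   # under the negative guiding prompt
adj = contrastive_logits(logits_pos, logits_neg, beta=1.5)
probs = softmax(adj)
best = int(np.argmax(probs))
```

Token 1 would win under the positive prompt alone, but because the negative prompt favors it even more strongly, the contrast pushes decoding toward token 0 — the token whose advantage is specific to the desired behavior.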
arXiv Detail & Related papers (2026-02-24T08:56:52Z) - From Latent Signals to Reflection Behavior: Tracing Meta-Cognitive Activation Trajectory in R1-Style LLMs [48.33546389897804]
R1-style LLMs have attracted growing attention for their capacity for self-reflection, yet the internal mechanisms underlying such behavior remain unclear. Using the logit lens to read out token-level semantics, we uncover a structured progression. Our findings suggest a human-like metacognitive process, progressing from latent monitoring to discourse-level regulation and finally to overt self-reflection.
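The logit-lens readout used in this line of work applies the model's final normalization and unembedding to intermediate hidden states, turning each layer's residual stream into a token distribution. A minimal sketch with a toy 4-token model — the identity unembedding and simplified RMSNorm are assumptions for illustration only:

```python
import numpy as np

def logit_lens(hidden, W_U):
    """Read token logits off an intermediate hidden state by applying the
    final norm and unembedding early (simplified RMSNorm, no learned scale)."""
    h = hidden / np.sqrt((hidden ** 2).mean() + 1e-6)
    return h @ W_U

# toy model: 4-dim residual stream, 4-token vocabulary, identity unembedding
W_U = np.eye(4)
e = np.eye(4)
# hidden states at increasing depth: the model's "belief" drifts from token 1 to token 3
layers = [e[1] + 0.2 * e[3],
          0.8 * e[1] + 0.8 * e[3],
          0.2 * e[1] + e[3]]
preds = [int(np.argmax(logit_lens(h, W_U))) for h in layers]
# preds traces an activation trajectory across depth: [1, 1, 3]
```

Tracing `preds` across many layers is what yields the kind of layer-wise trajectory both this paper and CBMAS analyze.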
arXiv Detail & Related papers (2026-02-02T11:58:24Z) - Mechanistic Indicators of Steering Effectiveness in Large Language Models [3.635648354808971]
Activation-based steering enables Large Language Models to exhibit targeted behaviors by intervening on intermediate activations without retraining. Despite its widespread use, the mechanistic factors that govern when steering succeeds or fails remain poorly understood. We investigate whether the reliability of steering can be diagnosed using internal model signals.
arXiv Detail & Related papers (2026-02-02T06:56:22Z) - Vision-Language Introspection: Mitigating Overconfident Hallucinations in MLLMs via Interpretable Bi-Causal Steering [33.3588824921581]
We introduce Vision-Language Introspection (VLI), a training-free inference framework that simulates a metacognitive self-correction process. VLI achieves state-of-the-art performance on advanced models, reducing object hallucination rates by 12.67% on MMHal-Bench and improving accuracy by 5.8% on POPE.
arXiv Detail & Related papers (2026-01-08T17:49:13Z) - Cognitive Mirrors: Exploring the Diverse Functional Roles of Attention Heads in LLM Reasoning [54.12174882424842]
Large language models (LLMs) have achieved state-of-the-art performance in a variety of tasks, but remain largely opaque in terms of their internal mechanisms. We propose a novel interpretability framework to systematically analyze the roles and behaviors of attention heads.
arXiv Detail & Related papers (2025-12-03T10:24:34Z) - Activation-Space Personality Steering: Hybrid Layer Selection for Stable Trait Control in LLMs [10.99947795031516]
Large Language Models exhibit implicit personalities in their generation, but reliably controlling or aligning these traits to meet specific needs remains an open challenge. We propose a novel pipeline that extracts hidden state activations from transformer layers using the Big Five Personality Traits. Our findings reveal that personality traits occupy a low-rank shared subspace, and that these latent structures can be transformed into actionable mechanisms for effective steering.
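The low-rank claim can be checked mechanically: stack the per-trait steering directions into a matrix and inspect its singular-value spectrum. The synthetic directions below are constructed to share a planted 2-dim subspace, which the 99%-energy rank estimate then recovers; the dimensions, noise level, and threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64
# planted 2-dim orthonormal basis secretly shared by all five trait directions
basis = np.linalg.qr(rng.normal(size=(d, 2)))[0]
coeffs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0], [2.0, 1.0]])
trait_dirs = coeffs @ basis.T + 1e-3 * rng.normal(size=(5, d))

# singular spectrum of the stacked Big Five-style directions
s = np.linalg.svd(trait_dirs, compute_uv=False)
energy = np.cumsum(s ** 2) / np.sum(s ** 2)
effective_rank = int(np.searchsorted(energy, 0.99) + 1)  # smallest k with >= 99% energy
```

On real activations the spectrum would decay more gradually, but a sharp knee like this is the signature of a shared low-rank trait subspace.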
arXiv Detail & Related papers (2025-10-29T05:56:39Z) - RAD: Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis [56.373297358647655]
Retrieval-Augmented Diagnosis (RAD) is a novel framework that injects external knowledge into multimodal models directly on downstream tasks. RAD operates through three key mechanisms: retrieval and refinement of disease-centered knowledge from multiple medical sources, a guideline-enhanced contrastive loss transformer, and a dual decoder.
arXiv Detail & Related papers (2025-09-24T10:36:14Z) - Fuzzy, Symbolic, and Contextual: Enhancing LLM Instruction via Cognitive Scaffolding [3.553493344868413]
We study how prompt-level inductive biases influence the cognitive behavior of large language models (LLMs) in instructional dialogue. We introduce a symbolic scaffolding method paired with a short-term memory schema designed to promote adaptive, structured reasoning. Preliminary results show that our full system consistently outperforms baseline variants.
arXiv Detail & Related papers (2025-08-28T20:46:13Z) - Revealing Neurocognitive and Behavioral Patterns by Unsupervised Manifold Learning from Dynamic Brain Data [29.522638794625536]
This paper introduces a generalizable unsupervised deep manifold learning approach for exploration of neurocognitive and behavioral patterns. The proposed Brain-dynamic Convolutional-Network-based Embedding (BCNE) seeks to capture brain-state trajectories by deciphering temporospatial correlations within the data. The results, both visual and quantitative, reveal a diverse array of intriguing and interpretable patterns.
arXiv Detail & Related papers (2025-08-07T23:36:52Z) - Dynamic Programming Techniques for Enhancing Cognitive Representation in Knowledge Tracing [125.75923987618977]
We propose the Cognitive Representation Dynamic Programming based Knowledge Tracing (CRDP-KT) model. It applies a dynamic programming algorithm to optimize cognitive representations based on question difficulty and the performance intervals between questions. This provides more accurate and systematic input features for subsequent model training, thereby minimizing distortion in the simulation of cognitive states.
arXiv Detail & Related papers (2025-06-03T14:44:48Z) - Interpretable Spatio-Temporal Embedding for Brain Structural-Effective Network with Ordinary Differential Equation [56.34634121544929]
In this study, we first construct the brain-effective network via the dynamic causal model.
We then introduce an interpretable graph learning framework termed Spatio-Temporal Embedding ODE (STE-ODE)
This framework incorporates specifically designed directed node embedding layers, aiming at capturing the dynamic interplay between structural and effective networks.
arXiv Detail & Related papers (2024-05-21T20:37:07Z) - Interpretable Imitation Learning with Dynamic Causal Relations [65.18456572421702]
We propose to expose captured knowledge in the form of a directed acyclic causal graph.
We also design this causal discovery process to be state-dependent, enabling it to model the dynamics in latent causal graphs.
The proposed framework is composed of three parts: a dynamic causal discovery module, a causality encoding module, and a prediction module, and is trained in an end-to-end manner.
arXiv Detail & Related papers (2023-09-30T20:59:42Z) - An Explainable Model for EEG Seizure Detection based on Connectivity Features [0.0]
We propose to learn a deep neural network that detects whether a particular data window belongs to a seizure or not.
Treating the data as a sequence of ten sub-windows, we design a deep learning model using attention, CNN, BiLSTM, and fully connected layers.
Our best model architecture achieved 97.03% accuracy on a balanced MIT-BIH data subset.
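The ten-sub-window preprocessing this entry describes amounts to a reshape of each data window into a length-10 sequence for the recurrent part of the model. The channel count, window length, and array layout below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def make_subwindows(x, n_sub=10):
    """Split each data window into n_sub equal sub-windows along the time axis."""
    n_windows, n_channels, t = x.shape
    assert t % n_sub == 0, "window length must divide evenly into sub-windows"
    # (windows, channels, time) -> (windows, sub-windows, channels, sub-window time)
    return x.reshape(n_windows, n_channels, n_sub, t // n_sub).transpose(0, 2, 1, 3)

x = np.zeros((4, 18, 1000))   # 4 windows, 18 EEG channels, 1000 time samples each
seq = make_subwindows(x)      # length-10 sequence per window for the BiLSTM stage
```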
arXiv Detail & Related papers (2020-09-26T11:07:30Z)