Activation-Space Personality Steering: Hybrid Layer Selection for Stable Trait Control in LLMs
- URL: http://arxiv.org/abs/2511.03738v1
- Date: Wed, 29 Oct 2025 05:56:39 GMT
- Title: Activation-Space Personality Steering: Hybrid Layer Selection for Stable Trait Control in LLMs
- Authors: Pranav Bhandari, Nicolas Fay, Sanjeevan Selvaganapathy, Amitava Datta, Usman Naseem, Mehwish Nasim
- Abstract summary: Large Language Models exhibit implicit personalities in their generation, but reliably controlling or aligning these traits to meet specific needs remains an open challenge. We propose a novel pipeline that extracts hidden state activations from transformer layers using the Big Five Personality Traits. Our findings reveal that personality traits occupy a low-rank shared subspace, and that these latent structures can be transformed into actionable mechanisms for effective steering.
- Score: 10.99947795031516
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models exhibit implicit personalities in their generation, but reliably controlling or aligning these traits to meet specific needs remains an open challenge. Effective mechanisms for behavioural manipulation of the model during generation are a critical gap in the literature. Personality-aware LLMs offer a promising direction towards this objective. However, the relationship between these psychological constructs and their representations within LLMs remains underexplored, as does the use of these representations to steer model behaviour. We propose a novel pipeline that extracts hidden state activations from transformer layers using the Big Five Personality Traits (Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism), a comprehensive and empirically validated framework for modelling human personality; applies low-rank subspace discovery methods; and identifies trait-specific optimal layers across different model architectures for robust injection. The resulting personality-aligned directions are then operationalised through a flexible steering framework with dynamic layer selection, enabling precise control of trait expression in LLM outputs. Our findings reveal that personality traits occupy a low-rank shared subspace, and that these latent structures can be transformed into actionable steering mechanisms through careful perturbations, without impacting fluency, variance, or general capabilities, helping to bridge the gap between psychological theory and practical model alignment.
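To make the pipeline concrete, here is a minimal sketch of this style of activation extraction and steering, assuming a HuggingFace-style decoder-only model (GPT-2 as a small stand-in) and using a simple mean-difference direction in place of the paper's low-rank subspace discovery; the prompt sets, layer index, and steering strength are illustrative placeholders, not the authors' implementation.

```python
# Hedged sketch: contrastive activation extraction plus an additive
# steering hook. GPT-2 is a small stand-in for the paper's models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

LAYER = 6    # a trait-specific layer would be selected empirically
ALPHA = 4.0  # steering strength; too large degrades fluency

def mean_activation(prompts, layer):
    """Mean last-token hidden state after transformer block `layer`."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            hs = model(**ids).hidden_states[layer + 1]  # (1, seq, d_model)
        acts.append(hs[0, -1])
    return torch.stack(acts).mean(dim=0)

# Toy contrastive prompt sets for one Big Five trait (Extraversion);
# a mean-difference direction stands in for low-rank subspace discovery.
high = ["I love meeting new people and talking for hours."]
low = ["I prefer quiet evenings alone with a book."]
direction = mean_activation(high, LAYER) - mean_activation(low, LAYER)
direction = direction / direction.norm()

def steer(module, inputs, output):
    """Forward hook: add the trait direction to the block's output."""
    if isinstance(output, tuple):
        return (output[0] + ALPHA * direction,) + output[1:]
    return output + ALPHA * direction

handle = model.transformer.h[LAYER].register_forward_hook(steer)
ids = tok("Tell me about your weekend.", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=40)[0]))
handle.remove()
```

In the paper's framing, the injection layer would be chosen per trait by the hybrid layer-selection step rather than fixed as above.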
Related papers
- PERSONA: Dynamic and Compositional Inference-Time Personality Control via Activation Vector Algebra [84.59328460968872]
Current methods for personality control in Large Language Models rely on static prompting or expensive fine-tuning. We introduce PERSONA, a training-free framework that achieves fine-tuning-level performance through direct manipulation of personality vectors. On PersonalityBench, our approach achieves a mean score of 9.60, nearly matching the supervised fine-tuning upper bound of 9.61 without any gradient updates.
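As a rough illustration of what inference-time "activation vector algebra" can look like, here is a hedged sketch that composes per-trait directions into a single additive steering vector; the random stand-in vectors and weights are placeholders, not PERSONA's released method.

```python
# Hedged sketch: compositional vector algebra over per-trait directions.
# Random unit vectors stand in for directions extracted from activations.
import torch
import torch.nn.functional as F

d_model = 768
torch.manual_seed(0)
traits = {name: F.normalize(torch.randn(d_model), dim=0)
          for name in ("openness", "extraversion", "neuroticism")}

def compose(weights):
    """Weighted sum of trait directions; positive weights amplify a
    trait, negative weights suppress it."""
    return sum(w * traits[t] for t, w in weights.items())

# e.g. strongly extraverted, slightly less neurotic persona, added to a
# chosen layer's hidden states at inference time (no gradient updates).
steering_vector = compose({"extraversion": 3.0, "neuroticism": -1.5})
print(steering_vector.shape)  # torch.Size([768])
```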
arXiv Detail & Related papers (2026-02-17T15:47:58Z)
- Structured Personality Control and Adaptation for LLM Agents [11.050618253938126]
Large Language Models (LLMs) are increasingly shaping human-computer interaction (HCI). We present a framework that models LLM personality via Jungian psychological types. This design allows the agent to maintain nuanced traits while dynamically adjusting to interaction demands.
arXiv Detail & Related papers (2026-01-15T03:15:24Z)
- Generative Human-Object Interaction Detection via Differentiable Cognitive Steering of Multi-modal LLMs [85.69785384599827]
Human-object interaction (HOI) detection aims to localize human-object pairs and the interactions between them. Existing methods operate under a closed-world assumption, treating the task as a classification problem over a small, predefined verb set. We propose GRASP-HO, a novel Generative Reasoning And Steerable Perception framework that reformulates HOI detection from a closed-set classification task into an open-vocabulary generation problem.
arXiv Detail & Related papers (2025-12-19T14:41:50Z)
- From Narrative to Action: A Hierarchical LLM-Agent Framework for Human Mobility Generation [3.242664635630543]
Large language models (LLMs) show potential, but struggle to balance creative reasoning with strict structural compliance. This study proposes a Hierarchical LLM-Agent Framework that integrates high-level narrative reasoning with mid-level reflective planning. This research advances synthetic mobility generation from a data-driven paradigm to a spatially driven simulation.
arXiv Detail & Related papers (2025-10-28T00:26:36Z)
- The Personality Illusion: Revealing Dissociation Between Self-Reports & Behavior in LLMs [60.15472325639723]
Personality traits have long been studied as predictors of human behavior. Recent advances in Large Language Models (LLMs) suggest similar patterns may emerge in artificial systems.
arXiv Detail & Related papers (2025-09-03T21:27:10Z)
- IROTE: Human-like Traits Elicitation of Large Language Model via In-Context Self-Reflective Optimization [66.6349183886101]
We propose IROTE, a novel in-context method for stable and transferable trait elicitation. We show that a single IROTE-generated self-reflection can induce LLMs' stable impersonation of the target trait across diverse downstream tasks.
arXiv Detail & Related papers (2025-08-12T08:04:28Z)
- SAC: A Framework for Measuring and Inducing Personality Traits in LLMs with Dynamic Intensity Control [1.9282110216621835]
Large language models (LLMs) have gained significant traction across a wide range of fields in recent years. There is also a growing expectation for them to display human-like personalities during interactions. Most existing models face two major limitations: they rely on the Big Five (OCEAN) framework, which only provides coarse personality dimensions, and they lack mechanisms for controlling trait intensity.
arXiv Detail & Related papers (2025-06-26T04:12:15Z)
- Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control [44.326363467045496]
Large Language Models (LLMs) have become a critical area of research in Reinforcement Learning from Human Feedback (RLHF). Representation engineering offers a new, training-free approach: it leverages semantic features to control the representation of an LLM's intermediate hidden states. However, it is difficult to encode various semantic contents, like honesty and safety, into a singular semantic feature.
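A hedged sketch of the sparse-control idea the title names: rather than folding several semantic contents into one feature, keep a separate direction per dimension and apply each at its own layer; the layers, vectors, and strengths below are invented for illustration, not the paper's values.

```python
# Hedged sketch: one direction per semantic dimension, each injected at
# a different layer so honesty and safety do not share one feature.
import torch
import torch.nn.functional as F

d_model = 768
torch.manual_seed(1)
controls = {
    "honesty": {"layer": 10, "alpha": 2.0,
                "vec": F.normalize(torch.randn(d_model), dim=0)},
    "safety":  {"layer": 18, "alpha": 1.0,
                "vec": F.normalize(torch.randn(d_model), dim=0)},
}

def apply_sparse_control(hidden_by_layer):
    """hidden_by_layer: dict layer_index -> (batch, seq, d_model) tensor.
    Adds each control direction only at its designated layer."""
    for cfg in controls.values():
        h = hidden_by_layer[cfg["layer"]]
        hidden_by_layer[cfg["layer"]] = h + cfg["alpha"] * cfg["vec"]
    return hidden_by_layer

# Toy usage with zero activations at the two controlled layers.
demo = {10: torch.zeros(1, 4, d_model), 18: torch.zeros(1, 4, d_model)}
demo = apply_sparse_control(demo)
```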
arXiv Detail & Related papers (2024-11-04T08:36:03Z)
- Neuron-based Personality Trait Induction in Large Language Models [115.08894603023712]
Large language models (LLMs) have become increasingly proficient at simulating various personality traits.
We present a neuron-based approach for personality trait induction in LLMs.
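As a rough contrast with direction-based steering, a neuron-level intervention rescales individual units rather than adding a full-width vector; the neuron indices and gain in this sketch are hypothetical, not those identified by the paper.

```python
# Hedged sketch: scale individual MLP neurons associated with a trait
# instead of adding a full-width direction. Illustrative only.
import torch

d_mlp = 3072
trait_neurons = [17, 254, 1001]  # hypothetical trait-relevant indices
gain = 5.0                       # amplification factor for those units

mask = torch.ones(d_mlp)
mask[trait_neurons] = gain

def induce(mlp_activations):
    """Element-wise rescaling of MLP activations: (batch, seq, d_mlp)."""
    return mlp_activations * mask

out = induce(torch.randn(1, 4, d_mlp))  # toy usage
```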
arXiv Detail & Related papers (2024-10-16T07:47:45Z)
- Exploring the Personality Traits of LLMs through Latent Features Steering [12.142248881876355]
We investigate how factors such as cultural norms and environmental stressors, encoded within large language models (LLMs), shape their personality traits. We propose a training-free approach to modify the model's behavior by extracting and steering latent features corresponding to those factors within the model.
arXiv Detail & Related papers (2024-10-07T21:02:34Z)
- PersLLM: A Personified Training Approach for Large Language Models [66.16513246245401]
We propose PersLLM, a framework for better data construction and model tuning. For insufficient data usage, we incorporate strategies such as Chain-of-Thought prompting and anti-induction. For rigid behavior patterns, we design the tuning process and introduce automated DPO to enhance the specificity and dynamism of the models' personalities.
arXiv Detail & Related papers (2024-07-17T08:13:22Z)
- Tuning-Free Accountable Intervention for LLM Deployment -- A Metacognitive Approach [55.613461060997004]
Large Language Models (LLMs) have catalyzed transformative advances across a spectrum of natural language processing tasks.
We propose an innovative metacognitive approach, dubbed CLEAR, to equip LLMs with capabilities for self-aware error identification and correction.
arXiv Detail & Related papers (2024-03-08T19:18:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.