Related papers: Facet-Level Persona Control by Trait-Activated Routing with Contrastive SAE for Role-Playing LLMs

Facet-Level Persona Control by Trait-Activated Routing with Contrastive SAE for Role-Playing LLMs

URL: http://arxiv.org/abs/2602.19157v1
Date: Sun, 22 Feb 2026 12:39:02 GMT
Title: Facet-Level Persona Control by Trait-Activated Routing with Contrastive SAE for Role-Playing LLMs
Authors: Wenqiu Tang, Zhen Wan, Takahiro Komamizu, Ichiro Ide,
Abstract summary: Personality control in Role-Playing Agents (RPAs) is commonly achieved via training-free methods.<n>We propose a contrastive Sparse AutoEncoder framework that learns facet-level personality control vectors aligned with the Big Five 30-facet model.
Score: 6.715533531385597
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Personality control in Role-Playing Agents (RPAs) is commonly achieved via training-free methods that inject persona descriptions and memory through prompts or retrieval-augmented generation, or via supervised fine-tuning (SFT) on persona-specific corpora. While SFT can be effective, it requires persona-labeled data and retraining for new roles, limiting flexibility. In contrast, prompt- and RAG-based signals are easy to apply but can be diluted in long dialogues, leading to drifting and sometimes inconsistent persona behavior. To address this, we propose a contrastive Sparse AutoEncoder (SAE) framework that learns facet-level personality control vectors aligned with the Big Five 30-facet model. A new 15,000-sample leakage-controlled corpus is constructed to provide balanced supervision for each facet. The learned vectors are integrated into the model's residual space and dynamically selected by a trait-activated routing module, enabling precise and interpretable personality steering. Experiments on Large Language Models (LLMs) show that the proposed method maintains stable character fidelity and output quality across contextualized settings, outperforming Contrastive Activation Addition (CAA) and prompt-only baselines. The combined SAE+Prompt configuration achieves the best overall performance, confirming that contrastively trained latent vectors can enhance persona control while preserving dialogue coherence.

Related papers

PromptCD: Test-Time Behavior Enhancement via Polarity-Prompt Contrastive Decoding [85.22047087898311]
We introduce Polarity-Prompt Contrastive Decoding (PromptCD), a test-time behavior control method that generalizes contrastive decoding to broader enhancement settings.<n>PromptCD constructs paired positive and negative guiding prompts for a target behavior and contrasts model responses to reinforce desirable outcomes.<n>Experiments on the "3H" alignment objectives demonstrate consistent and substantial improvements, indicating that post-trained models can achieve meaningful self-enhancement purely at test time.
arXiv Detail & Related papers (2026-02-24T08:56:52Z)
PERSONA: Dynamic and Compositional Inference-Time Personality Control via Activation Vector Algebra [84.59328460968872]
Current methods for personality control in Large Language Models rely on static prompting or expensive fine-tuning.<n>We introduce PERSONA, a training-free framework that achieves fine-tuning level performance through direct manipulation of personality vectors.<n>On PersonalityBench, our approach achieves a mean score of 9.60, nearly matching the supervised fine-tuning upper bound of 9.61 without any gradient updates.
arXiv Detail & Related papers (2026-02-17T15:47:58Z)
RISER: Orchestrating Latent Reasoning Skills for Adaptive Activation Steering [62.63376387138257]
We propose a plug-and-play intervention framework that adaptively steers large language models (LLMs) reasoning in activation space.<n>RISER constructs a library of reusable reasoning vectors and employs a lightweight Router to dynamically compose them for each input.<n>The Router is optimized via reinforcement learning under task-level rewards, activating latent cognitive primitives in an emergent and compositional manner.
arXiv Detail & Related papers (2026-01-14T08:04:33Z)
Steer Model beyond Assistant: Controlling System Prompt Strength via Contrastive Decoding [33.569783099301695]
Large language models excel at complex instructions yet struggle to deviate from their helpful assistant persona.<n>We introduce system prompt strength, a training-free method that treats prompt adherence as a continuous control.
arXiv Detail & Related papers (2026-01-10T02:56:38Z)
Activation-Space Personality Steering: Hybrid Layer Selection for Stable Trait Control in LLMs [10.99947795031516]
Large Language Models exhibit implicit personalities in their generation, but reliably controlling or aligning these traits to meet specific needs remains an open challenge.<n>We propose a novel pipeline that extracts hidden state activations from transformer layers using the Big Five Personality Traits.<n>Our findings reveal that personality traits occupy a low-rank shared subspace, and that these latent structures can be transformed into actionable mechanisms for effective steering.
arXiv Detail & Related papers (2025-10-29T05:56:39Z)
Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents [58.00130492861884]
TraitBasis is a lightweight, model-agnostic method for systematically stress testing AI agents.<n>TraitBasis learns directions in activation space corresponding to steerable user traits.<n>We observe on average a 2%-30% performance degradation on $tau$-Trait across frontier models.
arXiv Detail & Related papers (2025-10-06T05:03:57Z)
Personality as a Probe for LLM Evaluation: Method Trade-offs and Downstream Effects [0.6087817758152709]
We present a systematic study of personality control using the Big Five traits.<n>Trait-level analysis shows openness as uniquely challenging, agreeableness as most resistant to ICL.<n>Experiments on Gemma-2-2B-IT and LLaMA-3-8B-Instruct reveal clear trade-offs.
arXiv Detail & Related papers (2025-09-05T04:19:15Z)
GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs [56.93583799109029]
GrAInS is an inference-time steering approach that operates across both language-only and vision-language models and tasks.<n>During inference, GrAInS hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale.<n>It consistently outperforms both fine-tuning and existing steering baselines.
arXiv Detail & Related papers (2025-07-24T02:34:13Z)
LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation [102.1527101235251]
LangTraj is a language-conditioned scene-diffusion model that simulates the joint behavior of all agents in traffic scenarios.<n>By conditioning on natural language inputs, LangTraj provides flexible and intuitive control over interactive behaviors.<n>LangTraj demonstrates strong performance in realism, language controllability, and language-conditioned safety-critical simulation.
arXiv Detail & Related papers (2025-04-15T17:14:06Z)
SMART: Self-supervised Multi-task pretrAining with contRol Transformers [34.604339091596884]
Self-supervised pretraining has been extensively studied in language and vision domains. It is difficult to properly design such a pretraining approach for sequential decision-making tasks. We propose a generic pretraining framework for sequential decision making.
arXiv Detail & Related papers (2023-01-24T05:01:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.