Localizing Persona Representations in LLMs
- URL: http://arxiv.org/abs/2505.24539v2
- Date: Tue, 03 Jun 2025 08:45:28 GMT
- Title: Localizing Persona Representations in LLMs
- Authors: Celia Cintas, Miriam Rateike, Erik Miehling, Elizabeth Daly, Skyler Speakman
- Abstract summary: We study how and where personas are encoded in the representation space of large language models (LLMs). We observe overlapping activations for specific ethical perspectives, such as moral nihilism and utilitarianism. In contrast, political ideologies like conservatism and liberalism appear to be represented in more distinct regions.
- Score: 5.828323647048382
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present a study on how and where personas -- defined by distinct sets of human characteristics, values, and beliefs -- are encoded in the representation space of large language models (LLMs). Using a range of dimension reduction and pattern recognition methods, we first identify the model layers that show the greatest divergence in encoding these representations. We then analyze the activations within a selected layer to examine how specific personas are encoded relative to others, including their shared and distinct embedding spaces. We find that, across multiple pre-trained decoder-only LLMs, the analyzed personas show large differences in representation space only within the final third of the decoder layers. We observe overlapping activations for specific ethical perspectives -- such as moral nihilism and utilitarianism -- suggesting a degree of polysemy. In contrast, political ideologies like conservatism and liberalism appear to be represented in more distinct regions. These findings help to improve our understanding of how LLMs internally represent information and can inform future efforts in refining the modulation of specific human traits in LLM outputs. Warning: This paper includes potentially offensive sample statements.
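The abstract describes a two-step analysis: collect persona-conditioned activations at every decoder layer, locate the layers where personas diverge most, and then inspect the embedding space of a chosen late layer. The sketch below is a minimal illustration of that workflow under stated assumptions, not the authors' code: the model name (gpt2), the example persona statements, the linear probe used as a divergence proxy, and the PCA projection are all placeholders for demonstration.

```python
# Minimal sketch of per-layer persona analysis (assumptions: gpt2 as the
# decoder-only LLM, toy persona statements, a linear probe as a divergence
# proxy, and PCA for the layer-level embedding inspection).
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

MODEL_NAME = "gpt2"  # assumption: any pre-trained decoder-only LLM

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

# Hypothetical persona-labelled statements (the paper uses curated persona sets).
texts = [
    ("Nothing is inherently right or wrong.", "moral_nihilism"),
    ("Moral claims are merely expressions of preference.", "moral_nihilism"),
    ("The right action is the one that maximizes overall well-being.", "utilitarianism"),
    ("We should weigh actions by the happiness they produce.", "utilitarianism"),
    ("Tradition and established institutions deserve preservation.", "conservatism"),
    ("Change should be gradual and respect inherited norms.", "conservatism"),
    ("Individual liberty and social reform should be expanded.", "liberalism"),
    ("Government should actively protect civil rights and equality.", "liberalism"),
]

@torch.no_grad()
def layer_activations(text):
    """Return one mean-pooled activation vector per decoder layer."""
    enc = tokenizer(text, return_tensors="pt")
    out = model(**enc)
    # out.hidden_states: tuple of (num_layers + 1) tensors of shape [1, seq, hidden];
    # index [1:] skips the embedding layer output.
    return [h.mean(dim=1).squeeze(0).numpy() for h in out.hidden_states[1:]]

features = [layer_activations(t) for t, _ in texts]
labels = [lab for _, lab in texts]
num_layers = len(features[0])

# Step 1: per-layer divergence proxy -- cross-validated accuracy of a linear
# probe distinguishing personas. With so few samples this is only schematic;
# a real run needs many statements per persona.
for layer in range(num_layers):
    X = np.stack([f[layer] for f in features])
    score = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=2).mean()
    print(f"layer {layer:02d}: probe accuracy {score:.2f}")

# Step 2: project a chosen late layer to 2-D to inspect shared vs. distinct regions.
chosen = num_layers - 1
X_late = np.stack([f[chosen] for f in features])
coords = PCA(n_components=2).fit_transform(X_late)
for (text, lab), (x, y) in zip(texts, coords):
    print(f"{lab:>15s}: ({x:+.2f}, {y:+.2f})")
```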
Related papers
- From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning [52.32745233116143]
Humans organize knowledge into compact categories through semantic compression. Large Language Models (LLMs) demonstrate remarkable linguistic abilities. But whether their internal representations strike a human-like trade-off between compression and semantic fidelity is unclear.
arXiv Detail & Related papers (2025-05-21T16:29:00Z) - Linear Representations of Political Perspective Emerge in Large Language Models [2.2462222233189286]
Large language models (LLMs) have demonstrated the ability to generate text that realistically reflects a range of different subjective human perspectives. This paper studies how LLMs are seemingly able to reflect more liberal versus more conservative viewpoints among other political perspectives in American politics.
arXiv Detail & Related papers (2025-03-03T21:59:01Z) - Analyze the Neurons, not the Embeddings: Understanding When and Where LLM Representations Align with Humans [3.431979707540646]
This work introduces a novel approach to the study of representation alignment. We adopt a method from research on activation steering to identify neurons responsible for specific concepts. Our findings reveal that LLM representations closely align with human representations inferred from behavioral data.
arXiv Detail & Related papers (2025-02-20T23:08:03Z) - Large Language Models as Neurolinguistic Subjects: Discrepancy in Performance and Competence for Form and Meaning [49.60849499134362]
This study investigates the linguistic understanding of Large Language Models (LLMs) regarding signifier (form) and signified (meaning). We introduce a neurolinguistic approach, utilizing a novel method that combines minimal pair and diagnostic probing to analyze activation patterns across model layers. We find that: (1) psycholinguistic and neurolinguistic methods reveal that language performance and competence are distinct; (2) direct probability measurement may not accurately assess linguistic competence; and (3) instruction tuning does little to change competence but improves performance.
arXiv Detail & Related papers (2024-11-12T04:16:44Z) - Large Language Models Reflect the Ideology of their Creators [71.65505524599888]
Large language models (LLMs) are trained on vast amounts of data to generate natural language. This paper shows that the ideological stance of an LLM appears to reflect the worldview of its creators.
arXiv Detail & Related papers (2024-10-24T04:02:30Z) - Hate Personified: Investigating the role of LLMs in content moderation [64.26243779985393]
For subjective tasks such as hate detection, where people perceive hate differently, the ability of Large Language Models (LLMs) to represent diverse groups is unclear.
By including additional context in prompts, we analyze LLMs' sensitivity to geographical priming, persona attributes, and numerical information to assess how well the needs of various groups are reflected.
arXiv Detail & Related papers (2024-10-03T16:43:17Z) - Beyond Human Norms: Unveiling Unique Values of Large Language Models through Interdisciplinary Approaches [69.73783026870998]
This work proposes a novel framework, ValueLex, to reconstruct Large Language Models' unique value system from scratch.
Based on the Lexical Hypothesis, ValueLex introduces a generative approach to elicit diverse values from 30+ LLMs.
We identify three core value dimensions, Competence, Character, and Integrity, each with specific subdimensions, revealing that LLMs possess a structured, albeit non-human, value system.
arXiv Detail & Related papers (2024-04-19T09:44:51Z) - High-Dimension Human Value Representation in Large Language Models [60.33033114185092]
We propose UniVaR, a high-dimensional neural representation of symbolic human value distributions in LLMs. This is a continuous and scalable representation, self-supervised from the value-relevant output of 8 LLMs. We explore how LLMs prioritize different values in 25 languages and cultures, shedding light on the complex interplay between human values and language modeling.
arXiv Detail & Related papers (2024-04-11T16:39:00Z) - Sparsity-Guided Holistic Explanation for LLMs with Interpretable Inference-Time Intervention [53.896974148579346]
Large Language Models (LLMs) have achieved unprecedented breakthroughs in various natural language processing domains.
The enigmatic "black-box" nature of LLMs remains a significant challenge for interpretability, hampering transparent and accountable applications.
We propose a novel methodology anchored in sparsity-guided techniques, aiming to provide a holistic interpretation of LLMs.
arXiv Detail & Related papers (2023-12-22T19:55:58Z)