The Linguistic Architecture of Reflective Thought: Evaluation of a Large Language Model as a Tool to Isolate the Formal Structure of Mentalization
- URL: http://arxiv.org/abs/2512.08945v1
- Date: Thu, 20 Nov 2025 23:51:34 GMT
- Title: The Linguistic Architecture of Reflective Thought: Evaluation of a Large Language Model as a Tool to Isolate the Formal Structure of Mentalization
- Authors: Stefano Epifani, Giuliano Castigliego, Laura Kecskemeti, Giuliano Razzicchia, Elisabeth Seiwald-Sonderegger,
- Abstract summary: Mentalization integrates cognitive, affective, and intersubjective components. Large Language Models (LLMs) display an increasing ability to generate reflective texts.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Background: Mentalization integrates cognitive, affective, and intersubjective components. Large Language Models (LLMs) display an increasing ability to generate reflective texts, raising questions regarding the relationship between linguistic form and mental representation. This study assesses the extent to which a single LLM can reproduce the linguistic structure of mentalization according to the parameters of Mentalization-Based Treatment (MBT). Methods: Fifty dialogues were generated between human participants and an LLM configured in standard mode. Five psychiatrists trained in MBT, working under blinded conditions, evaluated the mentalization profiles produced by the model along the four MBT axes, assigning Likert-scale scores for evaluative coherence, argumentative coherence, and global quality. Inter-rater agreement was estimated using ICC(3,1). Results: Mean scores (3.63-3.98) and moderate standard deviations indicate a high level of structural coherence in the generated profiles. ICC values (0.60-0.84) show substantial-to-high agreement among raters. The model proved more stable in the Implicit-Explicit and Self-Other dimensions, while presenting limitations in the integration of internal states and external contexts. The profiles were coherent and clinically interpretable yet characterized by affective neutrality.
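The agreement statistic used here, ICC(3,1), is the Shrout-Fleiss two-way mixed-effects, consistency, single-rater intraclass correlation. A minimal sketch of how it is computed from ANOVA mean squares follows; the 50 x 5 rating matrix is simulated for illustration, since the study's raw ratings are not reproduced here.

```python
import numpy as np

def icc_3_1(ratings: np.ndarray) -> float:
    """ICC(3,1): two-way mixed effects, consistency, single rater.

    ratings: (n_targets, k_raters) matrix, e.g. 50 dialogues x 5 raters.
    Formula (Shrout & Fleiss, 1979): (MSR - MSE) / (MSR + (k - 1) * MSE)
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-dialogue means
    col_means = ratings.mean(axis=0)   # per-rater means

    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()   # between targets
    ss_cols = n * ((col_means - grand) ** 2).sum()   # between raters
    ss_err = ss_total - ss_rows - ss_cols            # residual

    msr = ss_rows / (n - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse)

# Simulated example: 50 dialogues rated by 5 psychiatrists on a 1-5 Likert scale.
rng = np.random.default_rng(0)
true_quality = rng.normal(3.8, 0.5, size=(50, 1))   # per-dialogue signal
scores = np.clip(np.round(true_quality + rng.normal(0, 0.4, (50, 5))), 1, 5)
print(f"ICC(3,1) = {icc_3_1(scores):.2f}")
```

Using the consistency form (3,1) means systematic differences in rater leniency do not count against agreement, which suits a fixed panel of trained raters like the one described above.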
Related papers
- On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation [88.77441715819366]
Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content. We propose a variety of likelihood- and generative-based evaluation methods that serve in place of naive global token perplexity.
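A toy sketch of one aggregation pitfall with pooled token-level perplexity: a single exponentiated mean over all tokens weights long sequences more heavily than averaging per-sequence perplexities does. The log-probabilities below are invented stand-ins; the paper's actual analysis of spoken language models is richer.

```python
import numpy as np

# Hypothetical per-token log-probabilities for two continuations of
# different lengths (in practice these come from a language model).
log_probs = [
    np.array([-1.2, -0.8, -1.5]),               # short, fairly likely sequence
    np.array([-0.5, -4.0, -0.6, -0.7, -3.8]),   # longer, with surprising tokens
]

# Global token perplexity: one exp over the pooled mean log-prob.
pooled = np.concatenate(log_probs)
global_ppl = np.exp(-pooled.mean())

# Per-sequence perplexities, then averaged: weights sequences, not tokens.
per_seq_ppl = [np.exp(-lp.mean()) for lp in log_probs]

print(f"global token perplexity : {global_ppl:.2f}")
print(f"mean per-sequence ppl   : {np.mean(per_seq_ppl):.2f}")
# The two aggregates can rank models differently, which is the kind of
# fallacy the paper above targets.
```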
arXiv Detail & Related papers (2026-01-09T22:01:56Z)
- HeartBench: Probing Core Dimensions of Anthropomorphic Intelligence in LLMs [20.794341575633503]
HeartBench is a framework designed to evaluate the integrated emotional, cultural, and ethical dimensions of Chinese Large Language Models (LLMs). Even leading models achieve only 60% of the expert-defined ideal score. Analysis using a difficulty-stratified "Hard Set" reveals a significant performance decay in scenarios involving subtle emotional subtexts and complex ethical trade-offs.
arXiv Detail & Related papers (2025-12-26T03:54:56Z)
- I Am Aligned, But With Whom? MENA Values Benchmark for Evaluating Cultural Alignment and Multilingual Bias in LLMs [5.060243371992739]
We introduce MENAValues, a novel benchmark designed to evaluate the cultural alignment and multilingual biases of large language models (LLMs). Drawing from large-scale, authoritative human surveys, we curate a structured dataset that captures the sociocultural landscape of MENA with population-level response distributions from 16 countries. Our analysis reveals three critical phenomena: "Cross-Lingual Value Shifts", where identical questions yield drastically different responses based on language; "Reasoning-Induced Degradation", where prompting models to explain their reasoning worsens cultural alignment; and "Logit Leakage", where models refuse sensitive questions while internal probabilities reveal strong hidden preferences.
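The "Logit Leakage" phenomenon can be probed mechanically by reading the model's next-token distribution over answer options rather than its sampled text. A minimal sketch using Hugging Face transformers with GPT-2 as a small stand-in model; the prompt and options are invented, and the benchmark's actual items and models differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is only a stand-in; the paper studies larger instruction-tuned LLMs.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Question: Do you agree with the statement? Answer (A or B):"
options = [" A", " B"]  # leading space matters for GPT-2's BPE tokenizer

with torch.no_grad():
    logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]

probs = torch.softmax(logits, dim=-1)
for opt in options:
    ids = tok.encode(opt)
    print(f"P({opt.strip()!r}) = {probs[ids[0]].item():.4f}")  # first-token prob
# Even when the sampled text refuses to answer, these probabilities can
# still encode a clear preference: the leakage the benchmark measures.
```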
arXiv Detail & Related papers (2025-10-15T05:10:57Z)
- On the Entity-Level Alignment in Crosslingual Consistency [62.33186691736433]
SubSub and SubInj integrate English translations of subjects into prompts across languages, leading to substantial gains in factual recall accuracy and consistency. These interventions reinforce entity representation alignment in the conceptual space through the model's internal pivot-language processing.
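Both interventions operate purely at the prompt level. A toy illustration of the difference; the German template and subject are invented examples, and the paper's exact prompt formats may differ.

```python
# SubSub: substitute the subject with its English translation.
# SubInj: inject the English translation alongside the original subject.
subject_de, subject_en = "Eiffelturm", "Eiffel Tower"
template_de = "Der {subj} befindet sich in der Stadt"

prompt_plain  = template_de.format(subj=subject_de)
prompt_subsub = template_de.format(subj=subject_en)                      # substitution
prompt_subinj = template_de.format(subj=f"{subject_de} ({subject_en})")  # injection

for name, p in [("plain", prompt_plain), ("SubSub", prompt_subsub),
                ("SubInj", prompt_subinj)]:
    print(f"{name:7s}: {p}")
```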
arXiv Detail & Related papers (2025-10-11T16:26:50Z)
- Analyzing Latent Concepts in Code Language Models [10.214183897113118]
We propose Code Concept Analysis (CoCoA): a global post-hoc interpretability framework. CoCoA uncovers emergent lexical, syntactic, and semantic structures in a code language model's representation space. We propose a hybrid annotation pipeline that combines static analysis tool-based syntactic alignment with prompt-engineered large language models.
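One common recipe for this kind of latent concept analysis, sketched here without claiming it matches CoCoA's exact pipeline, is to cluster token-level hidden states and then label the resulting clusters. Random vectors stand in for real activations below.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in data: in a real analysis these would be hidden-state vectors for
# individual tokens from a code language model (one layer's activations).
rng = np.random.default_rng(1)
tokens = ["def", "return", "if", "else", "foo", "bar", "+", "-", "*", "=="]
hidden = rng.normal(size=(len(tokens), 768))  # (n_tokens, hidden_dim)

# Cluster token representations; each cluster is a candidate "latent concept"
# to be annotated afterwards (keyword, operator, identifier, ...).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(hidden)
for c in range(3):
    members = [t for t, lab in zip(tokens, km.labels_) if lab == c]
    print(f"concept {c}: {members}")
```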
arXiv Detail & Related papers (2025-10-01T03:53:21Z)
- Does Language Model Understand Language? [1.0450509067356148]
Despite advances in natural language generation and understanding, LMs still struggle with fine-grained linguistic phenomena. In this study, we conduct an evaluation of SOTA language models across challenging contexts in both English and Bengali. Our findings highlight Compound-Beta as the most balanced model, consistently achieving high correlations and low MAEs across diverse language conditions.
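The headline metrics here are correlation with human judgments and mean absolute error. A minimal sketch with invented scores standing in for the study's real items:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical data: human judgments vs. a model's scores for ten test items.
human = np.array([4.5, 3.0, 1.5, 5.0, 2.0, 4.0, 3.5, 1.0, 4.8, 2.5])
model = np.array([4.2, 3.4, 2.0, 4.7, 2.6, 3.8, 3.1, 1.8, 4.5, 2.9])

r, p = pearsonr(human, model)           # linear agreement in ranking/scale
mae = np.abs(human - model).mean()      # average absolute deviation
print(f"Pearson r = {r:.2f} (p = {p:.3f}), MAE = {mae:.2f}")
```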
arXiv Detail & Related papers (2025-09-15T21:09:09Z)
- Measuring How LLMs Internalize Human Psychological Concepts: A preliminary analysis [0.0]
We develop a framework to assess concept alignment between Large Language Models and human psychological dimensions. A GPT-4 model achieved superior classification accuracy (66.2%), significantly outperforming GPT-3.5 (55.9%) and BERT (48.1%). Our findings demonstrate that modern LLMs can approximate human psychological constructs with measurable accuracy.
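The classification-accuracy comparison can be reproduced in outline: represent each psychological test item with a given model's embedding, then measure how well a simple classifier recovers the human-assigned construct. Synthetic embeddings stand in for real model outputs below.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in embeddings: in the paper's setting, each vector would encode a
# test item under a given model (BERT, GPT-3.5, GPT-4, ...), and labels
# would be the human-assigned psychological dimension.
rng = np.random.default_rng(2)
n_items, dim, n_constructs = 120, 64, 3
labels = rng.integers(0, n_constructs, n_items)
centers = rng.normal(size=(n_constructs, dim))
embeddings = centers[labels] + rng.normal(scale=2.0, size=(n_items, dim))

# Cross-validated accuracy as an alignment proxy: the better the embeddings
# separate human-defined constructs, the higher the score.
clf = LogisticRegression(max_iter=1000)
acc = cross_val_score(clf, embeddings, labels, cv=5).mean()
print(f"cross-validated accuracy: {acc:.1%}")
```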
arXiv Detail & Related papers (2025-06-29T01:56:56Z)
- Comparing Human Expertise and Large Language Models Embeddings in Content Validity Assessment of Personality Tests [0.0]
We explore the application of Large Language Models (LLMs) in assessing the content validity of psychometric instruments. Using both human expert evaluations and advanced LLMs, we compared the accuracy of semantic item-construct alignment. The results reveal distinct strengths and limitations of human and AI approaches.
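Semantic item-construct alignment is typically scored by cosine similarity between an item's embedding and each construct definition's embedding. A minimal sketch with random stand-in vectors; real vectors would come from the same embedding model for both items and definitions.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in vectors for three construct definitions and one test item.
rng = np.random.default_rng(3)
construct_names = ["extraversion", "conscientiousness", "neuroticism"]
construct_vecs = rng.normal(size=(3, 128))
item_vec = construct_vecs[0] + rng.normal(scale=0.5, size=128)  # "extraversion" item

# Assign the item to the construct whose definition it is most similar to.
sims = [cosine(item_vec, cv) for cv in construct_vecs]
for name, s in zip(construct_names, sims):
    print(f"{name:17s} similarity = {s:.2f}")
print(f"assigned construct: {construct_names[int(np.argmax(sims))]}")
```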
arXiv Detail & Related papers (2025-03-15T10:54:35Z)
- Large Language Models as Neurolinguistic Subjects: Discrepancy between Performance and Competence [49.60849499134362]
This study investigates the linguistic understanding of Large Language Models (LLMs) regarding signifier (form) and signified (meaning). We introduce a neurolinguistic approach, utilizing a novel method that combines minimal pairs and diagnostic probing to analyze activation patterns across model layers. We found that: (1) psycholinguistic and neurolinguistic methods reveal that language performance and competence are distinct; (2) direct probability measurement may not accurately assess linguistic competence; and (3) instruction tuning does little to change competence but does improve performance.
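The minimal-pair half of this method (the "performance" side) compares a model's probabilities for a grammatical sentence and its minimally different ungrammatical twin; the layer-wise diagnostic probing (the "competence" side) is not sketched here. A minimal example using GPT-2 as a stand-in model and a classic agreement pair not taken from the paper:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # small stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(text: str) -> float:
    """Total log-probability of a sentence under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean NLL over predicted tokens
    return -loss.item() * (ids.shape[1] - 1)

# A classic subject-verb agreement minimal pair.
good = "The keys to the cabinet are on the table."
bad = "The keys to the cabinet is on the table."
lp_good, lp_bad = sentence_logprob(good), sentence_logprob(bad)
print(f"log P(good) = {lp_good:.1f}, log P(bad) = {lp_bad:.1f}")
print("model prefers grammatical variant:", lp_good > lp_bad)
```

Finding (2) above is exactly the caveat for this sketch: a correct preference on such pairs measures performance, not necessarily underlying competence.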
arXiv Detail & Related papers (2024-11-12T04:16:44Z)
- Evaluating Large Language Models with Psychometrics [59.821829073478376]
This paper offers a comprehensive benchmark for quantifying psychological constructs of Large Language Models (LLMs). Our work identifies five key psychological constructs -- personality, values, emotional intelligence, theory of mind, and self-efficacy -- assessed through a suite of 13 datasets. We uncover significant discrepancies between LLMs' self-reported traits and their response patterns in real-world scenarios, revealing complexities in their behaviors.
arXiv Detail & Related papers (2024-06-25T16:09:08Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- Model Criticism for Long-Form Text Generation [113.13900836015122]
We apply a statistical tool, model criticism in latent space, to evaluate the high-level structure of generated text.
We perform experiments on three representative aspects of high-level discourse -- coherence, coreference, and topicality.
We find that transformer-based language models are able to capture topical structures but have a harder time maintaining structural coherence or modeling coreference.
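In outline, model criticism in latent space fits a reference density to latent representations of human-written text and asks whether generated text looks plausible under it. A loose sketch of that idea, not the paper's exact procedure, with synthetic vectors standing in for document latents:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Stand-in latents: in practice these would be latent representations of
# documents (e.g., sentence embeddings) for human-written and generated text.
rng = np.random.default_rng(4)
real_train = rng.normal(0.0, 1.0, size=(500, 8))
real_heldout = rng.normal(0.0, 1.0, size=(100, 8))
generated = rng.normal(0.4, 1.3, size=(100, 8))  # slightly off-distribution

# Fit a simple density model to real latents, then check whether generated
# latents are a plausible sample from it (the model-criticism step).
dist = multivariate_normal(mean=real_train.mean(0), cov=np.cov(real_train.T))
print(f"mean log-lik, held-out real: {dist.logpdf(real_heldout).mean():.2f}")
print(f"mean log-lik, generated    : {dist.logpdf(generated).mean():.2f}")
# A sizeable gap flags high-level structure (coherence, topicality) that
# the generator fails to reproduce.
```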
arXiv Detail & Related papers (2022-10-16T04:35:58Z)