Reasoning Like Experts: Leveraging Multimodal Large Language Models for Drawing-based Psychoanalysis
- URL: http://arxiv.org/abs/2510.19451v1
- Date: Wed, 22 Oct 2025 10:29:14 GMT
- Title: Reasoning Like Experts: Leveraging Multimodal Large Language Models for Drawing-based Psychoanalysis
- Authors: Xueqi Ma, Yanbei Jiang, Sarah Erfani, James Bailey, Weifeng Liu, Krista A. Ehinger, Jey Han Lau
- Abstract summary: PICK is a multi-step framework designed for Psychoanalytical Image Comprehension through hierarchical analysis and knowledge injection. It focuses on the House-Tree-Person (HTP) Test, a widely used psychological assessment in clinical practice. Our approach bridges the gap between MLLMs and specialized expert domains, offering a structured and interpretable framework for understanding human mental states through visual expression.
- Score: 38.98188484491387
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) have demonstrated exceptional performance across various objective multimodal perception tasks, yet their application to subjective, emotionally nuanced domains, such as psychological analysis, remains largely unexplored. In this paper, we introduce PICK, a multi-step framework designed for Psychoanalytical Image Comprehension through hierarchical analysis and Knowledge injection with MLLMs, specifically focusing on the House-Tree-Person (HTP) Test, a widely used psychological assessment in clinical practice. First, we decompose drawings containing multiple instances into semantically meaningful sub-drawings, constructing a hierarchical representation that captures spatial structure and content across three levels: single-object level, multi-object level, and whole level. Next, we analyze these sub-drawings at each level with a targeted focus, extracting psychological or emotional insights from their visual cues. We also introduce an HTP knowledge base and design a feature extraction module, trained with reinforcement learning, to generate a psychological profile for single-object level analysis. This profile captures both holistic stylistic features and dynamic object-specific features (such as those of the house, tree, or person), correlating them with psychological states. Finally, we integrate this multi-faceted information to produce a well-informed assessment that aligns with expert-level reasoning. Our approach bridges the gap between MLLMs and specialized expert domains, offering a structured and interpretable framework for understanding human mental states through visual expression. Experimental results demonstrate that the proposed PICK significantly enhances the capability of MLLMs in psychological analysis. It is further validated as a general framework through extensions to emotion understanding tasks.
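The three-level analysis described in the abstract can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the authors' implementation: the class names (`SubDrawing`, `HierarchicalDrawing`), the `analyze` function, and the prompts are all hypothetical, and a stub stands in for the real MLLM. It shows only the overall flow of decomposing a drawing and querying each level with a targeted focus before merging the results.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class SubDrawing:
    label: str    # e.g. "house", "tree", "person"
    region: tuple # bounding box (x0, y0, x1, y1), hypothetical format

@dataclass
class HierarchicalDrawing:
    whole: str                                   # id of the full drawing
    objects: list = field(default_factory=list)  # single-object level
    groups: list = field(default_factory=list)   # multi-object level

def analyze(drawing: HierarchicalDrawing,
            mllm: Callable[[str], str]) -> dict:
    """Query each level with a targeted prompt and collect the insights."""
    insights = {}
    # Single-object level: one focused query per sub-drawing.
    insights["objects"] = {
        obj.label: mllm(f"Describe psychological cues in the {obj.label}.")
        for obj in drawing.objects
    }
    # Multi-object level: spatial relations between grouped objects.
    insights["groups"] = [
        mllm(f"Interpret the relation between {' and '.join(g)}.")
        for g in drawing.groups
    ]
    # Whole level: global composition and style.
    insights["whole"] = mllm("Assess the overall composition.")
    return insights

# Usage with a stub in place of a real MLLM call:
stub = lambda prompt: f"insight for: {prompt}"
d = HierarchicalDrawing(
    whole="htp_001",
    objects=[SubDrawing("house", (0, 0, 50, 50)),
             SubDrawing("tree", (60, 0, 100, 80))],
    groups=[["house", "tree"]],
)
report = analyze(d, stub)
```

In the paper's actual pipeline the single-object level additionally consults an HTP knowledge base and an RL-trained feature extraction module, which this sketch omits entirely.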
Related papers
- Projective Psychological Assessment of Large Multimodal Models Using Thematic Apperception Tests [5.119837168333715]
This study examines whether the personality traits of Large Multimodal Models (LMMs) can be assessed through non-language-based modalities. Evaluators demonstrated an excellent ability to understand and analyze TAT responses.
arXiv Detail & Related papers (2026-02-19T06:08:33Z)
- Toward Cognitive Supersensing in Multimodal Large Language Model [67.15559571626747]
We introduce Cognitive Supersensing, a training paradigm that endows MLLMs with human-like visual imagery capabilities. In experiments, MLLMs trained with Cognitive Supersensing significantly outperform state-of-the-art baselines on CogSense-Bench. We will open-source the CogSense-Bench and our model weights.
arXiv Detail & Related papers (2026-02-02T02:19:50Z)
- From Visual Perception to Deep Empathy: An Automated Assessment Framework for House-Tree-Person Drawings Using Multimodal LLMs and Multi-Agent Collaboration [18.359999860873426]
The House-Tree-Person drawing test, introduced by John Buck in 1948, remains a widely used projective technique in clinical psychology. It has long faced challenges such as heterogeneous scoring standards, reliance on examiners' subjective experience, and a lack of a unified quantitative coding system. The proposed multi-agent framework, by dividing roles, decouples feature recognition from psychological inference and offers a new paradigm for digital mental-health services.
arXiv Detail & Related papers (2025-12-23T09:26:23Z)
- Cognitive Mirrors: Exploring the Diverse Functional Roles of Attention Heads in LLM Reasoning [54.12174882424842]
Large language models (LLMs) have achieved state-of-the-art performance in a variety of tasks, but remain largely opaque in terms of their internal mechanisms. We propose a novel interpretability framework to systematically analyze the roles and behaviors of attention heads.
arXiv Detail & Related papers (2025-12-03T10:24:34Z)
- From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models [66.36007274540113]
Multimodal Large Language Models (MLLMs) strive to achieve a profound, human-like understanding of and interaction with the physical world. However, they often exhibit a shallow and incoherent integration when acquiring information (Perception) and conducting reasoning (Cognition). This survey introduces a novel and unified analytical framework: "From Perception to Cognition".
arXiv Detail & Related papers (2025-09-29T18:25:40Z)
- Hyperphantasia: A Benchmark for Evaluating the Mental Visualization Capabilities of Multimodal LLMs [22.46006112029019]
Mental visualization is a critical cognitive skill in humans, supporting abilities such as spatial navigation, predicting physical trajectories, and solving complex visual problems through imaginative simulation. We introduce Hyperphantasia, a synthetic benchmark designed to evaluate the mental visualization abilities of Multimodal Large Language Models (MLLMs) through four carefully constructed puzzles. Our comprehensive evaluation of state-of-the-art models reveals a substantial gap between the performance of humans and MLLMs.
arXiv Detail & Related papers (2025-07-16T05:54:37Z)
- MADP: Multi-Agent Deductive Planning for Enhanced Cognitive-Behavioral Mental Health Question Answer [7.738135970011351]
We propose a framework named Multi-Agent Deductive Planning (MADP). MADP is based on the interactions between the various psychological elements of Cognitive Behavioral Therapy (CBT). We construct a new dataset based on the MADP framework and use it to fine-tune Large Language Models (LLMs).
arXiv Detail & Related papers (2025-01-27T07:18:47Z)
- Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models [50.98559225639266]
We investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories. Our findings reveal that multi-layer features provide complementary strengths with varying task dependencies, and uniform fusion leads to suboptimal performance. We propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions.
arXiv Detail & Related papers (2024-12-26T05:41:31Z)
- MOSABench: Multi-Object Sentiment Analysis Benchmark for Evaluating Multimodal Large Language Models Understanding of Complex Image [16.040813949620958]
We introduce MOSABench, a novel evaluation dataset designed specifically for multi-object sentiment analysis. Key innovations in MOSABench include distance-based target annotation, post-processing for evaluation to standardize outputs, and an improved scoring mechanism. This research underscores the need for MLLMs to enhance accuracy in complex, multi-object sentiment analysis tasks.
arXiv Detail & Related papers (2024-11-25T09:00:36Z)
- Evaluating Large Language Models with Psychometrics [59.821829073478376]
This paper offers a comprehensive benchmark for quantifying psychological constructs of Large Language Models (LLMs). Our work identifies five key psychological constructs -- personality, values, emotional intelligence, theory of mind, and self-efficacy -- assessed through a suite of 13 datasets. We uncover significant discrepancies between LLMs' self-reported traits and their response patterns in real-world scenarios, revealing complexities in their behaviors.
arXiv Detail & Related papers (2024-06-25T16:09:08Z)
- MOPT: Multi-Object Panoptic Tracking [33.77171216778909]
We introduce a novel perception task denoted as multi-object panoptic tracking (MOPT)
MOPT allows for exploiting pixel-level semantic information of 'thing' and 'stuff' classes, temporal coherence, and pixel-level associations over time.
We present extensive quantitative and qualitative evaluations of both vision-based and LiDAR-based MOPT that demonstrate encouraging results.
arXiv Detail & Related papers (2020-04-17T11:45:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.