Measuring Psychological Depth in Language Models
- URL: http://arxiv.org/abs/2406.12680v2
- Date: Fri, 04 Oct 2024 10:30:44 GMT
- Title: Measuring Psychological Depth in Language Models
- Authors: Fabrice Harel-Canada, Hanyu Zhou, Sreya Muppalla, Zeynep Yildiz, Miryung Kim, Amit Sahai, Nanyun Peng
- Abstract summary: We introduce the Psychological Depth Scale (PDS), a novel framework rooted in literary theory that measures an LLM's ability to produce authentic and narratively complex stories.
We empirically validate our framework by showing that humans can consistently evaluate stories based on PDS (0.72 Krippendorff's alpha).
Surprisingly, GPT-4 stories either surpassed highly-rated human-written stories sourced from Reddit or were statistically indistinguishable from them.
- Abstract: Evaluations of creative stories generated by large language models (LLMs) often focus on objective properties of the text, such as its style, coherence, and diversity. While these metrics are indispensable, they do not speak to a story's subjective, psychological impact from a reader's perspective. We introduce the Psychological Depth Scale (PDS), a novel framework rooted in literary theory that measures an LLM's ability to produce authentic and narratively complex stories that provoke emotion, empathy, and engagement. We empirically validate our framework by showing that humans can consistently evaluate stories based on PDS (0.72 Krippendorff's alpha). We also explore techniques for automating the PDS to easily scale future analyses. GPT-4o, combined with a novel Mixture-of-Personas (MoP) prompting strategy, achieves an average Spearman correlation of 0.51 with human judgment, while Llama-3-70B with constrained decoding scores as high as 0.68 for empathy. Finally, we compared the depth of stories authored by both humans and LLMs. Surprisingly, GPT-4 stories either surpassed highly-rated human-written stories sourced from Reddit or were statistically indistinguishable from them. By shifting the focus from text to reader, the Psychological Depth Scale is a validated, automated, and systematic means of measuring the capacity of LLMs to connect with humans through the stories they tell.
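The paper's two headline validation statistics, Krippendorff's alpha for inter-rater agreement and Spearman correlation between automated and human scores, are standard to compute. Below is a minimal sketch using the third-party `krippendorff` package and SciPy; the ratings are illustrative placeholders, not data from the study.

```python
import numpy as np
import krippendorff
from scipy.stats import spearmanr

# Rows = raters, columns = stories; entries are 1-5 PDS-style ratings.
# np.nan marks a story a rater did not score.
ratings = np.array([
    [4, 3, 5, 2, 4],
    [4, 4, 5, 2, 3],
    [3, 3, 4, np.nan, 4],
])

# Ordinal level of measurement suits Likert-style ratings.
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.2f}")

# Rank correlation between mean human ratings and hypothetical
# automated PDS scores produced by an LLM judge.
human_means = np.nanmean(ratings, axis=0)
llm_scores = [4.0, 3.5, 4.5, 2.0, 3.5]
rho, p_value = spearmanr(human_means, llm_scores)
print(f"Spearman correlation: {rho:.2f} (p = {p_value:.3f})")
```

A high alpha on the human ratings justifies treating their mean as a reference signal, which is what the Spearman correlation then compares the automated judge against.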
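The Mixture-of-Personas (MoP) prompting strategy is named but not specified in the abstract; one plausible reading is that the judge LLM rates a story from several distinct reader personas whose scores are then aggregated. The sketch below illustrates that reading only; the personas, prompt wording, and `query_llm` helper are hypothetical, not the paper's actual prompts.

```python
from statistics import mean

# Hypothetical reader personas for the judging loop.
PERSONAS = [
    "a literary critic attentive to narrative craft",
    "an empathetic casual reader",
    "a psychology graduate student",
]

def query_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call (e.g., to GPT-4o)."""
    raise NotImplementedError

def mop_score(story: str, dimension: str = "empathy") -> float:
    """Average one PDS dimension's rating across personas."""
    scores = []
    for persona in PERSONAS:
        prompt = (
            f"You are {persona}. Read the story below and rate its "
            f"{dimension} on a 1-5 scale. Reply with a single integer.\n\n"
            f"{story}"
        )
        scores.append(int(query_llm(prompt).strip()))
    return mean(scores)  # aggregate persona judgments into one score
```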
Related papers
- The Essence of Contextual Understanding in Theory of Mind: A Study on Question Answering with Story Characters [67.61587661660852]
Theory-of-Mind (ToM) allows humans to understand and interpret the mental states of others.
In this paper, we verify the importance of understanding long personal backgrounds in ToM.
We assess machines' ToM capabilities in realistic evaluation scenarios.
arXiv Detail & Related papers (2025-01-03T09:04:45Z) - PhDGPT: Introducing a psychometric and linguistic dataset about how large language models perceive graduate students and professors in psychology [0.3749861135832073]
This work introduces PhDGPT, a prompting framework and synthetic dataset that encapsulates the machine psychology of PhD researchers and professors.
The dataset consists of 756,000 datapoints: 300 iterations repeated across 15 academic events, 2 biological genders, 2 career levels, and 42 unique item responses of the Depression, Anxiety, and Stress Scale (DASS-42), i.e., 300 × 15 × 2 × 2 × 42 = 756,000.
By combining network psychometrics and psycholinguistic dimensions, this study identifies several similarities and distinctions between human and LLM data.
arXiv Detail & Related papers (2024-11-06T20:04:20Z) - Judgment of Learning: A Human Ability Beyond Generative Artificial Intelligence [0.0]
Large language models (LLMs) increasingly mimic human cognition in various language-based tasks.
We introduce a cross-agent prediction model to assess whether ChatGPT-based LLMs align with human judgments of learning (JOL).
Our results revealed that while human JOL reliably predicted actual memory performance, none of the tested LLMs demonstrated comparable predictive accuracy.
arXiv Detail & Related papers (2024-10-17T09:42:30Z) - Are Large Language Models Capable of Generating Human-Level Narratives? [114.34140090869175]
This paper investigates the capability of LLMs in storytelling, focusing on narrative development and plot progression.
We introduce a novel computational framework to analyze narratives through three discourse-level aspects.
We show that explicit integration of discourse features can enhance storytelling, as demonstrated by an over 40% improvement in neural storytelling.
arXiv Detail & Related papers (2024-07-18T08:02:49Z) - Quantifying AI Psychology: A Psychometrics Benchmark for Large Language Models [57.518784855080334]
Large Language Models (LLMs) have demonstrated exceptional task-solving capabilities, increasingly adopting roles akin to those of human assistants.
This paper presents a framework for investigating psychological dimensions in LLMs, covering psychological identification, assessment dataset curation, and assessment with results validation.
We introduce a comprehensive psychometrics benchmark for LLMs that covers six psychological dimensions: personality, values, emotion, theory of mind, motivation, and intelligence.
arXiv Detail & Related papers (2024-06-25T16:09:08Z) - Towards a Psychology of Machines: Large Language Models Predict Human Memory [0.0]
Large language models (LLMs) have shown remarkable abilities in natural language processing.
This study explores whether LLMs can predict human memory performance in tasks involving garden-path sentences and contextual information.
arXiv Detail & Related papers (2024-03-08T08:41:14Z) - PsychoGAT: A Novel Psychological Measurement Paradigm through Interactive Fiction Games with LLM Agents [68.50571379012621]
Psychological measurement is essential for mental health, self-understanding, and personal development.
PsychoGAT (Psychological Game AgenTs) achieves statistically significant improvements in psychometric metrics such as reliability, convergent validity, and discriminant validity.
arXiv Detail & Related papers (2024-02-19T18:00:30Z) - PsyCoT: Psychological Questionnaire as Powerful Chain-of-Thought for Personality Detection [50.66968526809069]
We propose a novel personality detection method, called PsyCoT, which mimics the way individuals complete psychological questionnaires through multi-turn dialogue.
Our experiments demonstrate that PsyCoT significantly improves the performance and robustness of GPT-3.5 in personality detection.
arXiv Detail & Related papers (2023-10-31T08:23:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.