Related papers: PsyEval: A Suite of Mental Health Related Tasks for Evaluating Large Language Models

Reasoning Is Not All You Need: Examining LLMs for Multi-Turn Mental Health Conversations [13.064927179032756]
We introduce MedAgent, a novel framework for synthetically generating realistic, multi-turn mental health sensemaking conversations. We present MultiSenseEval, a holistic framework to evaluate the multi-turn conversation abilities of LLMs in healthcare settings.
arXiv Detail & Related papers (2025-05-26T16:42:02Z)
ProMind-LLM: Proactive Mental Health Care via Causal Reasoning with Sensor Data [5.961343130822046]
Mental health risk is a critical global public health challenge. With their continued development, large language models (LLMs) stand out as a promising tool for explainable mental health care applications. This paper introduces ProMind-LLM, an innovative approach integrating objective behavior data as complementary information alongside subjective mental records.
arXiv Detail & Related papers (2025-05-20T07:36:28Z)
Ψ-Arena: Interactive Assessment and Optimization of LLM-based Psychological Counselors with Tripartite Feedback [51.26493826461026]
We propose Psi-Arena, an interactive framework for comprehensive assessment and optimization of large language models (LLMs). Psi-Arena features realistic arena interactions that simulate real-world counseling through multi-stage dialogues with psychologically profiled NPC clients. Experiments across eight state-of-the-art LLMs show significant performance variations in different real-world scenarios and evaluation perspectives.
arXiv Detail & Related papers (2025-05-06T08:22:51Z)
Humanizing LLMs: A Survey of Psychological Measurements with Tools, Datasets, and Human-Agent Applications [25.38031971196831]
Large language models (LLMs) are increasingly used in human-centered tasks. Assessing their psychological traits is crucial for understanding their social impact and ensuring trustworthy AI alignment. This study aims to propose future directions for developing more interpretable, robust, and generalizable psychological assessment frameworks for LLMs.
arXiv Detail & Related papers (2025-04-30T06:09:40Z)
Med-CoDE: Medical Critique based Disagreement Evaluation Framework [72.42301910238861]
The reliability and accuracy of large language models (LLMs) in medical contexts remain critical concerns. Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance. We propose Med-CoDE, a specifically designed evaluation framework for medical LLMs to address these challenges.
arXiv Detail & Related papers (2025-04-21T16:51:11Z)
Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
PsychBench: A comprehensive and professional benchmark for evaluating the performance of LLM-assisted psychiatric clinical practice [20.166682569070073]
Large Language Models (LLMs) offer potential solutions to address problems such as shortage of medical resources and low diagnostic consistency in psychiatric clinical practice. We propose a benchmarking system, PsychBench, to evaluate the practical performance of LLMs in psychiatric clinical settings. We show that while existing models demonstrate significant potential, they are not yet adequate as decision-making tools in psychiatric clinical practice.
arXiv Detail & Related papers (2025-02-28T12:17:41Z)
Is your LLM trapped in a Mental Set? Investigative study on how mental sets affect the reasoning capabilities of LLMs [8.920202114368843]
We present an investigative study on how Mental Sets influence the reasoning capabilities of LLMs. A Mental Set is the tendency to persist with previously successful strategies, even when they become inefficient. We compare the performance of LLMs such as Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct and GPT-4o in the presence of mental sets.
arXiv Detail & Related papers (2025-01-21T02:29:15Z)
LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment. We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews. Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
arXiv Detail & Related papers (2025-01-07T08:49:04Z)
Severity Prediction in Mental Health: LLM-based Creation, Analysis, Evaluation of a Novel Multilingual Dataset [3.4146360486107987]
Large Language Models (LLMs) are increasingly integrated into various medical fields, including mental health support systems. We present a novel multilingual adaptation of widely-used mental health datasets, translated from English into six languages. This dataset enables a comprehensive evaluation of LLM performance in detecting mental health conditions and assessing their severity across multiple languages.
arXiv Detail & Related papers (2024-09-25T22:14:34Z)
PsycoLLM: Enhancing LLM for Psychological Understanding and Evaluation [27.575675130769437]
We propose a specialized psychological large language model (LLM), named PsycoLLM, trained on a proposed high-quality psychological dataset. We construct multi-turn dialogues through a three-step pipeline comprising generation, evidence judgment, and refinement. To compare the performance of PsycoLLM with other LLMs, we develop a comprehensive psychological benchmark based on authoritative psychological counseling examinations in China.
arXiv Detail & Related papers (2024-07-08T08:25:56Z)
Quantifying AI Psychology: A Psychometrics Benchmark for Large Language Models [57.518784855080334]
Large Language Models (LLMs) have demonstrated exceptional task-solving capabilities, increasingly adopting roles akin to human-like assistants. This paper presents a framework for investigating psychological dimensions in LLMs, including psychological identification, assessment dataset curation, and assessment with results validation. We introduce a comprehensive psychometrics benchmark for LLMs that covers six psychological dimensions: personality, values, emotion, theory of mind, motivation, and intelligence.
arXiv Detail & Related papers (2024-06-25T16:09:08Z)
LLM Questionnaire Completion for Automatic Psychiatric Assessment [49.1574468325115]
We employ a Large Language Model (LLM) to convert unstructured psychological interviews into structured questionnaires spanning various psychiatric and personality domains. The obtained answers are coded as features, which are used to predict standardized psychiatric measures of depression (PHQ-8) and PTSD (PCL-C).
arXiv Detail & Related papers (2024-06-09T09:03:11Z)
Exploring the Efficacy of Large Language Models in Summarizing Mental Health Counseling Sessions: A Benchmark Study [17.32433545370711]
Comprehensive summaries of sessions enable an effective continuity in mental health counseling. Manual summarization presents a significant challenge, diverting experts' attention from the core counseling process. This study evaluates the effectiveness of state-of-the-art Large Language Models (LLMs) in selectively summarizing various components of therapy sessions.
arXiv Detail & Related papers (2024-02-29T11:29:47Z)
F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods [102.98899881389211]
We propose F-Eval, a bilingual evaluation benchmark to evaluate the fundamental abilities, including expression, commonsense and logic. For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z)
Large Language Models in Mental Health Care: a Scoping Review [37.20036635036122]
This review aims to deliver a comprehensive analysis of Large Language Models (LLMs) utilization in mental health care. A systematic search was performed across multiple databases including PubMed, Web of Science, Google Scholar, arXiv, medRxiv, and PsyArXiv.
arXiv Detail & Related papers (2024-01-01T17:35:52Z)
Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data [42.965788205842465]
We present a comprehensive evaluation of multiple large language models (LLMs) on various mental health prediction tasks. We conduct experiments covering zero-shot prompting, few-shot prompting, and instruction fine-tuning. Our best-finetuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 by 10.9% on balanced accuracy and the best of GPT-4 (250 and 150 times bigger) by 4.8%.
arXiv Detail & Related papers (2023-07-26T06:00:50Z)
A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry. This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z)
Revisiting the Reliability of Psychological Scales on Large Language Models [62.57981196992073]
This study aims to determine the reliability of applying personality assessments to Large Language Models. Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory.
arXiv Detail & Related papers (2023-05-31T15:03:28Z)
Towards Interpretable Mental Health Analysis with Large Language Models [27.776003210275608]
We evaluate the mental health analysis and emotional reasoning ability of large language models (LLMs) on 11 datasets across 5 tasks. Based on prompts, we explore LLMs for interpretable mental health analysis by instructing them to generate explanations for each of their decisions. We conduct strict human evaluations to assess the quality of the generated explanations, leading to a novel dataset with 163 human-assessed explanations.
arXiv Detail & Related papers (2023-04-06T19:53:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.