VERA-MH: Reliability and Validity of an Open-Source AI Safety Evaluation in Mental Health
- URL: http://arxiv.org/abs/2602.05088v2
- Date: Fri, 06 Feb 2026 14:08:26 GMT
- Title: VERA-MH: Reliability and Validity of an Open-Source AI Safety Evaluation in Mental Health
- Authors: Kate H. Bentley, Luca Belli, Adam M. Chekroud, Emily J. Ward, Emily R. Dworkin, Emily Van Ark, Kelly M. Johnston, Will Alexander, Millard Brown, Matt Hawrilenko,
- Abstract summary: The Validation of Ethical and Responsible AI in Mental Health (VERA-MH) evaluation was recently proposed to meet the urgent need for an evidence-based, automated safety benchmark. This study aimed to examine the clinical validity and reliability of VERA-MH for evaluating AI safety in suicide risk detection and response.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Millions now use generative AI chatbots for psychological support. Despite their promise of availability and scale, the single most pressing question in AI for mental health is whether these tools are safe. The Validation of Ethical and Responsible AI in Mental Health (VERA-MH) evaluation was recently proposed to meet the urgent need for an evidence-based, automated safety benchmark. This study aimed to examine the clinical validity and reliability of VERA-MH for evaluating AI safety in suicide risk detection and response. We first simulated a large set of conversations between large language model (LLM)-based users (user-agents) and general-purpose AI chatbots. Licensed mental health clinicians used a rubric (scoring guide) to independently rate the simulated conversations for safe and unsafe chatbot behaviors, as well as user-agent realism. An LLM-based judge used the same scoring rubric to evaluate the same set of simulated conversations. We then examined rating alignment (a) among individual clinicians and (b) between clinician consensus and the LLM judge, and summarized clinicians' ratings of user-agent realism. Individual clinicians were generally consistent with one another in their safety ratings (chance-corrected inter-rater reliability [IRR] = 0.77), establishing a gold-standard clinical reference. The LLM judge was strongly aligned with this clinical consensus overall (IRR = 0.81) and within key conditions. Together, findings from this human evaluation study support the validity and reliability of VERA-MH: an open-source, automated AI safety evaluation for mental health. Future research will examine the generalizability and robustness of VERA-MH and expand the framework to target additional key areas of AI safety in mental health.
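The abstract reports chance-corrected inter-rater reliability (IRR) without naming the statistic; Cohen's kappa is the most common such measure for two raters. A minimal sketch of how a chance-corrected agreement value like the reported 0.77 is computed, assuming binary safe/unsafe labels and kappa as the statistic (the rater labels below are invented for illustration, not data from the study):

```python
from collections import Counter

def cohen_kappa(rater1, rater2):
    """Chance-corrected agreement (Cohen's kappa) between two raters.

    kappa = (p_observed - p_expected) / (1 - p_expected), where
    p_expected is the agreement rate expected from each rater's
    marginal label frequencies alone.
    """
    assert len(rater1) == len(rater2)
    n = len(rater1)
    # Raw proportion of items on which the raters agree.
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Agreement expected by chance from the marginal label counts.
    c1, c2 = Counter(rater1), Counter(rater2)
    p_expected = sum(c1[label] * c2[label] for label in c1) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Two raters labeling 10 simulated conversations (illustrative only).
a = ["safe", "safe", "unsafe", "safe", "unsafe",
     "safe", "safe", "unsafe", "safe", "safe"]
b = ["safe", "safe", "unsafe", "unsafe", "unsafe",
     "safe", "safe", "unsafe", "safe", "safe"]

print(round(cohen_kappa(a, b), 3))  # → 0.783
```

Raw agreement here is 0.90, but kappa discounts it to 0.783 because two raters who both label most items "safe" would agree often by chance alone; this is why chance-corrected IRR is the stricter and more informative figure.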
Related papers
- Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming [23.573537738272595]
We introduce an evaluation framework that pairs AI psychotherapists with simulated patient agents equipped with cognitive-affective models. We apply this framework to a high-impact test case, Alcohol Use Disorder, evaluating six AI agents. Our large-scale simulation reveals critical safety gaps in the use of AI for mental health support.
arXiv Detail & Related papers (2026-02-23T15:17:18Z)
- Responsible Evaluation of AI for Mental Health [72.85175110624736]
Current approaches to evaluating AI tools in mental health care are fragmented and poorly aligned with clinical practice, social context, and first-hand user experience. This paper argues for a rethinking of responsible evaluation by introducing an interdisciplinary framework that integrates clinical soundness, social context, and equity.
arXiv Detail & Related papers (2026-01-20T12:55:10Z)
- DispatchMAS: Fusing taxonomy and artificial intelligence agents for emergency medical services [49.70819009392778]
Large Language Models (LLMs) and Multi-Agent Systems (MAS) offer opportunities to augment dispatchers. This study aimed to develop and evaluate a taxonomy-grounded, multi-agent system for simulating realistic scenarios.
arXiv Detail & Related papers (2025-10-24T08:01:21Z)
- VERA-MH Concept Paper [0.0]
We introduce VERA-MH, an automated evaluation of the safety of AI chatbots used in mental health contexts. To fully automate the process, we used two ancillary AI agents. Simulated conversations are then passed to a judge-agent who scores them based on the rubric.
arXiv Detail & Related papers (2025-10-17T04:07:29Z)
- OpenAI's HealthBench in Action: Evaluating an LLM-Based Medical Assistant on Realistic Clinical Queries [2.2807344448218507]
We evaluate our agentic, RAG-based clinical support assistant, DR.INFO, using HealthBench. On the Hard subset of 1,000 challenging examples, DR.INFO achieves a HealthBench score of 0.51. In a separate 100-sample evaluation against similar agentic RAG assistants, it maintains a performance lead with a HealthBench score of 0.54.
arXiv Detail & Related papers (2025-08-29T09:51:41Z)
- Mentalic Net: Development of RAG-based Conversational AI and Evaluation Framework for Mental Health Support [0.0]
Mentalic Net Conversational AI has a BERT Score of 0.898, with other evaluation metrics falling within satisfactory ranges. We advocate for a human-in-the-loop approach and a long-term, responsible strategy in developing such transformative technologies.
arXiv Detail & Related papers (2025-08-27T03:44:56Z)
- Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models [87.66870367661342]
Large language models (LLMs) are used in AI applications in healthcare. A red-teaming framework that continuously stress-tests LLMs can reveal significant weaknesses in four safety-critical domains. A suite of adversarial agents is applied to autonomously mutate test cases, identify/evolve unsafe-triggering strategies, and evaluate responses. Our framework delivers an evolvable, scalable, and reliable safeguard for the next generation of medical AI.
arXiv Detail & Related papers (2025-07-30T08:44:22Z)
- MoodAngels: A Retrieval-augmented Multi-agent Framework for Psychiatry Diagnosis [58.67342568632529]
MoodAngels is the first specialized multi-agent framework for mood disorder diagnosis. MoodSyn is an open-source dataset of 1,173 synthetic psychiatric cases.
arXiv Detail & Related papers (2025-06-04T09:18:25Z)
- MAGI: Multi-Agent Guided Interview for Psychiatric Assessment [50.6150986786028]
We present MAGI, the first framework that transforms the gold-standard Mini International Neuropsychiatric Interview (MINI) into automatic computational navigation. We show that MAGI advances LLM-assisted mental health assessment by combining clinical rigor, conversational adaptability, and explainable reasoning.
arXiv Detail & Related papers (2025-04-25T11:08:27Z)
- Med-CoDE: Medical Critique based Disagreement Evaluation Framework [72.42301910238861]
The reliability and accuracy of large language models (LLMs) in medical contexts remain critical concerns. Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance. We propose Med-CoDE, a specifically designed evaluation framework for medical LLMs to address these challenges.
arXiv Detail & Related papers (2025-04-21T16:51:11Z)
- EmoAgent: Assessing and Safeguarding Human-AI Interaction for Mental Health Safety [42.052840895090284]
EmoAgent is a multi-agent AI framework designed to evaluate and mitigate mental health hazards in human-AI interactions. EmoEval simulates virtual users, including those portraying mentally vulnerable individuals, to assess mental health changes before and after interactions with AI characters. EmoGuard serves as an intermediary, monitoring users' mental status, predicting potential harm, and providing corrective feedback to mitigate risks.
arXiv Detail & Related papers (2025-04-13T18:47:22Z)
- Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references. We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z)
- AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons [62.374792825813394]
This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability. The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior in 12 hazard categories.
arXiv Detail & Related papers (2025-02-19T05:58:52Z)
- Building Trust in Mental Health Chatbots: Safety Metrics and LLM-Based Evaluation Tools [13.386012271835039]
We created an evaluation framework with 100 benchmark questions and ideal responses. This framework, validated by mental health experts, was tested on a GPT-3.5-turbo-based chatbot.
arXiv Detail & Related papers (2024-08-03T19:57:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.