TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation
- URL: http://arxiv.org/abs/2602.22775v1
- Date: Thu, 26 Feb 2026 09:11:34 GMT
- Title: TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation
- Authors: Joydeep Chandra, Satyam Kumar Navneet, Yong Zhang
- Abstract summary: We introduce TherapyProbe, a design probe methodology that generates actionable design knowledge. We contribute: (1) a replicable methodology requiring no API costs, (2) a clinically-grounded failure taxonomy, and (3) design implications for developers, clinicians, and policymakers.
- Score: 6.821769033209393
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As mental health chatbots proliferate to address the global treatment gap, a critical question emerges: How do we design for relational safety, that is, the quality of interaction patterns that unfold across conversations rather than the correctness of individual responses? Current safety evaluations assess single-turn crisis responses, missing the therapeutic dynamics that determine whether chatbots help or harm over time. We introduce TherapyProbe, a design probe methodology that generates actionable design knowledge by systematically exploring chatbot conversation trajectories through adversarial multi-agent simulation. Using open-source models, TherapyProbe surfaces relational safety failures: interaction patterns like "validation spirals" where chatbots progressively reinforce hopelessness, or "empathy fatigue" where responses become mechanical over turns. Our contribution is translating these failures into a Safety Pattern Library of 23 failure archetypes with corresponding design recommendations. We contribute: (1) a replicable methodology requiring no API costs, (2) a clinically-grounded failure taxonomy, and (3) design implications for developers, clinicians, and policymakers.
Related papers
- Do You Understand How I Feel?: Towards Verified Empathy in Therapy Chatbots [2.0452773268886126]
This paper envisions a framework integrating natural language processing and formal verification to deliver empathetic therapy chatbots. A Transformer-based model extracts dialogue features, which are then translated into a Hybrid Automaton model of dyadic therapy sessions. Empathy-related properties can then be verified through Statistical Model Checking. Preliminary results show that the formal model captures therapy dynamics with good fidelity and that ad-hoc strategies improve the probability of satisfying empathy requirements.
arXiv Detail & Related papers (2026-01-13T12:08:58Z)
- "Even GPT Can Reject Me": Conceptualizing Abrupt Refusal Secondary Harm (ARSH) and Reimagining Psychological AI Safety with Compassionate Completion Standard (CCS) [10.377213441117618]
We argue that abrupt refusals can rupture perceived relational continuity, evoke feelings of rejection or shame, and discourage future help seeking. We propose a design hypothesis, the Compassionate Completion Standard, that maintains safety constraints while preserving relational coherence. This viewpoint contributes a timely conceptual framework, articulates a testable design hypothesis, and outlines a coordinated research agenda for improving psychological safety in human AI interaction.
arXiv Detail & Related papers (2025-12-21T15:31:15Z)
- Mitigating Harmful Erraticism in LLMs Through Dialectical Behavior Therapy Based De-Escalation Strategies [0.0]
This paper hypothesizes that a framework rooted in human psychological principles, specifically therapeutic modalities, can provide a more robust and sustainable solution. Drawing an analogy to the simulated neural networks of AI mirroring the human brain, we propose the application of Dialectical Behavior Therapy (DBT) principles.
arXiv Detail & Related papers (2025-09-06T11:20:15Z)
- Reframe Your Life Story: Interactive Narrative Therapist and Innovative Moment Assessment with Large Language Models [72.36715571932696]
Narrative therapy helps individuals transform problematic life stories into empowering alternatives. Current approaches lack realism in specialized psychotherapy and fail to capture therapeutic progression over time. Int (Interactive Narrative Therapist) simulates expert narrative therapists by planning therapeutic stages, guiding reflection levels, and generating contextually appropriate expert-like responses.
arXiv Detail & Related papers (2025-07-27T11:52:09Z)
- Do We Talk to Robots Like Therapists, and Do They Respond Accordingly? Language Alignment in AI Emotional Support [6.987852837732702]
This study investigates whether concerns shared with a robot align with those shared in human-to-human (H2H) therapy sessions. We analyzed two datasets: one of interactions between users and professional therapists, and another involving supportive conversations with a social robot. Results showed that 90.88% of robot conversation disclosures could be mapped to clusters from the human therapy dataset.
arXiv Detail & Related papers (2025-06-19T17:20:30Z)
- CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy [67.23830698947637]
We propose a new benchmark, CBT-BENCH, for the systematic evaluation of cognitive behavioral therapy (CBT) assistance. We include three levels of tasks in CBT-BENCH: I: Basic CBT knowledge acquisition, with the task of multiple-choice questions; II: Cognitive model understanding, with the tasks of cognitive distortion classification, primary core belief classification, and fine-grained core belief classification; III: Therapeutic response generation, with the task of generating responses to patient speech in CBT therapy sessions. Experimental results indicate that while LLMs perform well in reciting CBT knowledge, they fall short in complex real-world scenarios.
arXiv Detail & Related papers (2024-10-17T04:52:57Z)
- Towards Mitigating Hallucination in Large Language Models via Self-Reflection [63.2543947174318]
Large language models (LLMs) have shown promise for generative and knowledge-intensive tasks including question-answering (QA) tasks.
This paper analyses the phenomenon of hallucination in medical generative QA systems using widely adopted LLMs and datasets.
arXiv Detail & Related papers (2023-10-10T03:05:44Z)
- Using In-Context Learning to Improve Dialogue Safety [45.303005593685036]
We investigate a retrieval-based method for reducing bias and toxicity in responses from chatbots.
It uses in-context learning to steer a model towards safer generations.
We find our method performs competitively with strong baselines without requiring training.
arXiv Detail & Related papers (2023-02-02T04:46:03Z)
- An Evaluation of Generative Pre-Training Model-based Therapy Chatbot for Caregivers [5.2116528363639985]
Generative-based approaches, such as the OpenAI GPT models, could allow for more dynamic conversations in therapy contexts.
We built a chatbot using the GPT-2 model and fine-tuned it with 306 therapy session transcripts between family caregivers of individuals with dementia and therapists conducting Problem Solving Therapy.
Results showed that the fine-tuned model created more non-word outputs than the pre-trained model.
arXiv Detail & Related papers (2021-07-28T01:01:08Z)
- Enabling AI and Robotic Coaches for Physical Rehabilitation Therapy: Iterative Design and Evaluation with Therapists and Post-Stroke Survivors [66.07833535962762]
Artificial intelligence (AI) and robotic coaches promise the improved engagement of patients on rehabilitation exercises through social interaction.
Previous work explored the potential of automatically monitoring exercises for AI and robotic coaches, but deployment remains a challenge.
We present our efforts to elicit detailed design specifications for how AI and robotic coaches could interact with patients and guide their exercises.
arXiv Detail & Related papers (2021-06-15T22:06:39Z)
- Emotion-aware Chat Machine: Automatic Emotional Response Generation for Human-like Emotional Interaction [55.47134146639492]
This article proposes a unified end-to-end neural architecture, which is capable of simultaneously encoding the semantics and the emotions in a post.
Experiments on real-world data demonstrate that the proposed method outperforms the state-of-the-art methods in terms of both content coherence and emotion appropriateness.
arXiv Detail & Related papers (2021-06-06T06:26:15Z)
- Counterfactual Off-Policy Training for Neural Response Generation [94.76649147381232]
We propose to explore potential responses by counterfactual reasoning.
Training on the counterfactual responses under the adversarial learning framework helps to explore the high-reward area of the potential response space.
An empirical study on the DailyDialog dataset shows that our approach significantly outperforms the HRED model.
arXiv Detail & Related papers (2020-04-29T22:46:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.