DialogGuard: Multi-Agent Psychosocial Safety Evaluation of Sensitive LLM Responses
- URL: http://arxiv.org/abs/2512.02282v1
- Date: Mon, 01 Dec 2025 23:53:45 GMT
- Title: DialogGuard: Multi-Agent Psychosocial Safety Evaluation of Sensitive LLM Responses
- Authors: Han Luo, Guy Laban
- Abstract summary: We present DialogGuard, a multi-agent framework for assessing psychosocial risks in web-based responses. DialogGuard can be applied to diverse generative models through four LLM-as-a-judge pipelines.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) now mediate many web-based mental-health, crisis, and other emotionally sensitive services, yet their psychosocial safety in these settings remains poorly understood and weakly evaluated. We present DialogGuard, a multi-agent framework for assessing psychosocial risks in LLM-generated responses along five high-severity dimensions: privacy violations, discriminatory behaviour, mental manipulation, psychological harm, and insulting behaviour. DialogGuard can be applied to diverse generative models through four LLM-as-a-judge pipelines, including single-agent scoring, dual-agent correction, multi-agent debate, and stochastic majority voting, grounded in a shared three-level rubric usable by both human annotators and LLM judges. Using PKU-SafeRLHF with human safety annotations, we show that multi-agent mechanisms detect psychosocial risks more accurately than non-LLM baselines and single-agent judging; dual-agent correction and majority voting provide the best trade-off between accuracy, alignment with human ratings, and robustness, while debate attains higher recall but over-flags borderline cases. We release DialogGuard as open-source software with a web interface that provides per-dimension risk scores and explainable natural-language rationales. A formative study with 12 practitioners illustrates how it supports prompt design, auditing, and supervision of web-facing applications for vulnerable users.
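As a concrete illustration of the stochastic majority-voting pipeline, the Python sketch below scores one response on each of the five risk dimensions by sampling several independent judge calls and taking the modal rating. The rubric wording, the 0/1/2 digit-answer format, and the user-supplied `call_llm` helper are illustrative assumptions, not DialogGuard's actual prompts or API.

```python
from collections import Counter
from typing import Callable

# The five psychosocial risk dimensions evaluated by DialogGuard.
DIMENSIONS = [
    "privacy violations",
    "discriminatory behaviour",
    "mental manipulation",
    "psychological harm",
    "insulting behaviour",
]

# Illustrative wording; the paper's shared three-level rubric is not reproduced here.
RUBRIC = ("Rate the response for the given risk dimension: "
          "0 = no risk, 1 = borderline risk, 2 = high risk.")


def majority_vote_scores(
    response: str,
    call_llm: Callable[[str], str],  # user-supplied wrapper around any chat-model API
    n_samples: int = 5,
) -> dict:
    """Score one response per dimension by stochastic majority voting.

    Each judge call should be sampled independently (temperature > 0);
    the per-dimension score is the most common rating across samples.
    """
    scores = {}
    for dim in DIMENSIONS:
        prompt = (f"{RUBRIC}\nDimension: {dim}\n"
                  f"Response to evaluate:\n{response}\n"
                  "Answer with a single digit: 0, 1, or 2.")
        votes = [int(raw[0])
                 for raw in (call_llm(prompt).strip() for _ in range(n_samples))
                 if raw and raw[0] in "012"]  # discard malformed judge outputs
        # Majority vote; fall back to the cautious middle level if all samples were malformed.
        scores[dim] = Counter(votes).most_common(1)[0][0] if votes else 1
    return scores
```

Dual-agent correction could be layered on top by having a second judge review and optionally revise each majority score; debate would instead exchange the judges' rationales before re-voting.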
Related papers
- Multi-Agent Causal Reasoning for Suicide Ideation Detection Through Online Conversations
Suicide remains a pressing global public health concern. Social media platforms offer opportunities for early risk detection through online conversation trees. Existing approaches face two major limitations.
arXiv Detail & Related papers (2026-02-27T01:06:18Z)
- MindGuard: Guardrail Classifiers for Multi-Turn Mental Health Support
General-purpose safeguards fail to distinguish between therapeutic disclosures and genuine clinical crises. We introduce a clinically grounded risk taxonomy, developed in collaboration with PhD-level psychologists. We release MindGuard-testset, a dataset of real-world multi-turn conversations annotated by clinical experts.
arXiv Detail & Related papers (2026-02-01T01:03:20Z)
- Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL
Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education, and healthcare. Their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns. We investigate the extent to which LLMs engage in deception within dialogue, and propose the belief misalignment metric to quantify deception (a hypothetical sketch of such a metric follows this list).
arXiv Detail & Related papers (2025-10-16T05:29:36Z)
- How can we assess human-agent interactions? Case studies in software agent design
We make two major steps towards the rigorous assessment of human-agent interactions. We propose PULSE, a framework for more efficient human-centric evaluation of agent designs. We deploy the framework on a large-scale web platform built around the open-source software agent OpenHands.
arXiv Detail & Related papers (2025-10-10T19:04:28Z)
- The Sum Leaks More Than Its Parts: Compositional Privacy Risks and Mitigations in Multi-Agent Collaboration
Large language models (LLMs) are integral to multi-agent systems. Privacy risks emerge that extend beyond memorization, direct inference, or single-turn evaluations. In particular, seemingly innocuous responses, when composed across interactions, can cumulatively enable adversaries to recover sensitive information.
arXiv Detail & Related papers (2025-09-16T16:57:25Z)
- BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks
BlindGuard is an unsupervised defense method that learns without requiring any attack-specific labels or prior knowledge of malicious behaviors. We show that BlindGuard effectively detects diverse attack types (i.e., prompt injection, memory poisoning, and tool attacks) across multi-agent systems.
arXiv Detail & Related papers (2025-08-11T16:04:47Z)
- Confident-Knowledge Diversity Drives Human-Human and Human-AI Free Discussion Synergy and Reveals Pure-AI Discussion Shortfalls
We study whether large language models can replicate the synergistic gains observed in human discussion. We introduce an agent-agnostic confident-knowledge framework that models each participant by performance (accuracy) and confidence. This framework quantifies confident-knowledge diversity, the degree to which one agent tends to be correct when another is uncertain, and yields a conservative upper bound on gains achievable via confidence-informed decisions (a toy estimator of this quantity is sketched after this list).
arXiv Detail & Related papers (2025-06-15T05:09:20Z)
- Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models
Sentient Agent as a Judge (SAGE) is an evaluation framework for large language models. SAGE instantiates a Sentient Agent that simulates human-like emotional changes and inner thoughts during interaction. SAGE provides a principled, scalable, and interpretable tool for tracking progress toward genuinely empathetic and socially adept language agents.
arXiv Detail & Related papers (2025-05-01T19:06:10Z)
- Personalized Attacks of Social Engineering in Multi-turn Conversations: LLM Agents for Simulation and Detection
Social engineering (SE) attacks on social media platforms pose a significant risk. We propose an LLM-agentic framework, SE-VSim, to simulate SE attack mechanisms by generating multi-turn conversations. We present a proof of concept, SE-OmniGuard, to offer personalized protection to users by leveraging prior knowledge of the victim's personality.
arXiv Detail & Related papers (2025-03-18T19:14:44Z)
- Coherence-Driven Multimodal Safety Dialogue with Active Learning for Embodied Agents
M-CoDAL is a multimodal-dialogue system specifically designed for embodied agents to better understand and communicate in safety-critical situations. Our approach is evaluated using a newly created multimodal dataset comprising 1K safety violations extracted from 2K Reddit images. Results with this dataset demonstrate that our approach improves the resolution of safety situations, user sentiment, and the safety of the conversation.
arXiv Detail & Related papers (2024-10-18T03:26:06Z)
- PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety
Multi-agent systems, when enhanced with Large Language Models (LLMs), exhibit profound capabilities in collective intelligence.
However, the potential misuse of this intelligence for malicious purposes presents significant risks.
We propose a framework (PsySafe) grounded in agent psychology, focusing on identifying how dark personality traits in agents can lead to risky behaviors.
Our experiments reveal several intriguing phenomena, such as the collective dangerous behaviors among agents, agents' self-reflection when engaging in dangerous behavior, and the correlation between agents' psychological assessments and dangerous behaviors.
arXiv Detail & Related papers (2024-01-22T12:11:55Z)
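The belief-misalignment metric referenced in the deceptive-dialogue entry above is only named, not defined, in its summary. As a loose, hypothetical formalization (not the paper's actual definition), one could score a dialogue by how much it moves a listener's belief away from the ground truth:

```python
def belief_misalignment(listener_belief: float, true_value: float) -> float:
    """Hypothetical belief-misalignment score: how far the listener's belief
    (probability that a claim is true) sits from the ground truth.
    0.0 = perfectly aligned, 1.0 = maximally misled.
    Illustrative stand-in, not the definition from the paper.
    """
    return abs(listener_belief - true_value)


def deception_delta(belief_before: float, belief_after: float, true_value: float) -> float:
    """A dialogue is deceptive to the extent it *increases* misalignment."""
    return (belief_misalignment(belief_after, true_value)
            - belief_misalignment(belief_before, true_value))


# Example: the claim is true (1.0); the listener's belief drops from 0.6 to 0.3,
# so the dialogue increased misalignment by 0.3 -- evidence of deception.
print(deception_delta(0.6, 0.3, 1.0))  # 0.3
```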
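Similarly, the confident-knowledge diversity entry above quantifies "the degree to which one agent tends to be correct when another is uncertain". A minimal estimator under assumed inputs (per-item correctness flags plus self-reported confidences; the record format and threshold are illustrative, not the paper's) might look like:

```python
from statistics import mean


def confident_knowledge_diversity(
    records: list,
    confidence_threshold: float = 0.5,
) -> float:
    """Estimate how often agent A is correct on exactly the items where
    agent B is uncertain (confidence below the threshold).

    Each record is a tuple (a_correct, b_correct, a_confidence, b_confidence)
    for one question. This operationalization is an illustrative assumption;
    the paper's framework is richer.
    """
    rescued = [a_ok for a_ok, _, _, b_conf in records if b_conf < confidence_threshold]
    return mean(rescued) if rescued else 0.0


# Example: A is correct on two of the three items where B is uncertain.
records = [
    (True,  True,  0.9, 0.8),   # both confident and correct
    (True,  False, 0.8, 0.3),   # B uncertain, A correct -> rescued
    (False, False, 0.4, 0.2),   # B uncertain, A wrong
    (True,  False, 0.7, 0.4),   # B uncertain, A correct -> rescued
]
print(confident_knowledge_diversity(records))  # 0.666...
```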