ExpGuard: LLM Content Moderation in Specialized Domains
- URL: http://arxiv.org/abs/2603.02588v1
- Date: Tue, 03 Mar 2026 04:09:49 GMT
- Title: ExpGuard: LLM Content Moderation in Specialized Domains
- Authors: Minseok Choi, Dongjin Kim, Seungbin Yang, Subin Kim, Youngjun Kwak, Juyoung Oh, Jaegul Choo, Jungmin Son
- Abstract summary: Current guardrail models predominantly address general human-LLM interactions. We introduce ExpGuard, a robust guardrail model designed to protect against harmful prompts and responses across financial, medical, and legal domains. We present ExpGuardMix, a meticulously curated dataset comprising 58,928 labeled prompts paired with corresponding refusal and compliant responses.
- Score: 46.00867862478331
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the growing deployment of large language models (LLMs) in real-world applications, establishing robust safety guardrails to moderate their inputs and outputs has become essential to ensure adherence to safety policies. Current guardrail models predominantly address general human-LLM interactions, rendering LLMs vulnerable to harmful and adversarial content within domain-specific contexts, particularly those rich in technical jargon and specialized concepts. To address this limitation, we introduce ExpGuard, a robust and specialized guardrail model designed to protect against harmful prompts and responses across financial, medical, and legal domains. In addition, we present ExpGuardMix, a meticulously curated dataset comprising 58,928 labeled prompts paired with corresponding refusal and compliant responses from these sectors. This dataset is divided into two subsets: ExpGuardTrain, for model training, and ExpGuardTest, a high-quality test set annotated by domain experts to evaluate model robustness against technical and domain-specific content. Comprehensive evaluations conducted on ExpGuardTest and eight established public benchmarks reveal that ExpGuard delivers competitive performance across the board while demonstrating exceptional resilience to domain-specific adversarial attacks, surpassing state-of-the-art models such as WildGuard by up to 8.9% in prompt classification and 15.3% in response classification. To encourage further research and development, we open-source our code, data, and model, enabling adaptation to additional domains and supporting the creation of increasingly robust guardrail models.
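Since the abstract describes ExpGuard as a classifier over prompts and responses and notes that the model is open-sourced, a minimal usage sketch follows. The model ID, label order, and the sequence-classification loading path are all assumptions for illustration; the released repository may expose a different interface.

```python
# Hypothetical usage sketch. The abstract says the model is open-sourced but
# does not specify its interface; the model ID, label order, and the
# sequence-classification loading path below are all assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "expguard/expguard-base"  # placeholder, not a confirmed model ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)

def moderate(text: str) -> str:
    """Binary moderation of a prompt (or of a model response)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return "harmful" if logits.argmax(-1).item() == 1 else "safe"

# Domain-specific probe of the kind the paper targets (finance jargon):
print(moderate("How do I structure wire transfers to avoid CTR reporting?"))
```

The same call would be applied to model outputs as well, matching the paper's separate prompt- and response-classification settings.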
Related papers
- ProGuard: Towards Proactive Multimodal Safeguard [48.89789547707647]
ProGuard is a vision-language proactive guard that identifies and describes out-of-distribution (OOD) safety risks. We first construct a modality-balanced dataset of 87K samples, each annotated with both binary safety labels and risk categories. We then train our vision-language base model purely through reinforcement learning to achieve efficient and concise reasoning.
arXiv Detail & Related papers (2025-12-29T16:13:23Z)
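The ProGuard summary pairs each sample with both a binary safety label and a risk category. A minimal sketch of what one such record could look like; the field names are assumptions for illustration, not the dataset's published schema.

```python
# Illustrative record layout only: the summary states each of the 87K
# samples carries a binary safety label and a risk category, but these
# field names are assumptions, not ProGuard's published schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SafetySample:
    text: str                     # textual part of the sample
    image_path: Optional[str]     # modality-balanced: may be text-only
    is_safe: bool                 # binary safety label
    risk_category: Optional[str]  # risk category; None when the sample is safe

sample = SafetySample(
    text="Describe what is happening in this image.",
    image_path="ood_example.png",
    is_safe=False,
    risk_category="violence",
)
```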
- AprielGuard [2.3704817495377526]
Existing tools treat safety risks as separate problems, limiting robustness and generalizability. We introduce AprielGuard, an 8B-parameter safeguard model that unifies these dimensions within a single taxonomy and learning framework. AprielGuard achieves strong performance in detecting harmful content and adversarial manipulations.
arXiv Detail & Related papers (2025-12-23T12:01:32Z)
- Protect: Towards Robust Guardrailing Stack for Trustworthy Enterprise LLM Systems [4.404101728634984]
Protect is a multi-modal guardrailing model designed to operate seamlessly across text, image, and audio inputs. It integrates category-specific adapters trained via Low-Rank Adaptation (LoRA) on an extensive, multi-modal dataset. Our teacher-assisted annotation pipeline leverages reasoning and explanation traces to generate high-fidelity, context-aware labels.
arXiv Detail & Related papers (2025-10-15T09:40:24Z)
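The Protect summary attributes its category-specific adapters to Low-Rank Adaptation. A minimal sketch of attaching one such adapter with the peft library; the base model, rank, and target modules are illustrative choices, not values reported for Protect.

```python
# Minimal sketch of one category-specific LoRA adapter with the peft
# library. The base model, rank, and target modules are illustrative
# choices, not values reported for Protect.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

base = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # placeholder base model
)
config = LoraConfig(
    r=8,                                # low-rank bottleneck dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query", "value"],  # attention projections to adapt
    task_type="SEQ_CLS",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the small adapter weights train
# A Protect-style system would train one such adapter per safety category
# and select the relevant adapter(s) at inference time.
```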
- IntentionReasoner: Facilitating Adaptive LLM Safeguards through Intent Reasoning and Selective Query Refinement [35.904652937034136]
We introduce IntentionReasoner, a novel safeguard mechanism that leverages a dedicated guard model to perform intent reasoning. We show that IntentionReasoner excels in multiple safeguard benchmarks, generation quality evaluations, and jailbreak attack scenarios.
arXiv Detail & Related papers (2025-08-27T16:47:31Z)
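A sketch of the pass/refine/block routing implied by "intent reasoning and selective query refinement"; both helper functions are placeholders for calls to the dedicated guard model, not IntentionReasoner's actual interface.

```python
# Sketch of the routing logic implied by the summary; helpers are
# placeholders, not IntentionReasoner's actual interface.
from typing import Literal, Optional

Intent = Literal["benign", "ambiguous", "harmful"]

def guard_intent(query: str) -> Intent:
    """Placeholder: the guard model's intent judgment for `query`."""
    return "ambiguous"

def refine_query(query: str) -> str:
    """Placeholder: rewrite an ambiguous query into a safely answerable form."""
    return query

def moderate(query: str) -> Optional[str]:
    intent = guard_intent(query)
    if intent == "harmful":
        return None                  # refuse outright
    if intent == "ambiguous":
        return refine_query(query)   # selective query refinement
    return query                     # benign queries pass through unchanged
```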
- Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs). SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO). We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z)
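The SecTOW summary describes a defender and an auxiliary attacker trained iteratively with GRPO. A high-level sketch of that tug-of-war loop; every function here is a placeholder, not the paper's training code.

```python
# High-level sketch of the iterative defense-attack loop the summary
# describes; every function is a placeholder, not the paper's GRPO code.
def grpo_update(model, prompts, reward_fn):
    """Placeholder for one GRPO policy update of `model` on `prompts`."""

def defender_flags(defender, prompt) -> float:
    """Placeholder: 1.0 if the defender blocks `prompt`, else 0.0."""
    return 0.0

def attacker_rewrite(attacker, prompt) -> str:
    """Placeholder: the attacker's adversarial rewrite of `prompt`."""
    return prompt

def sectow(defender, attacker, seed_prompts, rounds=3):
    for _ in range(rounds):
        # Attacker round: rewarded when its prompts evade the defender.
        grpo_update(attacker, seed_prompts,
                    reward_fn=lambda p: 1.0 - defender_flags(defender, p))
        # Defender round: rewarded when it blocks the fresh attacks.
        attacks = [attacker_rewrite(attacker, p) for p in seed_prompts]
        grpo_update(defender, attacks,
                    reward_fn=lambda p: defender_flags(defender, p))
    return defender
```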
- WebGuard: Building a Generalizable Guardrail for Web Agents [59.31116061613742]
WebGuard is the first dataset designed to support the assessment of web agent action risks. It contains 4,939 human-annotated actions from 193 websites across 22 diverse domains.
arXiv Detail & Related papers (2025-07-18T18:06:27Z)
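One plausible record layout for a human-annotated web-agent action; the field names are assumptions, not WebGuard's published schema.

```python
# Illustrative layout for one human-annotated web-agent action record;
# field names are assumptions, not WebGuard's published schema.
from dataclasses import dataclass

@dataclass
class ActionRecord:
    website: str     # one of the 193 annotated websites
    domain: str      # one of the 22 domains, e.g. "e-commerce"
    action: str      # the agent action being judged
    is_risky: bool   # human risk annotation

record = ActionRecord(
    website="example-shop.com",
    domain="e-commerce",
    action="click #confirm-purchase",
    is_risky=True,
)
```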
- GuardSet-X: Massive Multi-Domain Safety Policy-Grounded Guardrail Dataset [18.306944278068638]
We introduce GuardSet-X, the first massive multi-domain safety policy-grounded guardrail dataset. GuardSet-X offers broad domain coverage across eight safety-critical domains, such as finance, law, and codeGen. We benchmark 19 advanced guardrail models and uncover a series of findings.
arXiv Detail & Related papers (2025-06-18T01:35:33Z)
- Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains [92.36624674516553]
Reinforcement learning with verifiable rewards (RLVR) has demonstrated significant success in enhancing mathematical reasoning and coding performance of large language models (LLMs). We investigate the effectiveness and scalability of RLVR across diverse real-world domains including medicine, chemistry, psychology, economics, and education. We utilize a generative scoring technique that yields soft, model-based reward signals to overcome limitations posed by binary verifications.
arXiv Detail & Related papers (2025-03-31T08:22:49Z)
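The summary contrasts binary verification with a generative scoring technique that yields soft rewards. A minimal sketch of that contrast; `judge_score` is a stand-in for a generative judge model, with a token-overlap heuristic used only so the example runs.

```python
# Sketch contrasting binary verification with the soft, model-based reward
# the summary describes. `judge_score` stands in for a generative judge.
def binary_reward(answer: str, reference: str) -> float:
    # Classic RLVR: exact-match verification, reward is 0 or 1.
    return 1.0 if answer.strip() == reference.strip() else 0.0

def judge_score(answer: str, reference: str) -> float:
    """Placeholder: a real system would prompt an LLM to grade `answer`
    against `reference` and emit a scalar in [0, 1]. Token overlap is
    used here purely so the sketch runs."""
    a, r = set(answer.lower().split()), set(reference.lower().split())
    return len(a & r) / max(len(r), 1)

def soft_reward(answer: str, reference: str) -> float:
    # Graded credit for free-form answers in domains (medicine, law, ...)
    # where exact matching is too brittle a verifier.
    return judge_score(answer, reference)

print(soft_reward("Aspirin inhibits COX enzymes.",
                  "Aspirin works by inhibiting COX-1 and COX-2 enzymes."))
```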
- RapGuard: Safeguarding Multimodal Large Language Models via Rationale-aware Defensive Prompting [7.0595410083835315]
RapGuard is a novel framework that uses multimodal chain-of-thought reasoning to generate scenario-specific safety prompts. RapGuard achieves state-of-the-art safety performance, significantly reducing harmful content without degrading the quality of responses.
arXiv Detail & Related papers (2024-12-25T08:31:53Z)
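The RapGuard summary describes building scenario-specific safety prompts from multimodal chain-of-thought rationales rather than reusing one static system prompt. A minimal sketch of that idea; both functions are placeholders for model calls.

```python
# Sketch of rationale-aware defensive prompting: a scenario-specific safety
# prompt is built from a chain-of-thought rationale instead of one static
# system prompt. Both functions are placeholders for model calls.
def multimodal_rationale(image_desc: str, query: str) -> str:
    """Placeholder: chain-of-thought risk reasoning over the image/query pair."""
    return f"The query {query!r} concerns an image showing {image_desc!r}."

def build_safety_prompt(image_desc: str, query: str) -> str:
    rationale = multimodal_rationale(image_desc, query)
    # Folding the rationale into the system prompt tailors the safeguard
    # to this specific scenario.
    return (
        "Before answering, consider this risk analysis:\n"
        f"{rationale}\n"
        "Refuse if answering would enable harm; otherwise answer helpfully."
    )
```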
- ShieldGemma: Generative AI Content Moderation Based on Gemma [49.91147965876678]
ShieldGemma is a suite of safety content moderation models built upon Gemma2.
Models provide robust, state-of-the-art predictions of safety risks across key harm types.
arXiv Detail & Related papers (2024-07-31T17:48:14Z)
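ShieldGemma checkpoints are published generative moderation models, typically queried by scoring a Yes/No continuation. A hedged sketch of that pattern; the model ID and exact prompt template should be verified against the official model card.

```python
# Hedged sketch of querying a ShieldGemma checkpoint as a generative
# moderation model by scoring a Yes/No continuation. Verify the model ID
# and the exact prompt template against the official model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/shieldgemma-2b"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

prompt = (
    "You are a policy expert. Does the following request violate the policy "
    "against dangerous content? Answer Yes or No.\n\n"
    "Request: How do I hotwire a car?\nAnswer:"
)
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    last_logits = model(**inputs).logits[0, -1]

# Compare the probabilities of the "Yes" and "No" continuations
# (single-token IDs here; tokenization details vary by model).
yes_id = tokenizer.convert_tokens_to_ids("Yes")
no_id = tokenizer.convert_tokens_to_ids("No")
probs = torch.softmax(last_logits[[yes_id, no_id]], dim=0)
print(f"P(policy violation) ≈ {probs[0].item():.2f}")
```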