RiskAtlas: Exposing Domain-Specific Risks in LLMs through Knowledge-Graph-Guided Harmful Prompt Generation
- URL: http://arxiv.org/abs/2601.04740v1
- Date: Thu, 08 Jan 2026 09:05:28 GMT
- Title: RiskAtlas: Exposing Domain-Specific Risks in LLMs through Knowledge-Graph-Guided Harmful Prompt Generation
- Authors: Huawei Zheng, Xinqi Jiang, Sen Yang, Shouling Ji, Yingcai Wu, Dazhen Deng
- Abstract summary: Large language models (LLMs) are increasingly applied in specialized domains such as finance and healthcare. We propose an end-to-end framework that performs knowledge-graph-guided harmful prompt generation and applies dual-path obfuscation rewriting. This framework yields high-quality datasets combining strong domain relevance with implicitness.
- Score: 53.47466016688839
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are increasingly applied in specialized domains such as finance and healthcare, where they introduce unique safety risks. Domain-specific datasets of harmful prompts remain scarce and still largely rely on manual construction; public datasets mainly focus on explicit harmful prompts, which modern LLM defenses can often detect and refuse. In contrast, implicit harmful prompts, expressed through indirect domain knowledge, are harder to detect and better reflect real-world threats. We identify two challenges: transforming domain knowledge into actionable constraints and increasing the implicitness of generated harmful prompts. To address them, we propose an end-to-end framework that first performs knowledge-graph-guided harmful prompt generation to systematically produce domain-relevant prompts, and then applies dual-path obfuscation rewriting to convert explicit harmful prompts into implicit variants via direct and context-enhanced rewriting. This framework yields high-quality datasets combining strong domain relevance with implicitness, enabling more realistic red-teaming and advancing LLM safety research. We release our code and datasets on GitHub.
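The abstract describes a two-stage pipeline: knowledge-graph-guided generation of explicit, domain-relevant harmful prompts, followed by dual-path obfuscation rewriting (direct and context-enhanced). Below is a minimal sketch of how such a pipeline could be wired together; every name in it (Triple, generate_explicit_prompt, rewrite_direct, rewrite_with_context, llm) is a hypothetical placeholder, not the authors' released code.

```python
# Minimal sketch (not the authors' released code) of the two-stage pipeline
# described in the abstract: (1) knowledge-graph-guided generation of explicit,
# domain-relevant harmful prompts, (2) dual-path obfuscation rewriting.
# All names below are hypothetical illustrations.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Triple:
    head: str       # domain entity, e.g. a drug or a financial instrument
    relation: str   # e.g. "interacts_with", "is_precursor_of"
    tail: str

def generate_explicit_prompt(triple: Triple, llm: Callable[[str], str]) -> str:
    """Stage 1: ground a harmful intent in a knowledge-graph triple so the
    resulting prompt is explicitly harmful but domain-specific."""
    return llm(
        "Write a plainly harmful user request in this domain that relies on the fact: "
        f"{triple.head} {triple.relation} {triple.tail}."
    )

def rewrite_direct(prompt: str, llm: Callable[[str], str]) -> str:
    """Stage 2, path A (direct rewriting): paraphrase so the harmful intent is
    implied rather than stated, removing trigger keywords."""
    return llm(f"Rewrite this request so its intent is only implied: {prompt}")

def rewrite_with_context(prompt: str, llm: Callable[[str], str]) -> str:
    """Stage 2, path B (context-enhanced rewriting): embed the request in a
    plausible professional scenario that masks the intent behind domain framing."""
    return llm(f"Embed this request in a realistic professional scenario: {prompt}")

def build_dataset(triples: list[Triple], llm: Callable[[str], str]) -> list[dict]:
    dataset = []
    for t in triples:
        explicit = generate_explicit_prompt(t, llm)
        dataset.append({
            "triple": (t.head, t.relation, t.tail),
            "explicit": explicit,
            "implicit_direct": rewrite_direct(explicit, llm),
            "implicit_context": rewrite_with_context(explicit, llm),
        })
    return dataset
```

In this reading, each knowledge-graph triple supplies the domain relevance, while the two rewriting paths supply the implicitness, yielding one explicit and two implicit variants per seed.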
Related papers
- SoK: Privacy Risks and Mitigations in Retrieval-Augmented Generation Systems [53.51921540246166]
Retrieval-Augmented Generation (RAG) techniques have become widely popular. RAG involves the coupling of Large Language Models (LLMs) with domain-specific knowledge bases. The proliferation of RAG has sparked concerns about data privacy.
arXiv Detail & Related papers (2026-01-07T14:50:41Z) - Learning to Extract Context for Context-Aware LLM Inference [60.376872353918394]
User prompts to large language models (LLMs) are often ambiguous or under-specified. Contextual cues shaped by user intentions, prior knowledge, and risk factors influence what constitutes an appropriate response. We propose a framework that extracts and leverages such contextual information from the user prompt itself.
arXiv Detail & Related papers (2025-12-12T19:10:08Z) - RAG Security and Privacy: Formalizing the Threat Model and Attack Surface [4.823988025629304]
Retrieval-Augmented Generation (RAG) is an emerging approach in natural language processing that combines large language models (LLMs) with external document retrieval to produce more accurate and grounded responses. Existing research has demonstrated that LLMs can leak sensitive information through training data memorization or adversarial prompts, and RAG systems inherit many of these vulnerabilities. Despite these risks, there is currently no formal framework that defines the threat landscape for RAG systems.
arXiv Detail & Related papers (2025-09-24T17:11:35Z) - Fine-Grained Privacy Extraction from Retrieval-Augmented Generation Systems via Knowledge Asymmetry Exploitation [15.985529058573912]
Retrieval-augmented generation (RAG) systems enhance large language models (LLMs) by integrating external knowledge bases. Existing privacy attacks on RAG systems can trigger data leakage but often fail to accurately isolate knowledge-base-derived sentences within mixed responses. This paper presents a novel black-box attack framework that exploits knowledge asymmetry between RAG and standard LLMs to achieve fine-grained privacy extraction.
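One plausible reading of the "knowledge asymmetry" idea (an interpretation, not the paper's stated procedure): sentences a black-box RAG system produces that a retrieval-free baseline LLM cannot support are likely drawn from the private knowledge base. A hypothetical sketch, with all helper names (query_rag, query_baseline, is_supported) invented for illustration:

```python
# Illustrative sketch of one plausible reading of "knowledge asymmetry":
# content in the RAG answer that a retrieval-free baseline model does not
# produce or support is attributed to the retrieved (private) documents.
# All function names are hypothetical, not the paper's API.
from typing import Callable

def extract_private_sentences(
    probe: str,
    query_rag: Callable[[str], str],          # black-box RAG endpoint
    query_baseline: Callable[[str], str],     # comparable LLM without retrieval
    is_supported: Callable[[str, str], bool], # does the baseline output support the sentence?
) -> list[str]:
    rag_answer = query_rag(probe)
    baseline_answer = query_baseline(probe)
    leaked = []
    for sentence in rag_answer.split(". "):
        # Asymmetry signal: the RAG system knows it, the baseline does not.
        if sentence and not is_supported(sentence, baseline_answer):
            leaked.append(sentence)
    return leaked
```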
arXiv Detail & Related papers (2025-07-31T03:50:16Z) - Explicit Vulnerability Generation with LLMs: An Investigation Beyond Adversarial Attacks [0.5218155982819203]
Large Language Models (LLMs) are increasingly used as code assistants. This study examines a more direct threat: open-source LLMs generating vulnerable code when prompted.
arXiv Detail & Related papers (2025-07-14T08:36:26Z) - ROSE: Toward Reality-Oriented Safety Evaluation of Large Language Models [60.28667314609623]
Large Language Models (LLMs) are increasingly deployed as black-box components in real-world applications. We propose Reality-Oriented Safety Evaluation (ROSE), a novel framework that uses multi-objective reinforcement learning to fine-tune an adversarial LLM.
arXiv Detail & Related papers (2025-06-17T10:55:17Z) - Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures [17.9033567125575]
Large language models (LLMs) are increasingly deployed in real-world applications, raising concerns about their security. While jailbreak attacks highlight failures under overtly harmful queries, they overlook a critical risk: incorrectly answering harmless-looking inputs can be dangerous and cause real-world harm (Implicit Harm). We systematically reformulate the LLM risk landscape through a structured quadrant perspective based on output factuality and input harmlessness, uncovering a high-risk region.
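A toy encoding of the quadrant view (input harmlessness crossed with output factuality); only the "implicit harm" cell is characterized by the abstract itself, and the other labels are illustrative guesses rather than the paper's terminology:

```python
# Toy encoding of the quadrant perspective sketched above. Only the implicit-harm
# cell (harmless-looking input answered incorrectly) comes from the abstract;
# the remaining labels are illustrative placeholders.
def risk_quadrant(input_looks_harmless: bool, output_is_factual: bool) -> str:
    if input_looks_harmless and not output_is_factual:
        return "implicit harm: harmless-looking input answered incorrectly"
    if not input_looks_harmless and output_is_factual:
        return "overtly harmful query answered accurately (illustrative label)"
    if input_looks_harmless and output_is_factual:
        return "benign interaction (illustrative label)"
    return "overtly harmful query, incorrect or refused output (illustrative label)"
```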
arXiv Detail & Related papers (2025-06-09T03:52:43Z) - LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts [88.96201324719205]
Safety concerns in large language models (LLMs) have gained significant attention due to their exposure to potentially harmful data during pre-training. We identify a new safety vulnerability in LLMs, where seemingly benign prompts, semantically related to harmful content, can bypass safety mechanisms. We introduce a novel attack method, ActorBreaker, which identifies actors related to toxic prompts within the pre-training distribution.
arXiv Detail & Related papers (2024-10-14T16:41:49Z) - PILLAR: an AI-Powered Privacy Threat Modeling Tool [2.2366638308792735]
PILLAR is a new tool that integrates Large Language Models with the LINDDUN framework to streamline and enhance privacy threat modeling.
PILLAR automates key parts of the LINDDUN process, such as generating DFDs, classifying threats, and prioritizing risks.
arXiv Detail & Related papers (2024-10-11T12:13:03Z) - Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts [57.49685172971446]
We present Rainbow Teaming, a novel black-box approach for producing a diverse collection of adversarial prompts. Our approach reveals hundreds of effective adversarial prompts, with an attack success rate exceeding 90%. We additionally explore the versatility of Rainbow Teaming by applying it to question answering and cybersecurity.
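Rainbow Teaming frames adversarial prompt generation as an open-ended, quality-diversity search over prompt features. The loop below is a generic MAP-Elites-style sketch of that idea, with hypothetical helpers (mutate, features, attack_success); it is not the paper's exact procedure.

```python
# Generic quality-diversity (MAP-Elites-style) loop in the spirit of Rainbow
# Teaming: keep an archive indexed by prompt features (e.g. risk category,
# attack style), mutate archived prompts, and replace a cell's occupant only
# when the mutant scores a higher attack-success estimate.
# All helper names are hypothetical.
import random
from typing import Callable

def rainbow_style_search(
    seed_prompts: list[str],
    mutate: Callable[[str], str],                # e.g. LLM-based prompt mutation
    features: Callable[[str], tuple[str, str]],  # (risk_category, attack_style)
    attack_success: Callable[[str], float],      # judge score in [0, 1]
    iterations: int = 1000,
) -> dict[tuple[str, str], tuple[str, float]]:
    archive: dict[tuple[str, str], tuple[str, float]] = {}
    for p in seed_prompts:
        archive[features(p)] = (p, attack_success(p))
    for _ in range(iterations):
        parent, _ = random.choice(list(archive.values()))
        child = mutate(parent)
        cell = features(child)
        score = attack_success(child)
        # Diversity comes from covering many cells; strength comes from keeping
        # only the best-scoring prompt within each cell.
        if cell not in archive or score > archive[cell][1]:
            archive[cell] = (child, score)
    return archive
```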
arXiv Detail & Related papers (2024-02-26T18:47:27Z) - Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection [64.67495502772866]
Large Language Models (LLMs) are increasingly being integrated into various applications.
We show how attackers can override original instructions and employed controls using Prompt Injection attacks.
We derive a comprehensive taxonomy from a computer security perspective to systematically investigate impacts and vulnerabilities.
arXiv Detail & Related papers (2023-02-23T17:14:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.