TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis
- URL: http://arxiv.org/abs/2505.24672v1
- Date: Fri, 30 May 2025 15:02:21 GMT
- Title: TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis
- Authors: Xiaorui Wu, Xiaofeng Mao, Fei Li, Xin Zhang, Xuanhong Li, Chong Teng, Donghong Ji, Zhuang Li
- Abstract summary: Large Language Models (LLMs) excel in various natural language processing tasks but remain vulnerable to generating harmful content or being exploited for malicious purposes. We propose a novel analysis framework to measure the risk coverage of alignment datasets across three essential dimensions: Lexical Diversity, Malicious Intent, and Jailbreak Tactics.
- Score: 35.2545408706656
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) excel in various natural language processing tasks but remain vulnerable to generating harmful content or being exploited for malicious purposes. Although safety alignment datasets have been introduced to mitigate such risks through supervised fine-tuning (SFT), these datasets often lack comprehensive risk coverage. Most existing datasets focus primarily on lexical diversity while neglecting other critical dimensions. To address this limitation, we propose a novel analysis framework to systematically measure the risk coverage of alignment datasets across three essential dimensions: Lexical Diversity, Malicious Intent, and Jailbreak Tactics. We further introduce TRIDENT, an automated pipeline that leverages persona-based, zero-shot LLM generation to produce diverse and comprehensive instructions spanning these dimensions. Each harmful instruction is paired with an ethically aligned response, resulting in two datasets: TRIDENT-Core, comprising 26,311 examples, and TRIDENT-Edge, with 18,773 examples. Fine-tuning Llama 3.1-8B on TRIDENT-Edge demonstrates substantial improvements, achieving an average 14.29% reduction in Harm Score, and a 20% decrease in Attack Success Rate compared to the best-performing baseline model fine-tuned on the WildBreak dataset.
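The abstract does not specify how the Lexical Diversity dimension is scored. As a rough illustration only, a distinct-n statistic (an assumption, not the paper's metric) is one simple way to quantify lexical variety across a set of red-teaming instructions:

```python
# Minimal sketch of one way to score a Lexical Diversity dimension over an
# alignment dataset. The distinct-n statistic here is an illustrative
# assumption; TRIDENT's actual measurement framework is not reproduced.
def distinct_n(texts, n=2):
    """Fraction of unique n-grams across a corpus (higher = more diverse)."""
    total, unique = 0, set()
    for text in texts:
        tokens = text.lower().split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

instructions = [
    "Explain how to bypass a content filter.",
    "Describe a method for bypassing content filters.",
    "Write a persuasive essay on a controversial topic.",
]
print(f"distinct-2: {distinct_n(instructions, n=2):.3f}")
```

A near-duplicate pair like the first two instructions above pulls the score down, which is exactly the kind of low lexical coverage the paper's framework is meant to flag.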
Related papers
- EnchTable: Unified Safety Alignment Transfer in Fine-tuned Large Language Models [23.11474404054016]
Many machine learning models are fine-tuned from large language models (LLMs) to achieve high performance in specialized domains like code generation, biomedical analysis, and mathematical problem solving. We introduce EnchTable, a novel framework designed to transfer and maintain safety alignment in downstream LLMs without requiring extensive retraining.
arXiv Detail & Related papers (2025-11-13T02:26:59Z)
- External Data Extraction Attacks against Retrieval-Augmented Large Language Models [70.47869786522782]
Retrieval-augmented generation (RAG) has emerged as a key paradigm for enhancing large language models (LLMs). RAG also introduces new risks of external data extraction attacks (EDEAs), where sensitive or copyrighted data in its knowledge base may be extracted verbatim. We present the first comprehensive study to formalize EDEAs against retrieval-augmented LLMs.
arXiv Detail & Related papers (2025-10-03T12:53:45Z)
- GRAID: Synthetic Data Generation with Geometric Constraints and Multi-Agentic Reflection for Harmful Content Detection [4.61489054791777]
We introduce GRAID (Geometric and Reflective AI-Driven Data Augmentation), a novel pipeline for dataset augmentation. GRAID consists of two stages: (i) generation of geometrically controlled examples using a constrained LLM, and (ii) augmentation through a multi-agentic reflective process. We demonstrate that augmenting a harmful text classification dataset with GRAID leads to significant improvements in downstream guardrail model performance.
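A rough skeleton of this two-stage loop, under heavy assumptions (the geometric constraint is abstracted into a prompt, and all function names and prompts are hypothetical stand-ins, not the authors' code):

```python
# Sketch of a generate-then-reflect augmentation loop in the spirit of GRAID:
# stage (i) constrained generation, stage (ii) multi-agent majority vote.
# Every prompt and callable here is an illustrative placeholder.
def generate_constrained(seed_example, llm):
    """Stage (i): ask an LLM for a candidate close to the seed; the geometric
    constraint is abstracted as a prompt instruction in this sketch."""
    return llm(f"Produce a variant close in meaning to: {seed_example}")

def reflect(candidate, critic_llms):
    """Stage (ii): multiple critic agents vote on whether to keep a candidate."""
    votes = [critic(f"Is this a valid example for the dataset? {candidate}")
             for critic in critic_llms]
    return sum("yes" in v.lower() for v in votes) > len(critic_llms) / 2

def graid_style_augment(seeds, llm, critic_llms):
    kept = []
    for seed in seeds:
        candidate = generate_constrained(seed, llm)
        if reflect(candidate, critic_llms):
            kept.append(candidate)
    return kept

echo = lambda p: "yes: " + p  # trivial stand-in LLM for a dry run
print(graid_style_augment(["seed text"], echo, [echo, echo, echo]))
```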
arXiv Detail & Related papers (2025-08-23T15:09:58Z)
- TCDiff: Triplex Cascaded Diffusion for High-fidelity Multimodal EHRs Generation with Incomplete Clinical Data [7.661128607911307]
We propose TCDiff, a novel EHR generation framework that cascades three diffusion networks to learn the features of real-world EHR data. TCDiff consistently outperforms state-of-the-art baselines by an average of 10% in data fidelity under various missing rates. This highlights the effectiveness, robustness, and generalizability of our approach in real-world healthcare scenarios.
arXiv Detail & Related papers (2025-08-03T06:24:20Z)
- Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning [22.13346397293792]
Vulnerability-Aware Alignment (VAA) estimates data vulnerability, partitions data into "vulnerable" and "invulnerable" groups, and encourages balanced learning. VAA significantly reduces harmful scores while preserving downstream task performance, outperforming state-of-the-art baselines.
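The partition-and-balance idea can be sketched minimally as below; the random vulnerability scores and the 0.5 threshold are assumptions for illustration, not the paper's estimator:

```python
# Sketch of vulnerability-aware partitioning and balanced batching.
# Scores and threshold are placeholders; VAA's estimator is not reproduced.
import random

def partition_by_vulnerability(dataset, scores, threshold=0.5):
    """Split examples by an estimated per-sample vulnerability score."""
    vulnerable = [x for x, s in zip(dataset, scores) if s >= threshold]
    invulnerable = [x for x, s in zip(dataset, scores) if s < threshold]
    return vulnerable, invulnerable

def balanced_batch(vulnerable, invulnerable, batch_size=8):
    """Draw evenly from both groups so neither dominates a batch."""
    half = min(batch_size // 2, len(vulnerable), len(invulnerable))
    return random.sample(vulnerable, half) + random.sample(invulnerable, half)

data = [f"example-{i}" for i in range(20)]
scores = [random.random() for _ in data]  # stand-in vulnerability estimates
vul, inv = partition_by_vulnerability(data, scores)
print(balanced_batch(vul, inv))
```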
arXiv Detail & Related papers (2025-06-04T11:33:36Z)
- OBLIVIATE: Robust and Practical Machine Unlearning for Large Language Models [12.848214683467297]
Large language models (LLMs) trained over extensive corpora risk memorizing sensitive, copyrighted, or toxic content. We propose OBLIVIATE, a robust unlearning framework that removes targeted data while preserving model utility. We conduct experiments on multiple datasets, including the Harry Potter series, WMDP, and TOFU.
arXiv Detail & Related papers (2025-05-07T13:51:42Z)
- CyberLLMInstruct: A New Dataset for Analysing Safety of Fine-Tuned LLMs Using Cyber Security Data [2.2530496464901106]
The integration of large language models into cyber security applications presents significant opportunities. CyberLLMInstruct is a dataset of 54,928 instruction-response pairs spanning cyber security tasks. Fine-tuned models can achieve up to 92.50 percent accuracy on the CyberMetric benchmark.
arXiv Detail & Related papers (2025-03-12T12:29:27Z)
- Towards Robust Universal Information Extraction: Benchmark, Evaluation, and Solution [66.11004226578771]
Existing robust benchmark datasets have two key limitations. They generate only a limited range of perturbations for a single Information Extraction (IE) task. Considering the powerful generation capabilities of Large Language Models (LLMs), we introduce a new benchmark dataset for Robust UIE, called RUIE-Bench. We show that training with only 15% of the data leads to an average 7.5% relative performance improvement across three IE tasks.
arXiv Detail & Related papers (2025-03-05T05:39:29Z)
- Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness [65.01625761120924]
We argue that a valuable sample should be informative of the task, non-redundant, and representative of the sample distribution (i.e., not an outlier). We propose a collaborative framework, DataTailor, which leverages three key principles (informativeness, uniqueness, and representativeness) for effective data selection. Experiments on various benchmarks demonstrate that DataTailor achieves 100.8% of the performance of full-data fine-tuning with only 15% of the data.
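Abstractly, selection under these three principles reduces to scoring each sample and keeping a budgeted top fraction. A minimal sketch, assuming equal weights and toy scorers (the paper's actual estimators are more involved):

```python
# Illustrative weighted scoring and top-k selection in the spirit of
# DataTailor's three principles. Scorers and weights are assumptions.
def select_samples(samples, informativeness, uniqueness, representativeness,
                   budget=0.15, weights=(1/3, 1/3, 1/3)):
    """Keep the top `budget` fraction of samples by a weighted value score."""
    scored = sorted(
        samples,
        key=lambda s: (weights[0] * informativeness(s)
                       + weights[1] * uniqueness(s)
                       + weights[2] * representativeness(s)),
        reverse=True,
    )
    k = max(1, int(budget * len(samples)))
    return scored[:k]

# Toy scorers on strings; real estimators would use model-derived signals.
pool = ["short", "a much longer and more detailed sample", "medium sample"]
print(select_samples(pool, len, lambda s: len(set(s)), lambda s: 1.0,
                     budget=0.34))
```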
arXiv Detail & Related papers (2024-12-09T08:36:10Z)
- Multitask Mayhem: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning [1.3307486544794784]
Red-teaming and safety-alignment efforts show that fine-tuning models on benign (non-harmful) data can compromise safety.
This paper explores the task-wise safety degradation due to fine-tuning on downstream tasks such as summarization, code generation, translation, and classification.
Our work underscores the need for generalized alignment measures to ensure safer and more robust models.
arXiv Detail & Related papers (2024-09-18T08:04:24Z)
- SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming [0.0]
We introduce SAGE, a novel pipeline for generating synthetic alignment and red-teaming data.
SAGE uses a detailed taxonomy to produce safety-alignment and red-teaming data across a wide range of topics.
We show that the red-teaming data generated through SAGE jailbreaks state-of-the-art LLMs in more than 27 out of 32 sub-categories, and in more than 58 out of 279 leaf-categories.
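Taxonomy-driven generation of this kind can be sketched as a walk over categories, requesting prompts per leaf. The taxonomy content and the `llm` callable below are illustrative stand-ins, not SAGE's actual taxonomy or prompts:

```python
# Sketch of taxonomy-driven red-teaming data generation: traverse a topic
# taxonomy and request probes per leaf category. All contents are placeholders.
taxonomy = {
    "cybercrime": ["phishing", "malware"],
    "self-harm": ["encouragement"],
}

def generate_red_team_data(taxonomy, llm, per_leaf=2):
    data = []
    for topic, leaves in taxonomy.items():
        for leaf in leaves:
            for _ in range(per_leaf):
                prompt = (f"Write a red-teaming probe about {topic}/{leaf} "
                          f"for safety evaluation.")
                data.append({"category": f"{topic}/{leaf}",
                             "instruction": llm(prompt)})
    return data

fake_llm = lambda p: f"[generated probe for: {p[:40]}...]"
print(len(generate_red_team_data(taxonomy, fake_llm)))  # 3 leaves x 2 = 6
```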
arXiv Detail & Related papers (2024-08-14T08:38:31Z)
- Exploring RAG-based Vulnerability Augmentation with LLMs [19.45598962972431]
VulScribeR is a novel solution that leverages carefully curated prompt templates to augment vulnerable datasets. Our approach beats two SOTA methods, Vulgen and VGX, and Random Oversampling (ROS) by 27.48%, 27.93%, and 15.41% in F1-score with 5K generated vulnerable samples on average.
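The retrieve-then-prompt shape of such a RAG-based augmenter can be sketched as follows; the retriever, template wording, and `llm` callable are assumptions, not VulScribeR's implementation:

```python
# Sketch of retrieval-augmented vulnerability augmentation: retrieve similar
# known-vulnerable snippets, fill a prompt template, ask an LLM for a variant.
def augment_vulnerable(seed_snippet, retriever, llm, k=3):
    """Retrieve k similar vulnerable snippets, then prompt for a new variant."""
    neighbors = retriever(seed_snippet, k)   # similar known-vulnerable code
    context = "\n---\n".join(neighbors)
    prompt = (f"Given these vulnerable examples:\n{context}\n"
              f"Generate a realistic new variant of:\n{seed_snippet}")
    return llm(prompt)
```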
arXiv Detail & Related papers (2024-08-07T23:22:58Z)
- Do as I do (Safely): Mitigating Task-Specific Fine-tuning Risks in Large Language Models [93.08860674071636]
We show how malicious actors can subtly manipulate the structure of almost any task-specific dataset to foster dangerous model behaviors. We propose a novel mitigation strategy that mixes in safety data which mimics the task format and prompting style of the user data.
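A minimal sketch of that mitigation, assuming a simple prompt/response record format and a 10% mixing ratio (both assumptions; the paper's exact recipe is not reproduced):

```python
# Sketch of mixing task-formatted safety data into a fine-tuning set.
import random

def format_like_task(safety_example, task_template):
    """Recast a safety pair into the downstream task's prompt style."""
    return {"prompt": task_template.format(input=safety_example["prompt"]),
            "response": safety_example["response"]}

def mix_safety_data(task_data, safety_data, task_template, ratio=0.1):
    n = max(1, int(ratio * len(task_data)))
    formatted = [format_like_task(s, task_template)
                 for s in random.sample(safety_data, min(n, len(safety_data)))]
    mixed = task_data + formatted
    random.shuffle(mixed)
    return mixed

template = "Summarize the following text:\n{input}"
task = [{"prompt": template.format(input="..."), "response": "..."}] * 9
safety = [{"prompt": "How do I make a weapon?",
           "response": "I can't help with that."}]
print(len(mix_safety_data(task, safety, template)))  # 9 task + 1 safety = 10
```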
arXiv Detail & Related papers (2024-06-12T18:33:11Z)
- DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints [68.82294911302579]
We introduce DiveR-CT, which relaxes conventional constraints on the objective and semantic reward, granting greater freedom for the policy to enhance diversity. Our experiments demonstrate DiveR-CT's marked superiority over baselines by 1) generating data that perform better on various diversity metrics across different attack success rate levels, 2) better enhancing resiliency in blue-team models through safety tuning on the collected data, 3) allowing dynamic control of objective weights for reliable and controllable attack success rates, and 4) reducing susceptibility to reward over-optimization.
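One way to read "relaxing constraints" is that attack success becomes a constraint to satisfy rather than a score to maximize, freeing reward budget for diversity. The sketch below is that interpretation only; the specific terms, target value, and weighting are assumptions, not the paper's objective:

```python
# Sketch of a relaxed-constraint reward for RL red-teaming: penalize only
# shortfall below a target attack success rate, and reward semantic novelty.
def red_team_reward(attack_success, semantic_novelty, target_asr=0.5,
                    diversity_weight=1.0):
    """Reward = constraint satisfaction on success + bonus for novel attacks."""
    # Falling short of the target is penalized; exceeding it earns nothing
    # extra, which leaves the policy free to spend effort on diversity.
    constraint_term = min(0.0, attack_success - target_asr)
    return constraint_term + diversity_weight * semantic_novelty

print(red_team_reward(attack_success=0.6, semantic_novelty=0.8))
```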
arXiv Detail & Related papers (2024-05-29T12:12:09Z)
- Model Stealing Attack against Graph Classification with Authenticity, Uncertainty and Diversity [80.16488817177182]
Graph neural networks (GNNs) are vulnerable to model stealing attacks, in which an adversary duplicates the target model using only query access.
We introduce three model stealing attacks adapted to different real-world scenarios.
arXiv Detail & Related papers (2023-12-18T05:42:31Z)
- Benchmarking the Robustness of LiDAR Semantic Segmentation Models [78.6597530416523]
In this paper, we aim to comprehensively analyze the robustness of LiDAR semantic segmentation models under various corruptions.
We propose a new benchmark called SemanticKITTI-C, which features 16 out-of-domain LiDAR corruptions in three groups, namely adverse weather, measurement noise and cross-device discrepancy.
We design a robust LiDAR segmentation model (RLSeg) which greatly boosts the robustness with simple but effective modifications.
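For a concrete feel of the "measurement noise" corruption group, a minimal sketch is jittering point coordinates with Gaussian noise; the sigma value is an assumption, and SemanticKITTI-C's actual corruption parameters are not reproduced:

```python
# Illustrative measurement-noise corruption for a LiDAR point cloud.
import numpy as np

def add_measurement_noise(points, sigma=0.03, seed=0):
    """points: (N, 3) array of x, y, z coordinates in meters."""
    rng = np.random.default_rng(seed)
    return points + rng.normal(0.0, sigma, size=points.shape)

cloud = np.random.rand(1000, 3) * 50.0   # synthetic stand-in point cloud
noisy = add_measurement_noise(cloud)
print(np.abs(noisy - cloud).mean())      # mean per-coordinate displacement
```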
arXiv Detail & Related papers (2023-01-03T06:47:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.