ToxiCraft: A Novel Framework for Synthetic Generation of Harmful Information
- URL: http://arxiv.org/abs/2409.14740v2
- Date: Mon, 14 Apr 2025 18:30:57 GMT
- Title: ToxiCraft: A Novel Framework for Synthetic Generation of Harmful Information
- Authors: Zheng Hui, Zhaoxiao Guo, Hang Zhao, Juanyong Duan, Congrui Huang
- Abstract summary: ToxiCraft is a novel framework for synthesizing datasets of harmful information. With only a small amount of seed data, our framework can generate a wide variety of synthetic, yet remarkably realistic, examples of toxic information.
- Score: 30.333357539780287
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Across NLP tasks, detecting harmful content is crucial for online environments, especially given the growing influence of social media. However, previous research has two main issues: 1) a lack of data in low-resource settings, and 2) inconsistent definitions and criteria for judging harmful content, which require classification models to be both diverse and robust to spurious features. We propose ToxiCraft, a novel framework for synthesizing datasets of harmful information to address these weaknesses. With only a small amount of seed data, our framework can generate a wide variety of synthetic, yet remarkably realistic, examples of toxic information. Experiments across various datasets show a notable improvement in detection model robustness and adaptability, with results that surpass or come close to those obtained with gold labels.
Related papers
- PoisonSwarm: Universal Harmful Information Synthesis via Model Crowdsourcing [7.760708840164335]
We propose PoisonSwarm, which applies the model crowdsourcing strategy to generate diverse harmful data. We decompose each base template into multiple semantic units and perform unit-by-unit toxification. Experiments demonstrate that PoisonSwarm achieves state-of-the-art performance in synthesizing different categories of harmful data.
arXiv Detail & Related papers (2025-05-27T13:33:57Z)
- What's Wrong with Your Synthetic Tabular Data? Using Explainable AI to Evaluate Generative Models [1.024113475677323]
We apply explainable AI (XAI) techniques to a binary detection classifier trained to distinguish real from synthetic data.
While the classifier identifies distributional differences, XAI methods such as permutation feature importance, partial dependence plots, and Shapley values reveal why synthetic data are distinguishable.
This interpretability increases transparency in synthetic data evaluation and provides deeper insights beyond conventional metrics.
arXiv Detail & Related papers (2025-04-29T12:10:52Z)
- Sensitive Content Classification in Social Media: A Holistic Resource and Evaluation [15.355814393928707]
We put forward a unified dataset tailored for social media content moderation across six sensitive categories.
These include conflictual language, profanity, sexually explicit material, drug-related content, self-harm, and spam.
Fine-tuning large language models on this novel dataset yields significant improvements in detection performance compared to open off-the-shelf models.
arXiv Detail & Related papers (2024-11-29T16:44:02Z)
- Little Giants: Synthesizing High-Quality Embedding Data at Scale [71.352883755806]
We introduce SPEED, a framework that aligns open-source small models to efficiently generate large-scale embedding data.
SPEED uses less than 1/10 of the GPT API calls, outperforming the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data.
arXiv Detail & Related papers (2024-10-24T10:47:30Z)
- Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding [61.89781979702939]
This study quantitatively reveals an "impossible trinity" among data quantity, diversity, and quality in pre-training datasets.
Recent efforts seek to refine large-scale, diverse ASR datasets compromised by low quality through synthetic annotations.
We introduce the Video DataFlywheel framework, which iteratively refines video annotations with improved noise control methods.
arXiv Detail & Related papers (2024-09-29T03:33:35Z)
- ToVo: Toxicity Taxonomy via Voting [25.22398575368979]
We propose a dataset creation mechanism that integrates voting and chain-of-thought processes.
Our methodology ensures diverse classification metrics for each sample.
We utilize the dataset created through our proposed mechanism to train our model.
arXiv Detail & Related papers (2024-06-21T02:35:30Z)
- Generation of synthetic data using breast cancer dataset and classification with resnet18 [0.0]
Synthetic data is required for a number of reasons, including the constraints of real data, the expense of collecting labeled data, and privacy and security problems.
A deep learning model called GAN (Generative Adversarial Networks) has been developed with the intention of generating synthetic data.
In this study, the Breast Histopathology dataset was used to generate malignant and negatively labeled synthetic patch images.
arXiv Detail & Related papers (2024-05-25T15:53:27Z)
- Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z)
- Model Stealing Attack against Graph Classification with Authenticity, Uncertainty and Diversity [80.16488817177182]
GNNs are vulnerable to the model stealing attack, a nefarious endeavor geared towards duplicating the target model via query permissions.
We introduce three model stealing attacks to adapt to different actual scenarios.
arXiv Detail & Related papers (2023-12-18T05:42:31Z)
- Image change detection with only a few samples [7.5780621370948635]
A major impediment to the image change detection task is the lack of large annotated datasets covering a wide variety of scenes.
We propose using simple image processing methods for generating synthetic but informative datasets.
We then design an early fusion network based on object detection which could outperform the siamese neural network.
arXiv Detail & Related papers (2023-11-07T07:01:35Z)
- A Discrepancy Aware Framework for Robust Anomaly Detection [51.710249807397695]
We present a Discrepancy Aware Framework (DAF), which demonstrates robust performance consistently with simple and cheap strategies.
Our method leverages an appearance-agnostic cue to guide the decoder in identifying defects, thereby alleviating its reliance on synthetic appearance.
Under the simple synthesis strategies, it outperforms existing methods by a large margin. Furthermore, it achieves state-of-the-art localization performance.
arXiv Detail & Related papers (2023-10-11T15:21:40Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
- Delving into High-Quality Synthetic Face Occlusion Segmentation Datasets [83.749895930242]
We propose two techniques for producing high-quality naturalistic synthetic occluded faces.
We empirically show the effectiveness and robustness of both methods, even for unseen occlusions.
We present two high-resolution real-world occluded face datasets with fine-grained annotations, RealOcc and RealOcc-Wild.
arXiv Detail & Related papers (2022-05-12T17:03:57Z)
- Less is More: Learning from Synthetic Data with Fine-grained Attributes for Person Re-Identification [16.107661617441327]
Person re-identification (re-ID) plays an important role in applications such as public security and video surveillance.
Recently, learning from synthetic data has attracted attention from both academia and the public eye.
We construct and label a large-scale synthetic person dataset named FineGPR with fine-grained attribute distribution.
arXiv Detail & Related papers (2021-09-22T03:12:32Z)
- Attribute analysis with synthetic dataset for person re-identification [15.388939933009668]
Person re-identification (re-ID) plays an important role in applications such as public security and video surveillance.
Recently, learning from synthetic data, which benefits from the popularity of synthetic data engines, has achieved remarkable performance.
Existing synthetic datasets are small in size and lack diversity, which hinders the development of person re-ID in real-world scenarios.
arXiv Detail & Related papers (2020-06-12T12:51:47Z)
- Adversarial Feature Hallucination Networks for Few-Shot Learning [84.31660118264514]
Adversarial Feature Hallucination Networks (AFHN) is based on conditional Wasserstein Generative Adversarial Networks (cWGAN).
Two novel regularizers are incorporated into AFHN to encourage discriminability and diversity of the synthesized features.
arXiv Detail & Related papers (2020-03-30T02:43:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.