ToxiLab: How Well Do Open-Source LLMs Generate Synthetic Toxicity Data?
- URL: http://arxiv.org/abs/2411.15175v4
- Date: Sat, 22 Feb 2025 16:42:03 GMT
- Title: ToxiLab: How Well Do Open-Source LLMs Generate Synthetic Toxicity Data?
- Authors: Zheng Hui, Zhaoxiao Guo, Hang Zhao, Juanyong Duan, Lin Ai, Yinheng Li, Julia Hirschberg, Congrui Huang
- Abstract summary: This study explores the potential of open-source LLMs for harmful data synthesis. We evaluate their ability to generate diverse, high-quality harmful data while minimizing hallucination and duplication. Our findings demonstrate that fine-tuned open-source LLMs provide scalable and cost-effective solutions to augment toxic content detection datasets.
- Score: 29.23490658406256
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effective toxic content detection relies heavily on high-quality and diverse data, which serve as the foundation for robust content moderation models. Synthetic data has become a common approach for training models across various NLP tasks. However, its effectiveness remains uncertain for highly subjective tasks like hate speech detection, with previous research yielding mixed results. This study explores the potential of open-source LLMs for harmful data synthesis, utilizing controlled prompting and supervised fine-tuning techniques to enhance data quality and diversity. We systematically evaluated 6 open source LLMs on 5 datasets, assessing their ability to generate diverse, high-quality harmful data while minimizing hallucination and duplication. Our results show that Mistral consistently outperforms other open models, and supervised fine-tuning significantly enhances data reliability and diversity. We further analyze the trade-offs between prompt-based vs. fine-tuned toxic data synthesis, discuss real-world deployment challenges, and highlight ethical considerations. Our findings demonstrate that fine-tuned open source LLMs provide scalable and cost-effective solutions to augment toxic content detection datasets, paving the way for more accessible and transparent content moderation tools.
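To make the described setup concrete, below is a minimal Python sketch of the prompt-based synthesis route: controlled prompting of an open-source instruction-tuned model, followed by a simple near-duplicate filter of the kind used to limit duplication. The model checkpoint, prompt template, sampling settings, and similarity threshold are illustrative assumptions, not values taken from the paper.

```python
# A minimal sketch (not the paper's exact pipeline) of prompt-based synthetic
# toxicity-data generation with an open-source LLM, plus a cheap near-duplicate
# filter. Model id, prompt, and threshold are illustrative assumptions.
from difflib import SequenceMatcher
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # assumed checkpoint; the abstract only names "Mistral"
    device_map="auto",
)

# Controlled prompt: the category label and a seed example steer the generation.
PROMPT_TEMPLATE = (
    "You are helping build a content-moderation training set.\n"
    "Category: {category}\n"
    "Seed example: {seed}\n"
    "Write one new, distinct example of this category:"
)

def generate_candidates(category: str, seed: str, n: int = 8) -> list[str]:
    """Sample n candidate examples for one toxicity category."""
    prompt = PROMPT_TEMPLATE.format(category=category, seed=seed)
    outputs = generator(
        prompt,
        num_return_sequences=n,
        do_sample=True,
        temperature=0.9,
        max_new_tokens=64,
        return_full_text=False,
    )
    return [o["generated_text"].strip() for o in outputs]

def filter_near_duplicates(texts: list[str], threshold: float = 0.8) -> list[str]:
    """Greedily keep a text only if it is not too similar to any kept text
    (character-level ratio as a cheap duplication proxy)."""
    kept: list[str] = []
    for t in texts:
        if all(SequenceMatcher(None, t.lower(), k.lower()).ratio() < threshold for k in kept):
            kept.append(t)
    return kept

if __name__ == "__main__":
    candidates = generate_candidates("targeted insult", "You people are worthless.")
    unique = filter_near_duplicates(candidates)
    print(f"kept {len(unique)} of {len(candidates)} candidates after dedup")
```

In the paper's comparison, this prompt-only route is weighed against supervised fine-tuning, which the abstract reports as the more reliable and diverse option.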
Related papers
- What Matters in LLM-generated Data: Diversity and Its Effect on Model Fine-Tuning [22.43647238560673]
We show how varying levels of diversity in LLM-generated data affect downstream model performance. We also investigate the performance of models trained on data that mixes different proportions of LLM-generated data.
arXiv Detail & Related papers (2025-06-24T02:44:58Z) - Aligning Language Models with Observational Data: Opportunities and Risks from a Causal Perspective [0.0]
We study the challenges and opportunities of fine-tuning large language models using observational data. We show that while observational outcomes can provide valuable supervision, directly fine-tuning models on such data can lead them to learn spurious correlations. We propose DeconfoundLM, a method that explicitly removes the effect of known confounders from reward signals.
arXiv Detail & Related papers (2025-05-30T18:44:09Z) - PoisonSwarm: Universal Harmful Information Synthesis via Model Crowdsourcing [7.760708840164335]
We propose PoisonSwarm, which applies a model crowdsourcing strategy to generate diverse harmful data. We decompose each base template into multiple semantic units and perform unit-by-unit toxification. Experiments demonstrate that PoisonSwarm achieves state-of-the-art performance in synthesizing different categories of harmful data.
arXiv Detail & Related papers (2025-05-27T13:33:57Z) - Few-shot LLM Synthetic Data with Distribution Matching [37.55363714371521]
Large language models (LLMs) produce high-quality synthetic data to enhance the performance of smaller models.
LLM-generated synthetic data often differs from real data in key language attributes.
We introduce SynAlign: a synthetic data generation and filtering framework based on key attribute distribution matching.
arXiv Detail & Related papers (2025-02-09T16:43:32Z) - Large Language Models for Market Research: A Data-augmentation Approach [3.3199591445531453]
Large Language Models (LLMs) have transformed artificial intelligence by excelling in complex natural language processing tasks.
Recent studies highlight a significant gap between LLM-generated and human data, with biases introduced when substituting between the two.
We propose a novel statistical data augmentation approach that efficiently integrates LLM-generated data with real data in conjoint analysis.
arXiv Detail & Related papers (2024-12-26T22:06:29Z) - Evaluating Language Models as Synthetic Data Generators [74.80905172696366]
AgoraBench is a benchmark that provides standardized settings and metrics to evaluate LMs' data generation abilities.
Through synthesizing 1.26 million training instances using 6 LMs and training 99 student models, we uncover key insights about LMs' data generation capabilities.
arXiv Detail & Related papers (2024-12-04T19:20:32Z) - Unleashing LLM Reasoning Capability via Scalable Question Synthesis from Scratch [54.12139707822201]
We propose ScaleQuest, a novel, scalable, and cost-effective data synthesis method. By generating diverse questions from scratch, we produce a dataset of 1 million problem-solution pairs. Our experiments demonstrate that models trained on our data outperform those trained on existing open-source datasets.
arXiv Detail & Related papers (2024-10-24T12:42:04Z) - On the Diversity of Synthetic Data and its Impact on Training Large Language Models [34.00031258223175]
Large Language Models (LLMs) have accentuated the need for diverse, high-quality pre-training data.
Synthetic data emerges as a viable solution to the challenges of data scarcity and inaccessibility.
We study the downstream effects of synthetic data diversity during both the pre-training and fine-tuning stages.
arXiv Detail & Related papers (2024-10-19T22:14:07Z) - LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models [55.903148392998965]
We introduce LOKI, a novel benchmark designed to evaluate the ability of LMMs to detect synthetic data across multiple modalities.
The benchmark includes coarse-grained judgment and multiple-choice questions, as well as fine-grained anomaly selection and explanation tasks.
We evaluate 22 open-source LMMs and 6 closed-source models on LOKI, highlighting their potential as synthetic data detectors and also revealing some limitations in the development of LMM capabilities.
arXiv Detail & Related papers (2024-10-13T05:26:36Z) - Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models [79.65071553905021]
We propose Data Advisor, a method for generating data that takes into account the characteristics of the desired dataset.
Data Advisor monitors the status of the generated data, identifies weaknesses in the current dataset, and advises the next iteration of data generation.
arXiv Detail & Related papers (2024-10-07T17:59:58Z) - Enhancing Unsupervised Sentence Embeddings via Knowledge-Driven Data Augmentation and Gaussian-Decayed Contrastive Learning [37.54523122932728]
We propose a pipeline-based data augmentation method via large language models (LLMs).
To tackle the issue of low data diversity, our pipeline utilizes knowledge graphs (KGs) to extract entities and quantities.
To address high data noise, the GCSE model uses a Gaussian-decayed function to limit the impact of false hard negative samples.
arXiv Detail & Related papers (2024-09-19T16:29:58Z) - Entropy Law: The Story Behind Data Compression and LLM Performance [115.70395740286422]
We find that model performance is negatively correlated with the compression ratio of training data, which usually yields a lower training loss.
Based on the findings of the entropy law, we propose an efficient and universal data selection method (a minimal sketch of the compression-ratio idea appears after this list).
We also present an interesting application of entropy law that can detect potential performance risks at the beginning of model training.
arXiv Detail & Related papers (2024-07-09T08:14:29Z) - Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z) - SHED: Shapley-Based Automated Dataset Refinement for Instruction Fine-Tuning [16.307467144690683]
Large Language Models can achieve desirable performance with only a small amount of high-quality data.
Identifying high-quality data from vast datasets to curate small yet effective datasets has emerged as a critical challenge.
We introduce SHED, an automated dataset refinement framework based on Shapley value for instruction fine-tuning.
arXiv Detail & Related papers (2024-04-23T04:56:48Z) - Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z) - LLM-DA: Data Augmentation via Large Language Models for Few-Shot Named
Entity Recognition [67.96794382040547]
LLM-DA is a novel data augmentation technique based on large language models (LLMs) for the few-shot NER task.
Our approach involves employing 14 contextual rewriting strategies, designing entity replacements of the same type, and incorporating noise injection to enhance robustness.
arXiv Detail & Related papers (2024-02-22T14:19:56Z) - ChatGPT Based Data Augmentation for Improved Parameter-Efficient Debiasing of LLMs [65.9625653425636]
Large language models (LLMs) exhibit harmful social biases.
This work introduces a novel approach utilizing ChatGPT to generate synthetic training data.
arXiv Detail & Related papers (2024-02-19T01:28:48Z) - Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes [57.62036621319563]
We introduce CLLM, which leverages the prior knowledge of Large Language Models (LLMs) for data augmentation in the low-data regime.
We demonstrate the superior performance of CLLM in the low-data regime compared to conventional generators.
arXiv Detail & Related papers (2023-12-19T12:34:46Z)