Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale datasets for Responsible LLMs
- URL: http://arxiv.org/abs/2505.02009v2
- Date: Wed, 21 May 2025 06:50:54 GMT
- Title: Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale datasets for Responsible LLMs
- Authors: Sai Krishna Mendu, Harish Yenala, Aditi Gulati, Shanu Kumar, Parag Agrawal
- Abstract summary: Large language models (LLMs) have become integral to various real-world applications, leveraging massive, web-sourced datasets like Common Crawl, C4, and FineWeb for pretraining. Training LLMs on such unfiltered data risks perpetuating toxic behaviors, spreading misinformation, and amplifying societal biases. This paper presents a large-scale analysis of inappropriate content across these datasets, offering a comprehensive taxonomy that categorizes harmful webpages into Topical and Toxic based on their intent.
- Score: 1.7451266777840306
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have become integral to various real-world applications, leveraging massive, web-sourced datasets like Common Crawl, C4, and FineWeb for pretraining. While these datasets provide linguistic data essential for high-quality natural language generation, they often contain harmful content, such as hate speech, misinformation, and biased narratives. Training LLMs on such unfiltered data risks perpetuating toxic behaviors, spreading misinformation, and amplifying societal biases, which can undermine trust in LLM-driven applications and raise ethical concerns about their use. This paper presents a large-scale analysis of inappropriate content across these datasets, offering a comprehensive taxonomy that categorizes harmful webpages into Topical and Toxic based on their intent. We also introduce a prompt evaluation dataset, a high-accuracy Topical and Toxic Prompt (TTP), and a transformer-based model (HarmFormer) for harmful content filtering. Additionally, we create a new multi-harm open-ended toxicity benchmark (HAVOC) and provide crucial insights into how models respond to adversarial toxic inputs. We share TTP, TTP-Eval, HAVOC, and a sample of C4 with HarmFormer inference results. Our work offers insights into ensuring safer LLM pretraining and serves as a resource for Responsible AI (RAI) compliance.
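The abstract positions HarmFormer as a transformer-based filter run over web-scale pretraining text. A minimal sketch of such a filtering pass is shown below, assuming a Hugging Face text-classification checkpoint; the checkpoint path, label names, and threshold are placeholders, not the authors' released artifacts:

```python
# Minimal sketch of a HarmFormer-style filtering pass over web documents.
# The checkpoint path and the "Toxic" label are hypothetical placeholders.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="path/to/harmformer-checkpoint",  # hypothetical; substitute the released model
)

def filter_pretraining_docs(docs, threshold=0.5):
    """Keep documents the classifier does not flag as Toxic."""
    kept = []
    for doc in docs:
        result = classifier(doc, truncation=True)[0]  # e.g. {"label": "Toxic", "score": 0.93}
        if not (result["label"] == "Toxic" and result["score"] >= threshold):
            kept.append(doc)
    return kept
```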
Related papers
- Advancing Harmful Content Detection in Organizational Research: Integrating Large Language Models with Elo Rating System [0.0]
Large language models (LLMs) offer promising opportunities for organizational research. Their built-in moderation systems can create problems when researchers try to analyze harmful content. This paper introduces an Elo rating-based method that significantly improves LLM performance for harmful content analysis (a standard Elo update is sketched below).
arXiv Detail & Related papers (2025-06-19T20:01:12Z)
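The method above builds on the standard Elo rating system. A minimal sketch of the textbook Elo update follows; the K-factor, starting ratings, and how LLM judgments map to match outcomes are assumptions, since this summary does not specify the paper's exact adaptation:

```python
# Textbook Elo rating update; the K-factor and starting ratings are assumptions.
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a: float, rating_b: float, outcome_a: float, k: float = 32.0):
    """Update both ratings after one pairwise comparison.

    outcome_a is 1.0 if A wins (e.g. A is judged more harmful), 0.0 if B
    wins, and 0.5 for a tie.
    """
    ea = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (outcome_a - ea)
    new_b = rating_b + k * ((1.0 - outcome_a) - (1.0 - ea))
    return new_a, new_b

# Example: both texts start at 1000 and A is judged more harmful than B.
a, b = elo_update(1000.0, 1000.0, outcome_a=1.0)  # a rises, b falls
```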
- ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark [50.89916747049978]
Existing resources for harmful content detection are predominantly focused on English, with Chinese datasets remaining scarce and often limited in scope. We present a comprehensive, professionally annotated benchmark for Chinese content harm detection, which covers six representative categories and is constructed entirely from real-world data. We propose a knowledge-augmented baseline that integrates both human-annotated knowledge rules and implicit knowledge from large language models, enabling smaller models to achieve performance comparable to state-of-the-art LLMs.
arXiv Detail & Related papers (2025-06-12T17:57:05Z)
- Understanding and Mitigating Toxicity in Image-Text Pretraining Datasets: A Case Study on LLaVA [0.0]
This work removes 7,531 toxic image-text pairs from the LLaVA pre-training dataset. We offer guidelines for implementing robust toxicity detection pipelines.
arXiv Detail & Related papers (2025-05-09T18:01:50Z)
- LLM-based Semantic Augmentation for Harmful Content Detection [5.954202581988127]
This paper introduces an approach that prompts large language models to clean noisy text and provide context-rich explanations (a minimal prompting sketch follows below). We evaluate on the SemEval 2024 multi-label Persuasive Meme dataset and validate on the Google Jigsaw toxic comments and Facebook hateful memes datasets. Our results reveal that zero-shot LLM classification underperforms on these high-context tasks compared to supervised models.
arXiv Detail & Related papers (2025-04-22T02:59:03Z)
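As a rough sketch of the prompting idea above (not the paper's actual prompts), an LLM can be asked to normalize a noisy post and explain its context before classification; the prompt wording and the call_llm helper below are hypothetical stand-ins:

```python
# Hypothetical semantic-augmentation step: clean noisy text and elicit a
# context explanation before handing both to a downstream classifier.
CLEAN_AND_EXPLAIN_PROMPT = """Rewrite the following social media post in plain,
well-formed English, then briefly explain any slang, references, or context
a harmful-content classifier would need.

Post: {post}"""

def call_llm(prompt: str) -> str:
    """Stand-in for any chat-completion client; not a real API."""
    raise NotImplementedError("wire up your preferred LLM client here")

def augment(post: str) -> str:
    """Return cleaned text plus explanation for downstream detection."""
    return call_llm(CLEAN_AND_EXPLAIN_PROMPT.format(post=post))
```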
- ToxiLab: How Well Do Open-Source LLMs Generate Synthetic Toxicity Data? [29.23490658406256]
This study explores the potential of open-source LLMs for harmful data synthesis. We evaluate their ability to generate diverse, high-quality harmful data while minimizing hallucination and duplication. Our findings demonstrate that fine-tuned open-source LLMs provide scalable and cost-effective solutions to augment toxic content detection datasets.
arXiv Detail & Related papers (2024-11-18T00:21:14Z)
- Large Language Models can be Strong Self-Detoxifiers [82.6594169242814]
Self-disciplined Autoregressive Sampling (SASA) is a lightweight controlled decoding algorithm for toxicity reduction in large language models (LLMs). SASA tracks the margin of the current output to steer the generation away from the toxic subspace by adjusting the autoregressive sampling strategy (a simplified sketch follows below). It is evaluated on LLMs of different scales and natures, namely Llama-3.1-Instruct (8B), Llama-2 (7B), and GPT2-L, with the RealToxicityPrompts, BOLD, and AttaQ benchmarks.
arXiv Detail & Related papers (2024-10-04T17:45:15Z)
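The SASA summary above describes steering decoding by the margin of the current output relative to a toxic subspace. The sketch below fills in one plausible reading: score top-k candidate continuations with a linear classifier over hidden states and reweight sampling accordingly; the classifier, the beta strength, and the per-candidate hidden states are all illustrative assumptions, not the authors' implementation:

```python
# Much-simplified, margin-based detoxified sampling in the spirit of SASA.
# (w, b) is an assumed linear toxicity classifier over hidden states, with
# margin > 0 meaning non-toxic; beta controls steering strength.
import torch

def margin_steered_step(logits, candidate_hidden, w, b, beta=5.0, top_k=50):
    """Sample the next token, boosting candidates with non-toxic margins.

    logits: (vocab,) raw next-token logits.
    candidate_hidden: (vocab, d) hidden state reached after emitting each
        candidate token (in practice computed only for the top-k).
    """
    topk = torch.topk(logits, top_k)
    margins = candidate_hidden[topk.indices] @ w + b   # (top_k,)
    adjusted = topk.values + beta * margins            # shift toward non-toxic
    probs = torch.softmax(adjusted, dim=-1)
    choice = torch.multinomial(probs, 1)
    return topk.indices[choice]
```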
- ShieldGemma: Generative AI Content Moderation Based on Gemma [49.91147965876678]
ShieldGemma is a suite of safety content moderation models built upon Gemma2. The models provide robust, state-of-the-art predictions of safety risks across key harm types.
arXiv Detail & Related papers (2024-07-31T17:48:14Z)
- "Glue pizza and eat rocks" -- Exploiting Vulnerabilities in Retrieval-Augmented Generative Models [74.05368440735468]
Retrieval-Augmented Generative (RAG) models enhance Large Language Models (LLMs) by integrating external knowledge bases. In this paper, we demonstrate a security threat where adversaries can exploit the openness of these knowledge bases.
arXiv Detail & Related papers (2024-06-26T05:36:23Z)
- Realistic Evaluation of Toxicity in Large Language Models [28.580995165272086]
Large language models (LLMs) have become integral to our professional and daily lives.
The huge amount of data that endows them with vast and diverse knowledge also exposes them to inevitable toxicity and bias.
This paper introduces the new Thoroughly Engineered Toxicity dataset, comprising manually crafted prompts.
arXiv Detail & Related papers (2024-05-17T09:42:59Z)
- The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements.
LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information.
Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z)
- KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models.
It incorporates an LLM-powered "interactor" role for the first time to accomplish a dynamic contamination-resilient evaluation.
Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
arXiv Detail & Related papers (2024-02-23T01:30:39Z)
- ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework that uses prompt chaining to perturb the original evidence and generate new test cases (a minimal sketch follows below). We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets. Our generated data is human-readable and useful for triggering hallucinations in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
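As a loose illustration of the prompt-chaining idea above (not ReEval's actual prompts), one chain might first perturb the supporting evidence and then rebuild the test case around it; call_llm is again a hypothetical client stub:

```python
# Hypothetical two-step prompt chain in the spirit of ReEval: perturb the
# evidence, then pair it with the original question as a new test case.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire up an LLM client here")

def perturb_evidence(evidence: str) -> str:
    return call_llm(
        "Rewrite the passage below, changing one factual detail while "
        f"keeping it fluent and plausible.\n\nPassage: {evidence}"
    )

def make_test_case(question: str, evidence: str) -> dict:
    """Probe whether a RAG system still asserts the original, now
    unsupported, answer when given the perturbed evidence."""
    return {"question": question, "evidence": perturb_evidence(evidence)}
```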
- On the Risk of Misinformation Pollution with Large Language Models [127.1107824751703]
We investigate the potential misuse of modern Large Language Models (LLMs) for generating credible-sounding misinformation.
Our study reveals that LLMs can act as effective misinformation generators, leading to a significant degradation in the performance of Open-Domain Question Answering (ODQA) systems.
arXiv Detail & Related papers (2023-05-23T04:10:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site. This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.