Text Detoxification: Data Efficiency, Semantic Preservation and Model Generalization
- URL: http://arxiv.org/abs/2507.01050v2
- Date: Mon, 07 Jul 2025 07:48:05 GMT
- Title: Text Detoxification: Data Efficiency, Semantic Preservation and Model Generalization
- Authors: Jing Yu, Yibo Zhao, Jiapeng Zhu, Wenming Shao, Bo Pang, Zhao Zhang, Xiang Li
- Abstract summary: The dissemination of toxic content on social media poses a serious threat to online environments and public discourse. Existing approaches often struggle to simultaneously achieve strong detoxification performance, semantic preservation, and robustness to out-of-distribution data. We propose a two-stage training framework that jointly optimizes for data efficiency, semantic preservation, and model generalization.
- Score: 23.328207651816957
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The widespread dissemination of toxic content on social media poses a serious threat to both online environments and public discourse, highlighting the urgent need for detoxification methods that effectively remove toxicity while preserving the original semantics. However, existing approaches often struggle to simultaneously achieve strong detoxification performance, semantic preservation, and robustness to out-of-distribution data. Moreover, they typically rely on costly, manually annotated parallel corpora while showing poor data efficiency. To address these challenges, we propose a two-stage training framework that jointly optimizes for data efficiency, semantic preservation, and model generalization. We first perform supervised fine-tuning on a small set of high-quality, filtered parallel data to establish a strong initialization. Then, we leverage unlabeled toxic inputs and a custom-designed reward model to train the LLM using Group Relative Policy Optimization. Experimental results demonstrate that our method effectively mitigates the trade-offs faced by previous work, achieving state-of-the-art performance with improved generalization and significantly reduced dependence on annotated data. Our code is available at: https://github.com/allacnobug/Detoxification-of-Text.
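The paper's reward model is learned, but its two ingredients — penalizing residual toxicity while rewarding semantic preservation — and the group-relative advantage normalization at the heart of GRPO can be sketched with toy stand-ins. Everything below (the lexicon, both scorers, and the weights) is a hypothetical illustration, not the authors' implementation:

```python
import statistics

# Toy stand-ins; the paper trains a custom-designed reward model instead.
TOXIC_LEXICON = {"idiot", "stupid", "trash"}  # hypothetical toy lexicon

def toxicity_score(text: str) -> float:
    """Fraction of tokens flagged as toxic (stand-in for a toxicity classifier)."""
    tokens = text.lower().split()
    return sum(t in TOXIC_LEXICON for t in tokens) / len(tokens) if tokens else 0.0

def semantic_similarity(src: str, gen: str) -> float:
    """Jaccard overlap of non-toxic tokens (stand-in for embedding similarity)."""
    a = {t for t in src.lower().split() if t not in TOXIC_LEXICON}
    b = {t for t in gen.lower().split() if t not in TOXIC_LEXICON}
    return len(a & b) / len(a | b) if (a | b) else 1.0

def detox_reward(source: str, generation: str,
                 w_tox: float = 0.5, w_sem: float = 0.5) -> float:
    """Composite reward: non-toxicity plus semantic preservation."""
    return (w_tox * (1.0 - toxicity_score(generation))
            + w_sem * semantic_similarity(source, generation))

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each reward within its sampled group."""
    mu, sd = statistics.mean(rewards), statistics.pstdev(rewards)
    return [(r - mu) / (sd or 1.0) for r in rewards]

src = "you are an idiot and your plan is trash"
rewrites = ["you are mistaken and your plan is flawed",  # detoxified, meaning kept
            "you are a stupid idiot",                    # still toxic
            "the weather is nice"]                       # clean but off-topic
rewards = [detox_reward(src, g) for g in rewrites]
print(max(range(3), key=lambda i: rewards[i]))  # 0: the faithful detoxified rewrite wins
print(group_relative_advantages(rewards))
```

The composite reward penalizes both failure modes that the abstract highlights: a generation that stays toxic and one that drifts from the original meaning.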
Related papers
- Generalized Linear Bandits: Almost Optimal Regret with One-Pass Update [60.414548453838506]
We study the generalized linear bandit (GLB) problem, a contextual multi-armed bandit framework that extends the classical linear model by incorporating a non-linear link function. GLBs are widely applicable to real-world scenarios, but their non-linear nature introduces significant challenges in achieving both computational and statistical efficiency. We propose a jointly efficient algorithm that attains a nearly optimal regret bound with $\mathcal{O}(1)$ time and space complexities per round.
arXiv Detail & Related papers (2025-07-16T02:24:21Z) - RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs [76.3459242819381]
RefineX is a novel framework for large-scale, surgical refinement of pre-training data through programmatic editing tasks. The core strength of RefineX lies in distilling high-quality, expert-guided end-to-end refinement results into minimal edit-based deletion programs. We evaluate RefineX across from-scratch pre-training at multiple model scales and find that it consistently outperforms models trained on raw, filtered, or alternatively refined data.
arXiv Detail & Related papers (2025-07-04T02:19:58Z) - Constrained Entropic Unlearning: A Primal-Dual Framework for Large Language Models [7.566515311806724]
Large Language Models (LLMs) deployed in real-world settings increasingly face the need to unlearn sensitive, outdated, or proprietary information. Existing unlearning methods formulate forgetting and retention as a regularized trade-off, combining both objectives into a single scalarized loss. We propose a new formulation of LLM unlearning as a constrained optimization problem: forgetting is enforced via a novel logit-margin flattening loss.
arXiv Detail & Related papers (2025-06-05T17:55:23Z) - Adaptive Detoxification: Safeguarding General Capabilities of LLMs through Toxicity-Aware Knowledge Editing [49.85884082568318]
ToxEdit is a toxicity-aware knowledge editing approach. It dynamically detects toxic activation patterns during forward propagation. It then routes computations through adaptive inter-layer pathways to mitigate toxicity effectively.
arXiv Detail & Related papers (2025-05-28T12:37:06Z) - Synthetic Data is an Elegant GIFT for Continual Vision-Language Models [52.343627275005026]
GIFT is a novel continual fine-tuning approach to overcome catastrophic forgetting in Vision-Language Models. We employ a pre-trained diffusion model to recreate both pre-training and learned downstream task data. Our method consistently outperforms previous state-of-the-art approaches across various settings.
arXiv Detail & Related papers (2025-03-06T09:09:18Z) - Soft-Label Integration for Robust Toxicity Classification [39.159343518702805]
This work introduces a novel bi-level optimization framework that integrates crowdsourced annotations with the soft-labeling technique.
GroupDRO is used to enhance the robustness against out-of-distribution (OOD) risk.
Experimental results demonstrate that our approach outperforms existing baseline methods in terms of both average and worst-group accuracy.
arXiv Detail & Related papers (2024-10-18T22:36:03Z) - Contrastive Perplexity for Controlled Generation: An Application in Detoxifying Large Language Models [21.341749351654453]
The generation of toxic content by large language models (LLMs) remains a critical challenge for the safe deployment of language technology. We propose a novel framework for implicit knowledge editing and controlled text generation by fine-tuning LLMs with a prototype-based contrastive perplexity objective.
arXiv Detail & Related papers (2024-01-16T16:49:39Z) - Erasing Undesirable Influence in Diffusion Models [51.225365010401006]
Diffusion models are highly effective at generating high-quality images but pose risks, such as the unintentional generation of NSFW (not safe for work) content.
In this work, we introduce EraseDiff, an algorithm designed to preserve the utility of the diffusion model on retained data while removing the unwanted information associated with the data to be forgotten.
arXiv Detail & Related papers (2024-01-11T09:30:36Z) - Goodtriever: Adaptive Toxicity Mitigation with Retrieval-augmented Models [11.805944680474823]
Goodtriever is a flexible methodology that matches the current state-of-the-art toxicity mitigation.
By incorporating a retrieval-based approach at decoding time, Goodtriever enables toxicity-controlled text generation.
arXiv Detail & Related papers (2023-10-11T15:30:35Z) - On Practical Aspects of Aggregation Defenses against Data Poisoning Attacks [58.718697580177356]
Attacks on deep learning models with malicious training samples are known as data poisoning.
Recent advances in defense strategies against data poisoning have highlighted the effectiveness of aggregation schemes in achieving certified poisoning robustness.
Here we focus on Deep Partition Aggregation, a representative aggregation defense, and assess its practical aspects, including efficiency, performance, and robustness.
arXiv Detail & Related papers (2023-06-28T17:59:35Z) - Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition [94.56304526014875]
We propose the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for Facial Expression Recognition (FER).
Our method exploits self-supervised pretraining to learn good feature representations from the target data.
We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER.
arXiv Detail & Related papers (2022-10-11T08:24:50Z) - Unified Detoxifying and Debiasing in Language Generation via Inference-time Adaptive Optimization [32.50246008433889]
Pre-trained language models (PLMs) have prospered in various natural language generation (NLG) tasks due to their ability to generate fairly fluent text.
These models are observed to capture and reproduce harmful content from their training corpora, typically toxic language and social biases, raising severe moral issues.
We propose the first unified framework of detoxifying and debiasing called UDDIA, which jointly formalizes these two problems as rectifying the output space.
arXiv Detail & Related papers (2022-10-10T08:45:25Z) - DEALIO: Data-Efficient Adversarial Learning for Imitation from Observation [57.358212277226315]
In imitation learning from observation (IfO), a learning agent seeks to imitate a demonstrating agent using only observations of the demonstrated behavior, without access to the control signals generated by the demonstrator.
Recent methods based on adversarial imitation learning have led to state-of-the-art performance on IfO problems, but they typically suffer from high sample complexity due to a reliance on data-inefficient, model-free reinforcement learning algorithms.
This issue makes them impractical to deploy in real-world settings, where gathering samples can incur high costs in terms of time, energy, and risk.
We propose a more data-efficient IfO algorithm.
arXiv Detail & Related papers (2021-03-31T23:46:32Z) - Maximum-Entropy Adversarial Data Augmentation for Improved Generalization and Robustness [21.630597505797073]
We propose a novel and effective regularization term for adversarial data augmentation.
We theoretically derive it from the information bottleneck principle, which results in a maximum-entropy formulation.
Our method consistently outperforms the existing state of the art by a statistically significant margin.
arXiv Detail & Related papers (2020-10-15T20:02:23Z) - A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation [53.8171136907856]
We introduce a set of simple yet effective data augmentation strategies dubbed cutoff.
cutoff relies on sampling consistency and thus adds little computational overhead.
cutoff consistently outperforms adversarial training and achieves state-of-the-art results on the IWSLT2014 German-English dataset.
arXiv Detail & Related papers (2020-09-29T07:08:35Z)
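The span variant of the cutoff augmentation summarized above — masking a contiguous chunk of the input so the model must rely on the surrounding context — can be sketched at the token level. The function name, mask token, and ratio below are assumptions for illustration; the original method operates on embedding matrices rather than token strings:

```python
import random

def span_cutoff(tokens, ratio=0.25, mask="[PAD]", rng=None):
    """Replace one contiguous span (~ratio of the sequence) with mask tokens."""
    rng = rng or random.Random(0)  # seeded here for reproducibility of the demo
    n = len(tokens)
    span = max(1, int(n * ratio))
    if span >= n:
        return [mask] * n
    start = rng.randrange(0, n - span + 1)  # random start of the masked span
    return tokens[:start] + [mask] * span + tokens[start + span:]

sent = "the quick brown fox jumps over the lazy dog".split()
aug = span_cutoff(sent, ratio=0.3)
print(len(aug) == len(sent))  # True: length preserved, one span masked out
```

Because the augmented sequence keeps its length and alignment, the same labels can be reused, which is one reason such cutoff strategies add little computational overhead.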
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.