GreenLLaMA: A Framework for Detoxification with Explanations
- URL: http://arxiv.org/abs/2402.15951v1
- Date: Sun, 25 Feb 2024 01:56:47 GMT
- Title: GreenLLaMA: A Framework for Detoxification with Explanations
- Authors: Md Tawkat Islam Khondaker, Muhammad Abdul-Mageed, Laks V. S.
Lakshmanan
- Abstract summary: We propose GreenLLaMA, the first comprehensive end-to-end detoxification framework.
We first introduce a cross-platform pseudo-parallel corpus, built with multi-step data processing and generation strategies.
We show that our detoxification models outperform the SoTA model trained with a human-annotated parallel corpus.
- Score: 28.294040692442618
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Prior works on detoxification are scattered in the sense that they do not
cover all aspects of detoxification needed in a real-world scenario. Notably,
prior works restrict the task of developing detoxification models to only a
seen subset of platforms, leaving the question of how the models would perform
on unseen platforms unexplored. Additionally, these works do not address
non-detoxifiability, a phenomenon whereby toxic text cannot be detoxified
without altering its meaning. We propose GreenLLaMA, the first comprehensive
end-to-end detoxification framework, which attempts to alleviate the
aforementioned limitations. We first introduce a cross-platform pseudo-parallel
corpus, built with multi-step data processing and generation strategies that
leverage ChatGPT. We then train a suite of detoxification models on our cross-platform
corpus. We show that our detoxification models outperform the SoTA model
trained with a human-annotated parallel corpus. We further introduce
explanations to promote transparency and trustworthiness. GreenLLaMA
additionally offers a paraphrase detector dedicated to the detoxification task
to tackle non-detoxifiable cases. Through experimental analysis, we
demonstrate the effectiveness of our cross-platform corpus and the robustness
of GreenLLaMA against adversarial toxicity.
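The pipeline the abstract describes, detoxify the input and consult a paraphrase detector to flag non-detoxifiable cases where rewriting would alter the meaning, can be sketched as follows. This is a minimal illustrative sketch: the function names, the word substitution, and the token-overlap heuristic are toy stand-ins, not GreenLLaMA's actual trained models.

```python
# Toy sketch of an end-to-end detoxification flow with a
# non-detoxifiability check. Real systems would use trained models
# for both steps; these stand-ins only illustrate the control flow.

def detoxify(text: str) -> str:
    # Stand-in for the trained detoxification model.
    return text.replace("idiot", "person")

def is_paraphrase(original: str, rewrite: str) -> bool:
    # Stand-in for the dedicated paraphrase detector: a crude
    # token-overlap (Jaccard) heuristic; the real detector is learned.
    a, b = set(original.lower().split()), set(rewrite.lower().split())
    return len(a & b) / max(len(a | b), 1) > 0.5

def detoxify_with_check(text: str) -> dict:
    rewrite = detoxify(text)
    if not is_paraphrase(text, rewrite):
        # Non-detoxifiable: the rewrite altered the meaning too much.
        return {"output": None, "non_detoxifiable": True}
    return {"output": rewrite, "non_detoxifiable": False}
```

The key design point is that the detector runs after generation, so the framework can refuse to emit a meaning-altering rewrite instead of silently returning it.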
Related papers
- MultiParaDetox: Extending Text Detoxification with Parallel Data to New Languages [71.50809576484288]
Text detoxification is a task where a text is paraphrased from a toxic surface form (e.g., featuring rude words) to a neutral register.
Recent approaches for parallel text detoxification corpora collection -- ParaDetox and APPADIA -- were explored only in a monolingual setup.
In this work, we aim to extend the ParaDetox pipeline to multiple languages, presenting MultiParaDetox to automate parallel detoxification corpus collection for potentially any language.
arXiv Detail & Related papers (2024-04-02T15:32:32Z)
- ToXCL: A Unified Framework for Toxic Speech Detection and Explanation [3.803993344850168]
ToXCL is a unified framework for the detection and explanation of implicit toxic speech.
ToXCL achieves new state-of-the-art effectiveness, and outperforms baselines significantly.
arXiv Detail & Related papers (2024-03-25T12:21:38Z)
- Detoxifying Large Language Models via Knowledge Editing [57.0669577257301]
This paper investigates using knowledge editing techniques to detoxify Large Language Models (LLMs).
We construct a benchmark, SafeEdit, which covers nine unsafe categories with various powerful attack prompts.
We conduct experiments with several knowledge editing approaches, indicating that knowledge editing has the potential to efficiently detoxify LLMs with limited impact on general performance.
arXiv Detail & Related papers (2024-03-21T15:18:30Z)
- Fine-Grained Detoxification via Instance-Level Prefixes for Large Language Models [26.474136481185724]
We propose fine-grained detoxification via instance-level prefixes (FGDILP) to mitigate toxic text without additional cost.
FGDILP contrasts the contextualized representation in attention space using a positive prefix-prepended prompt.
We validate that FGDILP enables controlled text generation with regard to toxicity at both the utterance and context levels.
arXiv Detail & Related papers (2024-02-23T09:04:48Z)
- Parameter-Efficient Detoxification with Contrastive Decoding [78.5124331048714]
We introduce Detoxification Generator (DETOXIGEN), an inference-time algorithm that steers the generation away from unwanted styles.
During the actual generation, we use the trained detoxifier to produce undesirable tokens for the generator to contrast against at each decoding step.
We find that it significantly outperforms previous approaches in detoxification metrics while not compromising on the generation quality.
arXiv Detail & Related papers (2024-01-13T01:46:20Z)
- Exploring Methods for Cross-lingual Text Style Transfer: The Case of Text Detoxification [77.45995868988301]
Text detoxification is the task of transferring the style of text from toxic to neutral.
We present a large-scale study of strategies for cross-lingual text detoxification.
arXiv Detail & Related papers (2023-11-23T11:40:28Z) - CMD: a framework for Context-aware Model self-Detoxification [25.02108563221933]
Text detoxification aims to minimize the risk of language models producing toxic content.
Existing detoxification methods fail to achieve a decent balance between detoxification effectiveness and generation quality.
We introduce a Context-aware Model self-Detoxification (CMD) framework that pays attention to both the context and the detoxification process.
arXiv Detail & Related papers (2023-08-16T11:50:38Z)
- DiffuDetox: A Mixed Diffusion Model for Text Detoxification [12.014080113339178]
Text detoxification is a conditional text generation task aiming to remove offensive content from toxic text.
We propose DiffuDetox, a mixed conditional and unconditional diffusion model for text detoxification.
arXiv Detail & Related papers (2023-06-14T13:41:23Z)
- Exploring Model Dynamics for Accumulative Poisoning Discovery [62.08553134316483]
We propose a novel information measure, namely, Memorization Discrepancy, to explore the defense via the model-level information.
By implicitly transferring the changes in the data manipulation to that in the model outputs, Memorization Discrepancy can discover the imperceptible poison samples.
We thoroughly explore its properties and propose Discrepancy-aware Sample Correction (DSC) to defend against accumulative poisoning attacks.
arXiv Detail & Related papers (2023-06-06T14:45:24Z)
- Detoxifying Text with MaRCo: Controllable Revision with Experts and Anti-Experts [57.38912708076231]
We introduce MaRCo, a detoxification algorithm that combines controllable generation and text rewriting methods.
MaRCo uses likelihoods under a non-toxic LM and a toxic LM to find candidate words to mask and potentially replace.
We evaluate our method on several subtle toxicity and microaggression datasets, and show that it not only outperforms baselines on automatic metrics, but MaRCo's rewrites are also preferred 2.1× more often in human evaluation.
arXiv Detail & Related papers (2022-12-20T18:50:00Z)
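MaRCo's masking step, scoring each word by how much more likely it is under the toxic LM than the non-toxic one, can be sketched with toy per-word probabilities. The dictionaries and threshold below are hypothetical stand-ins; the real method scores words with actual LM likelihoods.

```python
# Toy sketch of expert/anti-expert masking: a word whose toxic-LM
# probability greatly exceeds its non-toxic-LM probability becomes a
# candidate for masking and replacement.

def mask_candidates(words, p_toxic, p_nontoxic, threshold=1.5):
    candidates = []
    for w in words:
        # Likelihood ratio; tiny floor avoids division by zero for
        # words either toy model has never scored.
        ratio = p_toxic.get(w, 1e-9) / max(p_nontoxic.get(w, 1e-9), 1e-9)
        if ratio > threshold:
            candidates.append(w)
    return candidates
```

A replacement model would then infill the masked positions, which is where the controllable-generation half of MaRCo comes in.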
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.