GPT-DETOX: An In-Context Learning-Based Paraphraser for Text Detoxification
- URL: http://arxiv.org/abs/2404.03052v1
- Date: Wed, 3 Apr 2024 20:35:36 GMT
- Title: GPT-DETOX: An In-Context Learning-Based Paraphraser for Text Detoxification
- Authors: Ali Pesaranghader, Nikhil Verma, Manasa Bharadwaj,
- Abstract summary: We propose GPT-DETOX as a framework for prompt-based in-context learning for text detoxification using GPT-3.5 Turbo.
To generate few-shot prompts, we propose two methods: word-matching example selection (WMES) and context-matching example selection (CMES).
We take into account ensemble in-context learning (EICL) where the ensemble is shaped by base prompts from zero-shot and all few-shot settings.
- Score: 1.8295720742100332
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Harmful and offensive communication or content is detrimental to social bonding and the mental state of users on social media platforms. Text detoxification is a crucial task in natural language processing (NLP), where the goal is to remove profanity and toxicity from text while preserving its content. Supervised and unsupervised learning are common approaches for designing text detoxification solutions. However, these methods necessitate fine-tuning, leading to computational overhead. In this paper, we propose GPT-DETOX as a framework for prompt-based in-context learning for text detoxification using GPT-3.5 Turbo. We utilize zero-shot and few-shot prompting techniques for detoxifying input sentences. To generate few-shot prompts, we propose two methods: word-matching example selection (WMES) and context-matching example selection (CMES). We additionally consider ensemble in-context learning (EICL), where the ensemble is shaped by base prompts from the zero-shot and all few-shot settings. We use ParaDetox and APPDIA as benchmark detoxification datasets. Our experimental results show that the zero-shot solution achieves promising performance, while our best few-shot setting outperforms the state-of-the-art models on ParaDetox and shows comparable results on APPDIA. Our EICL solutions achieve the best performance, adding at least a 10% improvement on both datasets.
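Since the framework is purely prompt-based, its core loop can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' released implementation: it picks (toxic, detoxified) example pairs by word overlap as a rough stand-in for WMES (CMES would instead rank candidates by contextual-embedding similarity), assembles a few-shot prompt, and queries GPT-3.5 Turbo through the OpenAI chat API. The prompt wording, the example pairs, and the overlap heuristic are assumptions made for illustration; the ensemble (EICL) step, which combines outputs of the zero-shot and few-shot prompts, is omitted.

```python
# Minimal sketch of prompt-based detoxification in the spirit of GPT-DETOX.
# The prompt text, example data, and word-overlap heuristic are illustrative
# assumptions, not the paper's released code.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Hypothetical (toxic, detoxified) pairs standing in for a ParaDetox-style pool.
detox_pairs = [
    ("this movie is absolute garbage", "this movie is not good"),
    ("shut up, nobody asked you", "please stop, nobody asked for your opinion"),
    ("he is a complete idiot", "he is not very smart"),
]

def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap between word sets -- a rough stand-in for WMES scoring."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def select_examples(query: str, k: int = 2):
    """Pick the k pairs whose toxic side shares the most words with the query."""
    ranked = sorted(detox_pairs, key=lambda p: word_overlap(query, p[0]), reverse=True)
    return ranked[:k]

def detoxify(toxic_sentence: str, k: int = 2) -> str:
    """Build a few-shot chat prompt from selected pairs and ask GPT-3.5 Turbo."""
    messages = [{
        "role": "system",
        "content": "Rewrite the given sentence to remove toxicity while preserving its meaning.",
    }]
    for toxic, neutral in select_examples(toxic_sentence, k):
        messages.append({"role": "user", "content": toxic})
        messages.append({"role": "assistant", "content": neutral})
    messages.append({"role": "user", "content": toxic_sentence})
    response = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages, temperature=0
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    print(detoxify("this place is a dump and the staff are morons"))
```

Setting k = 0 reduces the same call to the zero-shot prompt; an EICL-style variant would gather the zero-shot and all few-shot outputs and select among them.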
Related papers
- Fine-Grained Detoxification via Instance-Level Prefixes for Large Language Models [26.474136481185724]
This paper proposes fine-grained detoxification via instance-level prefixes (FGDILP) to mitigate toxic text without additional cost.
FGDILP contrasts the contextualized representation in attention space using a positive prefix-prepended prompt.
We validate that FGDILP enables controlled text generation with regard to toxicity at both the utterance and context levels.
arXiv Detail & Related papers (2024-02-23T09:04:48Z)
- DiffuDetox: A Mixed Diffusion Model for Text Detoxification [12.014080113339178]
Text detoxification is a conditional text generation task aiming to remove offensive content from toxic text.
We propose DiffuDetox, a mixed conditional and unconditional diffusion model for text detoxification.
arXiv Detail & Related papers (2023-06-14T13:41:23Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Improving Textless Spoken Language Understanding with Discrete Units as Intermediate Target [58.59044226658916]
Spoken Language Understanding (SLU) is a task that aims to extract semantic information from spoken utterances.
We propose to use discrete units as intermediate guidance to improve textless SLU performance.
arXiv Detail & Related papers (2023-05-29T14:00:24Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Lex2Sent: A bagging approach to unsupervised sentiment analysis [0.628122931748758]
In this paper, we propose an alternative approach to classifying texts: Lex2Sent.
To classify texts, we train embedding models to determine the distances between document embeddings and the embeddings of a suitable lexicon.
We show that our model outperforms lexica and provides a basis for a high performing few-shot fine-tuning approach in the task of binary sentiment analysis.
arXiv Detail & Related papers (2022-09-26T20:49:18Z)
- Cisco at SemEval-2021 Task 5: What's Toxic?: Leveraging Transformers for Multiple Toxic Span Extraction from Online Comments [1.332560004325655]
This paper describes the system proposed by team Cisco for SemEval-2021 Task 5: Toxic Spans Detection.
We approach this problem primarily in two ways: a sequence tagging approach and a dependency parsing approach.
Our best performing architecture in this approach also proved to be our best performing architecture overall with an F1 score of 0.6922.
arXiv Detail & Related papers (2021-05-28T16:27:49Z)
- Methods for Detoxification of Texts for the Russian Language [55.337471467610094]
We introduce the first study of automatic detoxification of Russian texts to combat offensive language.
We test two types of models: an unsupervised approach that performs local corrections and a supervised approach based on the pretrained GPT-2 language model.
The results show that the tested approaches can be successfully used for detoxification, although there is room for improvement.
arXiv Detail & Related papers (2021-05-19T10:37:44Z)
- Text Mining to Identify and Extract Novel Disease Treatments From Unstructured Datasets [56.38623317907416]
We use Google Cloud to transcribe podcast episodes of an NPR radio show.
We then build a pipeline for systematically pre-processing the text.
Our model successfully identified that Omeprazole can help treat heartburn.
arXiv Detail & Related papers (2020-10-22T19:52:49Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)