LLM-based Rewriting of Inappropriate Argumentation using Reinforcement Learning from Machine Feedback
- URL: http://arxiv.org/abs/2406.03363v1
- Date: Wed, 5 Jun 2024 15:18:08 GMT
- Title: LLM-based Rewriting of Inappropriate Argumentation using Reinforcement Learning from Machine Feedback
- Authors: Timon Ziegenbein, Gabriella Skitalinskaya, Alireza Bayat Makou, Henning Wachsmuth
- Abstract summary: This paper studies how inappropriate language in arguments can be computationally mitigated.
We propose a reinforcement learning-based rewriting approach that balances content preservation and appropriateness.
We evaluate different weighting schemes for the reward function in both absolute and relative human assessment studies.
- Score: 16.57980268646285
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ensuring that online discussions are civil and productive is a major challenge for social media platforms. Such platforms usually rely both on users and on automated detection tools to flag inappropriate arguments of other users, which moderators then review. However, this kind of post-hoc moderation is expensive and time-consuming, and moderators are often overwhelmed by the amount and severity of flagged content. Instead, a promising alternative is to prevent negative behavior during content creation. This paper studies how inappropriate language in arguments can be computationally mitigated. We propose a reinforcement learning-based rewriting approach that balances content preservation and appropriateness based on existing classifiers, prompting an instruction-finetuned large language model (LLM) as our initial policy. Unlike related style transfer tasks, rewriting inappropriate arguments allows deleting and adding content permanently. It is therefore tackled on document level rather than sentence level. We evaluate different weighting schemes for the reward function in both absolute and relative human assessment studies. Systematic experiments on non-parallel data provide evidence that our approach can mitigate the inappropriateness of arguments while largely preserving their content. It significantly outperforms competitive baselines, including few-shot learning, prompting, and humans.
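As a concrete illustration of the kind of reward described above, the sketch below combines an appropriateness score from an off-the-shelf classifier with a content-preservation score from sentence embeddings under a single weight alpha. This is a minimal sketch under assumed components: the specific models, the weighting scheme, and the helper names are illustrative, not the authors' implementation.

```python
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

# Placeholder classifier and similarity model; the paper's actual classifiers
# and weighting schemes are not reproduced here, so these choices are assumptions.
appropriateness_clf = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
similarity_model = SentenceTransformer("all-MiniLM-L6-v2")

def reward(original: str, rewrite: str, alpha: float = 0.5) -> float:
    """Weighted combination of appropriateness and content preservation."""
    # Appropriateness: probability that the rewrite is judged acceptable.
    pred = appropriateness_clf(rewrite)[0]
    appropriateness = pred["score"] if pred["label"] == "POSITIVE" else 1.0 - pred["score"]
    # Content preservation: cosine similarity between the two texts' embeddings.
    emb = similarity_model.encode([original, rewrite], convert_to_tensor=True)
    preservation = float(util.cos_sim(emb[0], emb[1]))
    # alpha encodes the weighting scheme: it trades appropriateness off against preservation.
    return alpha * appropriateness + (1.0 - alpha) * preservation
```

A scalar reward of this form could then drive policy-gradient fine-tuning (e.g., with PPO) of the instruction-finetuned LLM used as the initial policy.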
Related papers
- The Art of Saying No: Contextual Noncompliance in Language Models [123.383993700586]
We introduce a comprehensive taxonomy of contextual noncompliance describing when and how models should not comply with user requests.
Our taxonomy spans a wide range of categories including incomplete, unsupported, indeterminate, and humanizing requests.
To test noncompliance capabilities of language models, we use this taxonomy to develop a new evaluation suite of 1000 noncompliance prompts.
arXiv Detail & Related papers (2024-07-02T07:12:51Z) - Consolidating Strategies for Countering Hate Speech Using Persuasive Dialogues [3.8979646385036175]
We explore controllable strategies for generating counter-arguments to hateful comments in online conversations.
Using automatic and human evaluations, we determine the best combination of features that generate fluent, argumentative, and logically sound arguments.
We share developed computational models for automatically annotating text with such features, and a silver-standard annotated version of an existing hate speech dialogue corpus.
arXiv Detail & Related papers (2024-01-15T16:31:18Z) - CASA: Causality-driven Argument Sufficiency Assessment [79.13496878681309]
We propose CASA, a zero-shot causality-driven argument sufficiency assessment framework.
The probability of sufficiency (PS) measures how likely introducing the premise event would lead to the conclusion when both the premise and conclusion events are absent.
Experiments on two logical fallacy detection datasets demonstrate that CASA accurately identifies insufficient arguments.
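For orientation, the quantity referred to as PS above corresponds to Pearl's probability of sufficiency; mapping the premise to X and the conclusion to Y below is an illustrative reading, not necessarily the paper's exact formalization.

```latex
% Probability of sufficiency: the chance that introducing the premise (X=1)
% would make the conclusion hold (Y=1), given that both are currently absent.
\[
  \mathrm{PS} \;=\; P\bigl(Y_{X=1} = 1 \,\bigm|\, X = 0,\; Y = 0\bigr)
\]
```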
arXiv Detail & Related papers (2024-01-10T16:21:18Z) - Argue with Me Tersely: Towards Sentence-Level Counter-Argument Generation [62.069374456021016]
We present the ArgTersely benchmark for sentence-level counter-argument generation.
We also propose Arg-LlaMA for generating high-quality counter-arguments.
arXiv Detail & Related papers (2023-12-21T06:51:34Z) - Making Retrieval-Augmented Language Models Robust to Irrelevant Context [55.564789967211844]
An important desideratum of retrieval-augmented language models (RALMs) is that retrieved information helps model performance when it is relevant.
Recent work has shown that retrieval augmentation can sometimes have a negative effect on performance.
arXiv Detail & Related papers (2023-10-02T18:52:35Z) - Evaluating Factual Consistency of Texts with Semantic Role Labeling [3.1776833268555134]
We introduce SRLScore, a reference-free evaluation metric designed with text summarization in mind.
A final factuality score is computed by an adjustable scoring mechanism.
Correlation with human judgments on English summarization datasets shows that SRLScore is competitive with state-of-the-art methods.
arXiv Detail & Related papers (2023-05-22T17:59:42Z) - Verifying the Robustness of Automatic Credibility Assessment [79.08422736721764]
Text classification methods have been widely investigated as a way to detect content of low credibility.
In some cases insignificant changes in input text can mislead the models.
We introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
arXiv Detail & Related papers (2023-03-14T16:11:47Z) - QRelScore: Better Evaluating Generated Questions with Deeper Understanding of Context-aware Relevance [54.48031346496593]
We propose QRelScore, a context-aware relevance evaluation metric for generated questions.
Based on off-the-shelf language models such as BERT and GPT2, QRelScore employs both word-level hierarchical matching and sentence-level prompt-based generation.
Compared with existing metrics, our experiments demonstrate that QRelScore is able to achieve a higher correlation with human judgments while being much more robust to adversarial samples.
arXiv Detail & Related papers (2022-04-29T07:39:53Z) - Data Expansion using Back Translation and Paraphrasing for Hate Speech Detection [1.192436948211501]
We present a new deep learning-based method that fuses a Back Translation method and a Paraphrasing technique for data augmentation.
We evaluate our proposal on five publicly available datasets; namely, AskFm corpus, Formspring dataset, Warner and Waseem dataset, Olid, and Wikipedia toxic comments dataset.
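As a rough sketch of the back-translation component (not the paper's exact pipeline), a round trip through a pivot language yields paraphrase-like variants of training sentences; the MarianMT checkpoints and the English-French pivot below are illustrative choices.

```python
from transformers import MarianMTModel, MarianTokenizer

def load(name):
    # Load a MarianMT translation model and its tokenizer from the Hugging Face hub.
    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)

en_fr_tok, en_fr = load("Helsinki-NLP/opus-mt-en-fr")
fr_en_tok, fr_en = load("Helsinki-NLP/opus-mt-fr-en")

def translate(texts, tokenizer, model):
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

def back_translate(texts):
    """Return paraphrase-like variants via an English-French round trip."""
    return translate(translate(texts, en_fr_tok, en_fr), fr_en_tok, fr_en)

# Example: augment a small (hypothetical) training set with back-translated copies.
texts = ["You should be ashamed of this comment."]
augmented = texts + back_translate(texts)
```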
arXiv Detail & Related papers (2021-05-25T09:52:42Z) - Stay on Topic, Please: Aligning User Comments to the Content of a News Article [7.3203631241415055]
We propose a classification algorithm to categorize user comments posted to a news article based on their alignment to its content.
The alignment seeks to match user comments to an article based on similarity of content, entities in discussion, and topic.
We conduct a user study to evaluate human labeling performance to understand the difficulty of the classification task.
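A minimal baseline consistent with this description (though not necessarily the paper's algorithm) scores a comment's alignment to the article via TF-IDF cosine similarity and thresholds it; the threshold and featurization below are assumptions, and the entity and topic signals are omitted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def is_on_topic(article: str, comment: str, threshold: float = 0.2) -> bool:
    """Label a comment as aligned with the article if TF-IDF similarity exceeds a threshold."""
    vectorizer = TfidfVectorizer(stop_words="english")
    vectors = vectorizer.fit_transform([article, comment])
    # Cosine similarity between the article row and the comment row.
    score = cosine_similarity(vectors[0], vectors[1])[0, 0]
    return score >= threshold
```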
arXiv Detail & Related papers (2021-03-03T18:29:00Z) - WAC: A Corpus of Wikipedia Conversations for Online Abuse Detection [0.0]
We propose an original framework, based on the Wikipedia Comment corpus, with comment-level annotations of different types.
This large corpus of more than 380k annotated messages opens perspectives for online abuse detection and especially for context-based approaches.
We also propose, in addition to this corpus, a complete benchmarking platform to stimulate and fairly compare scientific works around the problem of content abuse detection.
arXiv Detail & Related papers (2020-03-13T10:26:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.