WAC: A Corpus of Wikipedia Conversations for Online Abuse Detection
- URL: http://arxiv.org/abs/2003.06190v1
- Date: Fri, 13 Mar 2020 10:26:45 GMT
- Title: WAC: A Corpus of Wikipedia Conversations for Online Abuse Detection
- Authors: Noé Cecillon (LIA), Vincent Labatut (LIA), Richard Dufour (LIA),
Georges Linares (LIA)
- Abstract summary: We propose an original framework, based on the Wikipedia Comment corpus, with comment-level annotations of different types.
This large corpus of more than 380k annotated messages opens perspectives for online abuse detection and especially for context-based approaches.
We also propose, in addition to this corpus, a complete benchmarking platform to stimulate and fairly compare scientific works around the problem of content abuse detection.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the spread of online social networks, it is more and more difficult to
monitor all the user-generated content. Automating the moderation of
inappropriate content exchanged on the Internet has thus become a priority task.
Methods have been proposed for this purpose, but it can be challenging to find
a suitable dataset to train and develop them. This issue is especially true for
approaches based on information derived from the structure and the dynamics of
the conversation. In this work, we propose an original framework, based on the
Wikipedia Comment corpus, with comment-level abuse annotations of different
types. The major contribution concerns the reconstruction of conversations, in
contrast to existing corpora, which focus only on isolated messages (i.e.
taken out of their conversational context). This large corpus of more than 380k
annotated messages opens perspectives for online abuse detection and especially
for context-based approaches. We also propose, in addition to this corpus, a
complete benchmarking platform to stimulate and fairly compare scientific works
around the problem of content abuse detection, trying to avoid the recurring
problem of result replication. Finally, we apply two classification methods to
our dataset to demonstrate its potential.
Related papers
- LLM-based Rewriting of Inappropriate Argumentation using Reinforcement Learning from Machine Feedback [16.57980268646285]
This paper studies how inappropriate language in arguments can be computationally mitigated.
We propose a reinforcement learning-based rewriting approach that balances content preservation and appropriateness.
We evaluate different weighting schemes for the reward function in both absolute and relative human assessment studies.
arXiv Detail & Related papers (2024-06-05T15:18:08Z) - SSP: Self-Supervised Post-training for Conversational Search [63.28684982954115]
We propose Self-Supervised Post-training (SSP), a new post-training paradigm with three self-supervised tasks to efficiently initialize the conversational search model.
To verify the effectiveness of our proposed method, we apply the conversational encoder post-trained by SSP on the conversational search task using two benchmark datasets: CAsT-19 and CAsT-20.
arXiv Detail & Related papers (2023-07-02T13:36:36Z) - Integrity and Junkiness Failure Handling for Embedding-based Retrieval:
A Case Study in Social Network Search [26.705196461992845]
Embedding-based retrieval is used in a variety of search applications, such as e-commerce and social network search.
In this paper, we conduct an analysis of embedding-based retrieval launched in early 2021 on our social network search engine.
We define two main categories of failures introduced by it, integrity and junkiness.
arXiv Detail & Related papers (2023-04-18T20:53:47Z) - Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models.
We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z) - Perplexed by Quality: A Perplexity-based Method for Adult and Harmful
Content Detection in Multilingual Heterogeneous Web Data [0.0]
We explore different methods for detecting adult and harmful content in multilingual heterogeneous web data.
We train solely with adult and harmful textual data, and then select the documents having a perplexity value above a given threshold.
This approach will virtually cluster our documents into two distinct groups, which will greatly facilitate the choice of the threshold for the perplexity.
arXiv Detail & Related papers (2022-12-20T17:14:45Z) - A combined approach to the analysis of speech conversations in a contact
center domain [2.575030923243061]
We describe an experiment with a speech analytics process for an Italian contact center, which deals with call recordings extracted from inbound or outbound flows.
First, we illustrate in detail the development of an in-house speech-to-text solution, based on the Kaldi framework.
Then, we evaluate and compare different approaches to the semantic tagging of call transcripts.
Finally, a decision tree inducer, called J48S, is applied to the problem of tagging.
arXiv Detail & Related papers (2022-03-12T10:03:20Z) - Data Expansion using Back Translation and Paraphrasing for Hate Speech
Detection [1.192436948211501]
We present a new deep learning-based method that fuses a Back Translation method and a Paraphrasing technique for data augmentation.
We evaluate our proposal on five publicly available datasets, namely the AskFm corpus, the Formspring dataset, the Warner and Waseem dataset, OLID, and the Wikipedia toxic comments dataset.
arXiv Detail & Related papers (2021-05-25T09:52:42Z) - Unsupervised Summarization for Chat Logs with Topic-Oriented Ranking and
Context-Aware Auto-Encoders [59.038157066874255]
We propose a novel framework called RankAE to perform chat summarization without employing manually labeled data.
RankAE consists of a topic-oriented ranking strategy that selects topic utterances according to centrality and diversity simultaneously.
A denoising auto-encoder is designed to generate succinct but context-informative summaries based on the selected utterances.
arXiv Detail & Related papers (2020-12-14T07:31:17Z) - Detecting and Classifying Malevolent Dialogue Responses: Taxonomy, Data
and Methodology [68.8836704199096]
Corpus-based conversational interfaces are able to generate more diverse and natural responses than template-based or retrieval-based agents.
With the increased generative capacity of corpus-based conversational agents comes the need to classify and filter out malevolent responses.
Previous studies on the topic of recognizing and classifying inappropriate content are mostly focused on a certain category of malevolence.
arXiv Detail & Related papers (2020-08-21T22:43:27Z) - Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z) - ORD: Object Relationship Discovery for Visual Dialogue Generation [60.471670447176656]
We propose an object relationship discovery (ORD) framework to preserve the object interactions for visual dialogue generation.
A hierarchical graph convolutional network (HierGCN) is proposed to retain the object nodes and neighbour relationships locally, and then refine the object-object connections globally.
Experiments have proved that the proposed method can significantly improve the quality of dialogue by utilising the contextual information of visual relationships.
arXiv Detail & Related papers (2020-06-15T12:25:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.