hBert + BiasCorp -- Fighting Racism on the Web
- URL: http://arxiv.org/abs/2104.02242v1
- Date: Tue, 6 Apr 2021 02:17:20 GMT
- Title: hBert + BiasCorp -- Fighting Racism on the Web
- Authors: Olawale Onabola, Zhuang Ma, Yang Xie, Benjamin Akera, Abdulrahman
Ibraheem, Jia Xue, Dianbo Liu, Yoshua Bengio
- Abstract summary: We are releasing BiasCorp, a dataset containing 139,090 comments and news segments from three specific sources - Fox News, BreitbartNews and YouTube.
In this work, we present hBERT, in which we modify certain layers of the pretrained BERT model with the new Hopfield layer.
We are also releasing a JavaScript library and a Chrome extension to help developers use our trained model in web applications and to let users identify and report racially biased content on the web.
- Score: 58.768804813646334
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Subtle and overt racism is still present in both physical and online
communities today and has impacted many lives in different segments of
society. In this short piece of work, we present how we are tackling this
societal issue with Natural Language Processing. We are releasing BiasCorp, a
dataset containing 139,090 comments and news segments from three specific
sources - Fox News, BreitbartNews and YouTube. The first batch (45,000
manually annotated examples) is ready for publication, and we are in the
final phase of manually labeling the remaining dataset using Amazon
Mechanical Turk. BERT has been used widely in several downstream tasks. In
this work, we present hBERT, in which we modify certain layers of the
pretrained BERT model with the new Hopfield layer. hBERT generalizes well
across different distributions, with the added advantage of reduced model
complexity. We are also releasing a JavaScript library, to help developers
use our trained model in web applications (say, a chat application), and a
Chrome extension, with which users can identify and report racially biased
content on the web.
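The abstract does not spell out how the Hopfield layer is wired into the encoder, so the following is only a minimal sketch: it implements one step of the modern Hopfield update, new_state = softmax(beta * state @ patterns^T) @ patterns (Ramsauer et al., 2020), the kind of associative layer hBERT substitutes for parts of BERT. The class name, projections, dimensions, and placement below are all assumptions, not the authors' released code.

```python
# A minimal, hypothetical sketch of a modern Hopfield layer
# (Ramsauer et al., 2020) of the kind hBERT swaps into BERT.
# Names, dimensions, and placement are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HopfieldLayer(nn.Module):
    """One step of the modern Hopfield update:
    new_state = softmax(beta * state @ patterns^T) @ patterns,
    i.e. content-based retrieval of stored patterns."""

    def __init__(self, hidden_size: int, beta: float = 1.0):
        super().__init__()
        self.beta = beta
        # Learned projections for the state (query) and patterns (keys).
        self.state_proj = nn.Linear(hidden_size, hidden_size)
        self.pattern_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, state: torch.Tensor, patterns: torch.Tensor) -> torch.Tensor:
        # state:    (batch, seq_len, hidden)      -- states to update
        # patterns: (batch, n_patterns, hidden)   -- stored memories
        q = self.state_proj(state)
        k = self.pattern_proj(patterns)
        scores = self.beta * q @ k.transpose(-2, -1)   # association energies
        return F.softmax(scores, dim=-1) @ patterns    # retrieved patterns

# Hypothetical usage: letting token states associate with themselves
# makes the layer a shape-preserving stand-in for a BERT sublayer.
x = torch.randn(2, 16, 768)                      # (batch, tokens, hidden)
layer = HopfieldLayer(hidden_size=768, beta=768 ** -0.5)
out = layer(x, x)                                # self-association
print(out.shape)                                 # torch.Size([2, 16, 768])
```

With patterns = state, as in the usage above, the update reduces to a form of self-attention over the sequence, which is why such a layer can stand in for a BERT sublayer without changing tensor shapes.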
Related papers
- Detecting Racist Text in Bengali: An Ensemble Deep Learning Framework [0.0]
Racism is an alarming phenomenon in our country as well as all over the world.
We have built a novel dataset in the Bengali Language.
The ensemble framework detects racist Bengali text with an accuracy of 87.94%.
arXiv Detail & Related papers (2024-01-30T04:56:55Z) - Understanding writing style in social media with a supervised
contrastively pre-trained transformer [57.48690310135374]
Online Social Networks serve as fertile ground for harmful behavior, ranging from hate speech to the dissemination of disinformation.
We introduce the Style Transformer for Authorship Representations (STAR), trained on a large corpus of 4.5 x 10^6 authored texts derived from public sources.
Using a support base of 8 documents of 512 tokens, we can discern authors from sets of up to 1616 authors with at least 80% accuracy.
arXiv Detail & Related papers (2023-10-17T09:01:17Z) - Comparative Study of Pre-Trained BERT Models for Code-Mixed
Hindi-English Data [0.7874708385247353]
"Code Mixed" refers to the use of more than one language in the same text.
In this work, we focus on low-resource Hindi-English code-mixed language.
We report state-of-the-art results on respective datasets using HingBERT-based models.
arXiv Detail & Related papers (2023-05-25T05:10:28Z) - BD-SHS: A Benchmark Dataset for Learning to Detect Online Bangla Hate
Speech in Different Social Contexts [1.5483942282713241]
This paper introduces a large manually labeled dataset that includes Hate Speech in different social contexts.
The dataset includes more than 50,200 offensive comments crawled from online social networking sites.
In experiments, we found that word embeddings trained exclusively on 1.47 million comments consistently resulted in better hate speech detection models.
arXiv Detail & Related papers (2022-06-01T10:10:15Z) - BERTuit: Understanding Spanish language in Twitter through a native
transformer [70.77033762320572]
We present BERTuit, the largest transformer proposed so far for the Spanish language, pre-trained on a massive dataset of 230M Spanish tweets.
Our motivation is to provide a powerful resource to better understand Spanish Twitter and to be used on applications focused on this social network.
arXiv Detail & Related papers (2022-04-07T14:28:51Z) - The World of an Octopus: How Reporting Bias Influences a Language
Model's Perception of Color [73.70233477125781]
We show that reporting bias negatively impacts and inherently limits text-only training.
We then demonstrate that multimodal models can leverage their visual training to mitigate these effects.
arXiv Detail & Related papers (2021-10-15T16:28:17Z) - fBERT: A Neural Transformer for Identifying Offensive Content [67.12838911384024]
fBERT is a BERT model retrained on SOLID, the largest English offensive language identification corpus available, with over 1.4 million offensive instances.
We evaluate fBERT's performance on identifying offensive content on multiple English datasets and we test several thresholds for selecting instances from SOLID.
The fBERT model will be made freely available to the community.
arXiv Detail & Related papers (2021-09-10T19:19:26Z) - Scarecrow: A Framework for Scrutinizing Machine Text [69.26985439191151]
We introduce a new structured, crowdsourced error annotation schema called Scarecrow.
Scarecrow collects 13k annotations of 1.3k human- and machine-generated paragraphs of English-language news text.
The findings demonstrate the value of Scarecrow annotations in assessing current and future text generation systems.
arXiv Detail & Related papers (2021-07-02T22:37:03Z) - Pre-Training BERT on Arabic Tweets: Practical Considerations [11.087099497830552]
We pretrained 5 BERT models that differ in the size of their training sets, mixture of formal and informal Arabic, and linguistic preprocessing.
All are intended to support Arabic dialects and social media.
The new models achieve state-of-the-art results on several downstream tasks.
arXiv Detail & Related papers (2021-02-21T20:51:33Z) - Detecting Insincere Questions from Text: A Transfer Learning Approach [0.0]
The internet today has become an unrivalled source of information, with people conversing on content-based websites such as Quora, Reddit, StackOverflow and Twitter.
A major emerging problem with such websites is the proliferation of toxic comments and instances of insincerity, where users spread toxic and divisive content rather than engaging with a sincere motive.
In this paper we solve the Insincere Questions Classification problem by fine-tuning four cutting-edge models, viz. BERT, RoBERTa, DistilBERT and ALBERT.
arXiv Detail & Related papers (2020-12-07T15:03:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.