Linguistic Fingerprints of Internet Censorship: the Case of SinaWeibo
- URL: http://arxiv.org/abs/2001.08845v1
- Date: Thu, 23 Jan 2020 23:08:24 GMT
- Title: Linguistic Fingerprints of Internet Censorship: the Case of SinaWeibo
- Authors: Kei Yin Ng, Anna Feldman, Jing Peng
- Abstract summary: This paper studies how the linguistic components of blogposts might affect the blogposts' likelihood of being censored.
We build a classifier that significantly outperforms non-expert humans in predicting whether a blogpost will be censored.
Our work suggests that it is possible to use linguistic properties of social media posts to automatically predict if they are going to be censored.
- Score: 4.544151613454639
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies how the linguistic components of blogposts collected from
Sina Weibo, a Chinese microblogging platform, might affect the blogposts'
likelihood of being censored. Our results are consistent with King et al.'s
(2013) Collective Action Potential (CAP) theory, which holds that a blogpost's
potential to cause a riot or assembly in real life is the key determinant of
whether it gets censored. Although there is no definitive measure of this
construct, the linguistic features that we identify as discriminatory align
with the CAP theory. We build a classifier that significantly outperforms
non-expert humans in predicting whether a blogpost will be censored. The
crowdsourcing results suggest that while humans tend to see censored blogposts
as more controversial and more likely to trigger real-life action than their
uncensored counterparts, they generally cannot make a better guess than our
model when it comes to 'reading the mind' of the censors in deciding whether a
blogpost should be censored. We do not claim that censorship is determined
solely by linguistic features; many other factors contribute to
censorship decisions. The focus of the present paper is on the linguistic form
of blogposts. Our work suggests that it is possible to use linguistic
properties of social media posts to automatically predict if they are going to
be censored.
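As a rough illustration of the overall approach, here is a minimal sketch of a censorship classifier: shallow textual features feeding a linear model. The toy posts and labels below are invented placeholders, and the paper's actual feature set and model are richer than this.

```python
# A minimal sketch (not the authors' actual pipeline): character n-gram
# features plus logistic regression to predict censorship from text form.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; real experiments would use Weibo posts with
# observed censorship outcomes (1 = censored, 0 = published and retained).
posts = [
    "everyone gather at the square tomorrow",   # placeholder text
    "lovely weather in Beijing today",
    "we should all take to the streets",
    "trying a new noodle recipe tonight",
]
labels = [1, 0, 1, 0]

# Character n-grams are a convenient proxy for shallow linguistic form,
# especially for Chinese, where word segmentation is itself nontrivial.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(posts, labels)
print(model.predict(["meet at the square at noon"]))
```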
Related papers
- Why Should This Article Be Deleted? Transparent Stance Detection in Multilingual Wikipedia Editor Discussions [47.944081120226905]
We construct a novel dataset of Wikipedia editor discussions along with their reasoning in three languages.
The dataset contains the stances of the editors (keep, delete, merge, comment), along with the stated reason, and a content moderation policy, for each edit decision.
We demonstrate that stance and corresponding reason (policy) can be predicted jointly with a high degree of accuracy, adding transparency to the decision-making process.
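As a hedged sketch of what joint stance-and-reason prediction might look like (the comments, labels, and policies below are invented, and the paper's own models are multilingual transformers, not this toy setup):

```python
# Joint prediction sketch: one shared feature extractor, two linked
# classification targets (stance, cited policy) via MultiOutputClassifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical editor comments with (stance, cited policy) label pairs.
comments = [
    "Fails notability guidelines, should go",
    "Well sourced, clearly notable, keep it",
    "Overlaps heavily with the main article",
    "No reliable sources at all, delete",
]
targets = [
    ("delete", "notability"),
    ("keep", "notability"),
    ("merge", "content-fork"),
    ("delete", "verifiability"),
]

model = make_pipeline(
    TfidfVectorizer(),
    MultiOutputClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(comments, targets)
print(model.predict(["completely unsourced and non-notable"]))
```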
arXiv Detail & Related papers (2023-10-09T15:11:02Z)
- How We Express Ourselves Freely: Censorship, Self-censorship, and Anti-censorship on a Chinese Social Media [4.408128846525362]
We identify metrics of censorship and self-censorship, find the influencing factors, and construct a mediation model to measure their relationship.
Based on these findings, we discuss implications for democratic social media design and future censorship research.
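One way to read "mediation model" concretely is a Baron-Kenny-style regression analysis; the sketch below is an invented toy, not the authors' actual model, variables, or data.

```python
# Toy mediation sketch: does perceived censorship (X) affect willingness
# to express (Y) through self-censorship (M)? All data is simulated.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=n)                       # perceived platform censorship
M = 0.6 * X + rng.normal(size=n)             # self-censorship (mediator)
Y = -0.5 * M - 0.1 * X + rng.normal(size=n)  # willingness to express

# Path a: X -> M
a = sm.OLS(M, sm.add_constant(X)).fit().params[1]
# Paths b and c': M -> Y controlling for X
bc = sm.OLS(Y, sm.add_constant(np.column_stack([M, X]))).fit()
b, c_prime = bc.params[1], bc.params[2]

print(f"indirect effect a*b = {a * b:.3f}, direct effect c' = {c_prime:.3f}")
```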
arXiv Detail & Related papers (2022-11-24T18:28:16Z)
- The State of Profanity Obfuscation in Natural Language Processing [29.95449849179384]
Obfuscating profanities makes it challenging to evaluate the content, especially for non-native speakers.
We suggest a multilingual community resource called PrOf that has a Python module to standardize profanity obfuscation processes.
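PrOf's actual interface is not reproduced here; the sketch below only illustrates what a standardized obfuscation policy might look like, with a hypothetical lexicon and masking rule.

```python
# Hypothetical illustration of standardized profanity obfuscation: a fixed
# policy (keep the first letter, mask the rest) applied over a lexicon.
# This is NOT PrOf's API, just the general idea.
import re

PROFANITY_LEXICON = {"darn", "heck"}  # placeholder entries

def obfuscate(text: str, keep: int = 1, mask: str = "*") -> str:
    """Mask lexicon words, keeping the first `keep` characters."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        return word[:keep] + mask * (len(word) - keep)
    pattern = r"\b(" + "|".join(map(re.escape, PROFANITY_LEXICON)) + r")\b"
    return re.sub(pattern, repl, text, flags=re.IGNORECASE)

print(obfuscate("What the heck is this darn thing?"))
# -> "What the h*** is this d*** thing?"
```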
arXiv Detail & Related papers (2022-10-14T07:45:36Z)
- Analyzing the Intensity of Complaints on Social Media [55.140613801802886]
We present the first computational linguistics study of measuring the intensity of complaints from text.
We create the first Chinese dataset containing 3,103 posts about complaints from Weibo, a popular Chinese social media platform.
We show that complaint intensity can be accurately estimated by computational models, with the best model achieving a mean squared error of 0.11.
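A minimal sketch of complaint-intensity regression evaluated with mean squared error, the metric reported above; the posts and scores below are invented placeholders, not the Weibo dataset.

```python
# Intensity regression sketch: TF-IDF features plus ridge regression,
# evaluated with mean squared error. Toy data stands in for real posts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline

# Hypothetical posts with annotated intensity scores in [0, 1].
posts = [
    "the delivery was a bit late",
    "this is the worst service I have ever seen",
    "slightly disappointed with the packaging",
    "absolutely outrageous, never buying again",
]
intensity = [0.3, 0.9, 0.4, 1.0]

model = make_pipeline(TfidfVectorizer(), Ridge())
model.fit(posts, intensity)
preds = model.predict(posts)
print("train MSE:", mean_squared_error(intensity, preds))
```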
arXiv Detail & Related papers (2022-04-20T10:15:44Z)
- Beyond Plain Toxic: Detection of Inappropriate Statements on Flammable Topics for the Russian Language [76.58220021791955]
We present two text collections labelled according to a binary notion of inappropriateness and a multinomial notion of sensitive topics.
To objectivise the notion of inappropriateness, we define it in a data-driven way through crowdsourcing.
arXiv Detail & Related papers (2022-03-04T15:59:06Z)
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
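Charformer's implementation is not reproduced here; the sketch below only illustrates the token-free idea of feeding raw UTF-8 bytes instead of vocabulary indices, so one encoder covers any script without a static vocabulary.

```python
# Token-free input sketch: encode text as raw UTF-8 byte values rather
# than tokenizer indices. Any language or script maps to the same 0-255
# alphabet, which is the flexibility the abstract refers to.
def byte_encode(text: str, max_len: int = 64, pad: int = 0) -> list[int]:
    """Encode text as a fixed-length sequence of byte values (0-255)."""
    ids = list(text.encode("utf-8"))[:max_len]
    return ids + [pad] * (max_len - len(ids))

# The same encoder covers English, Chinese, emoji, misspellings, etc.
print(byte_encode("toxic?!")[:10])
print(byte_encode("有毒吗")[:10])
```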
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
- COLD: A Benchmark for Chinese Offensive Language Detection [54.60909500459201]
We use COLDataset, a Chinese offensive language dataset with 37k annotated sentences.
We also propose COLDetector to study the offensiveness of outputs from popular Chinese language models.
Our resources and analyses are intended to help detoxify Chinese online communities and evaluate the safety of generative language models.
arXiv Detail & Related papers (2022-01-16T11:47:23Z)
- Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic Language Detection [75.54119209776894]
We investigate the effect of annotator identities (who) and beliefs (why) on toxic language annotations.
We consider posts with three characteristics: anti-Black language, African American English dialect, and vulgarity.
Our results show strong associations between annotator identity and beliefs and their ratings of toxicity.
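A hedged sketch of the kind of association analysis this suggests: comparing mean toxicity ratings across annotator groups on the same categories of posts. The column names and numbers below are invented, not the paper's actual schema or findings.

```python
# Group-comparison sketch: mean toxicity rating per (post type, annotator
# belief) cell exposes group-level rating gaps. Data is hypothetical.
import pandas as pd

ratings = pd.DataFrame({
    "post_type": ["AAE", "AAE", "vulgar", "vulgar", "anti-Black", "anti-Black"],
    "annotator_belief": ["free-speech", "harm-focused"] * 3,
    "toxicity": [1, 3, 2, 4, 4, 5],  # e.g., 1-5 Likert ratings
})

print(ratings.pivot_table(index="post_type",
                          columns="annotator_belief",
                          values="toxicity"))
```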
arXiv Detail & Related papers (2021-11-15T18:58:20Z)
- Is radicalization reinforced by social media censorship? [0.0]
Radicalized beliefs, such as those tied to QAnon, Russiagate, and other political conspiracy theories, can lead some individuals and groups to engage in violent behavior.
This article presents an agent-based model of a social media network that enables investigation of the effects of censorship on the amount of dissenting information.
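A toy sketch in the spirit of the article's question (not its actual model): agents post, dissenting posts are censored with some probability, and being censored nudges the author toward stronger dissent.

```python
# Toy agent-based sketch: censorship removes dissenting posts, and each
# censored agent's radicalization score ticks up, letting us watch how
# censorship pressure and radicalization interact. All parameters invented.
import random

random.seed(0)
N_AGENTS, STEPS, CENSOR_RATE = 100, 1000, 0.5

# Each agent starts mildly dissenting with probability 0.2.
agents = [{"radical": 0.2 if random.random() < 0.2 else 0.0}
          for _ in range(N_AGENTS)]

for _ in range(STEPS):
    agent = random.choice(agents)
    is_dissent = random.random() < agent["radical"]
    if is_dissent and random.random() < CENSOR_RATE:
        # Being censored nudges the agent toward stronger dissent.
        agent["radical"] = min(1.0, agent["radical"] + 0.05)

print("mean radicalization:", sum(a["radical"] for a in agents) / N_AGENTS)
```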
arXiv Detail & Related papers (2021-03-23T21:07:34Z)
- A Dataset of State-Censored Tweets [3.0254442724635173]
We release a dataset of 583,437 tweets by 155,715 users that were censored between 2012 and July 2020.
We also release 4,301 accounts that were censored in their entirety.
Our dataset will aid not only the study of government censorship but also research on hate speech detection and the effects of censorship on social media users.
arXiv Detail & Related papers (2021-01-15T00:18:27Z)
- Reading In-Between the Lines: An Analysis of Dissenter [2.2881898195409884]
We study Dissenter, a browser and web application that provides a conversational overlay for any web page.
In this work, we obtain a history of Dissenter comments, users, and the websites being discussed.
Our corpus consists of approximately 1.68M comments made by 101k users commenting on 588k distinct URLs.
arXiv Detail & Related papers (2020-09-03T16:25:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.