Revealing Hidden Mechanisms of Cross-Country Content Moderation with Natural Language Processing
- URL: http://arxiv.org/abs/2503.05280v2
- Date: Mon, 10 Mar 2025 04:41:06 GMT
- Title: Revealing Hidden Mechanisms of Cross-Country Content Moderation with Natural Language Processing
- Authors: Neemesh Yadav, Jiarui Liu, Francesco Ortu, Roya Ensafi, Zhijing Jin, Rada Mihalcea
- Abstract summary: We study content moderation decisions made across countries using pre-existing corpora from the Twitter Stream Grab. Our experiments reveal interesting patterns in censored posts, both across countries and over time. We assess the effectiveness of using LLMs in content moderation.
- Score: 34.69237228285959
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability of Natural Language Processing (NLP) methods to categorize text into multiple classes has motivated their use in online content moderation tasks, such as hate speech and fake news detection. However, there is limited understanding of how or why these methods make such decisions, or why certain content is moderated in the first place. To investigate the hidden mechanisms behind content moderation, we explore multiple directions: 1) training classifiers to reverse-engineer content moderation decisions across countries; 2) explaining content moderation decisions by analyzing Shapley values and LLM-guided explanations. Our primary focus is on content moderation decisions made across countries, using pre-existing corpora sampled from the Twitter Stream Grab. Our experiments reveal interesting patterns in censored posts, both across countries and over time. Through human evaluations of LLM-generated explanations across three LLMs, we assess the effectiveness of using LLMs in content moderation. Finally, we discuss potential future directions, as well as the limitations and ethical considerations of this work. Our code and data are available at https://github.com/causalNLP/censorship
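As a rough, hedged illustration of the first direction described in the abstract (not the authors' released pipeline; see the linked repository for that), the sketch below trains a simple bag-of-words classifier to reverse-engineer whether a post was withheld and then attributes its decisions with Shapley values via the `shap` package. The input file, column names, and modeling choices are assumptions made only for illustration.

```python
# Minimal sketch, not the authors' implementation: reverse-engineering
# moderation decisions with a linear classifier and explaining them with
# Shapley values. The file "withheld_tweets.csv" and its columns "text"
# (post content) and "withheld" (1 = withheld in a given country) are
# hypothetical placeholders.
import numpy as np
import pandas as pd
import shap
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("withheld_tweets.csv")  # hypothetical dataset
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["withheld"], test_size=0.2, random_state=0
)

# Bag-of-words classifier predicting whether a post would be withheld.
vectorizer = TfidfVectorizer(max_features=20_000, ngram_range=(1, 2))
Xtr = vectorizer.fit_transform(X_train)
Xte = vectorizer.transform(X_test)
clf = LogisticRegression(max_iter=1000).fit(Xtr, y_train)
print("held-out accuracy:", clf.score(Xte, y_test))

# Shapley-value attribution: which n-grams push a post toward "withheld"?
explainer = shap.LinearExplainer(clf, Xtr)
sample = Xte[:100].toarray()          # densify a small slice for inspection
shap_values = explainer.shap_values(sample)
feature_names = vectorizer.get_feature_names_out()

# Top contributing n-grams for the first held-out post.
top = np.argsort(-np.abs(shap_values[0]))[:10]
for j in top:
    print(f"{feature_names[j]:30s} {shap_values[0][j]:+.4f}")
```

Aggregating such attributions per country and time window is one way to surface the kinds of patterns the abstract alludes to; the authors' actual models, features, and explanation setup are documented in the repository linked above.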
Related papers
- What Large Language Models Do Not Talk About: An Empirical Study of Moderation and Censorship Practices [46.30336056625582]
This work investigates the extent to which Large Language Models refuse to answer or omit information when prompted on political topics.
Our analysis covers 14 state-of-the-art models from Western countries, China, and Russia, prompted in all six official United Nations (UN) languages.
arXiv Detail & Related papers (2025-04-04T09:09:06Z)
- Advancing Content Moderation: Evaluating Large Language Models for Detecting Sensitive Content Across Text, Images, and Videos [0.1399948157377307]
Governments, educators, and parents are often at odds with media platforms about how to regulate, control, and limit the spread of such content.
Techniques from natural language processing and computer vision have been used widely to automatically identify and filter out sensitive content.
More sophisticated algorithms for understanding the context of both text and images may open room for improvement in content censorship.
arXiv Detail & Related papers (2024-11-26T05:29:18Z)
- Can Large Language Models (or Humans) Disentangle Text? [6.858838842613459]
We investigate the potential of large language models (LLMs) to disentangle text variables.
We employ a range of LLM approaches in an attempt to disentangle text by identifying and removing information about a target variable.
We show that, in the strong test of removing sentiment, the statistical association between the processed text and sentiment is still detectable by machine learning classifiers.
arXiv Detail & Related papers (2024-03-25T09:51:54Z)
- Algorithmic Arbitrariness in Content Moderation [1.4849645397321183]
We show how content moderation tools can arbitrarily classify samples as toxic.
We discuss these findings in terms of the human rights set out by the International Covenant on Civil and Political Rights (ICCPR).
Our study underscores the need to identify and increase the transparency of arbitrariness in content moderation applications.
arXiv Detail & Related papers (2024-02-26T19:27:00Z)
- Exploring the Jungle of Bias: Political Bias Attribution in Language Models via Dependency Analysis [86.49858739347412]
Large Language Models (LLMs) have sparked intense debate regarding the prevalence of bias in these models and its mitigation.
We propose a prompt-based method for the extraction of confounding and mediating attributes which contribute to the decision process.
We find that the observed disparate treatment can at least in part be attributed to confounding and mediating attributes and model misalignment.
arXiv Detail & Related papers (2023-11-15T00:02:25Z)
- Why Should This Article Be Deleted? Transparent Stance Detection in Multilingual Wikipedia Editor Discussions [47.944081120226905]
We construct a novel dataset of Wikipedia editor discussions along with their reasoning in three languages.
The dataset contains the stances of the editors (keep, delete, merge, comment), along with the stated reason, and a content moderation policy, for each edit decision.
We demonstrate that stance and corresponding reason (policy) can be predicted jointly with a high degree of accuracy, adding transparency to the decision-making process.
arXiv Detail & Related papers (2023-10-09T15:11:02Z)
- Towards Effective Disambiguation for Machine Translation with Large Language Models [65.80775710657672]
We study the capabilities of large language models to translate "ambiguous sentences".
Experiments show that our methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions.
arXiv Detail & Related papers (2023-09-20T22:22:52Z)
- Cross-Lingual Knowledge Editing in Large Language Models [73.12622532088564]
Knowledge editing has been shown to adapt large language models to new knowledge without retraining from scratch.
However, the effect of editing knowledge in a source language on a different target language remains unknown.
We first collect a large-scale cross-lingual synthetic dataset by translating ZsRE from English to Chinese.
arXiv Detail & Related papers (2023-09-16T11:07:52Z)
- Evaluating GPT-3 Generated Explanations for Hateful Content Moderation [8.63841985804905]
We use GPT-3 to generate explanations for both hateful and non-hateful content.
A survey was conducted with 2,400 unique respondents to evaluate the generated explanations.
Our findings reveal that human evaluators rated the GPT-generated explanations as high quality in terms of linguistic fluency, informativeness, persuasiveness, and logical soundness.
arXiv Detail & Related papers (2023-05-28T10:05:13Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.