The Constant in HATE: Analyzing Toxicity in Reddit across Topics and Languages
- URL: http://arxiv.org/abs/2404.18726v1
- Date: Mon, 29 Apr 2024 14:14:33 GMT
- Title: The Constant in HATE: Analyzing Toxicity in Reddit across Topics and Languages
- Authors: Wondimagegnhue Tsegaye Tufa, Ilia Markov, Piek Vossen
- Abstract summary: Toxic language remains an ongoing challenge on social media platforms.
This paper provides a cross-topic and cross-lingual analysis of toxicity in Reddit conversations.
- Score: 2.5398014196797605
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Toxic language remains an ongoing challenge on social media platforms, presenting significant issues for users and communities. This paper provides a cross-topic and cross-lingual analysis of toxicity in Reddit conversations. We collect 1.5 million comment threads from 481 communities in six languages: English, German, Spanish, Turkish, Arabic, and Dutch, covering 80 topics such as Culture, Politics, and News. We thoroughly analyze how toxicity spikes within different communities in relation to specific topics. We observe consistent patterns of increased toxicity across languages for certain topics, while also noting significant variations within specific language communities.
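The abstract above describes collecting 1.5 million multilingual comment threads and comparing toxicity levels across topics and languages. As a rough, hedged illustration of that kind of aggregation: the abstract does not name the toxicity classifier used, so the sketch below substitutes the open-source Detoxify multilingual model, and the comment fields ("body", "lang", "topic") are hypothetical placeholders rather than the paper's actual data schema.

```python
# Minimal sketch: mean toxicity per (language, topic) group of Reddit comments.
# The paper does not specify its scorer here; the open-source Detoxify
# multilingual model is used purely as a stand-in, and all field names are
# hypothetical.
from collections import defaultdict
from statistics import mean

from detoxify import Detoxify  # pip install detoxify

comments = [
    {"body": "Example comment text", "lang": "en", "topic": "Politics"},
    {"body": "Voorbeeld van een reactie", "lang": "nl", "topic": "Culture"},
]

model = Detoxify("multilingual")  # XLM-R-based multilingual toxicity classifier

scores_by_group = defaultdict(list)
for c in comments:
    toxicity = model.predict(c["body"])["toxicity"]  # probability in [0, 1]
    scores_by_group[(c["lang"], c["topic"])].append(toxicity)

# Mean toxicity per (language, topic) pair, the unit of comparison in the abstract.
for (lang, topic), scores in sorted(scores_by_group.items()):
    print(f"{lang:>2} | {topic:<10} | mean toxicity = {mean(scores):.3f}")
```

Grouping scores by (language, topic) in this way is what would let one check, for example, whether Politics threads are consistently more toxic than Culture threads across all six languages.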
Related papers
- Polarized Patterns of Language Toxicity and Sentiment of Debunking Posts on Social Media [5.301808480190602]
The rise of misinformation and fake news in online political discourse poses significant challenges to democratic processes and public engagement.
We examined over 86 million debunking tweets and more than 4 million Reddit debunking comments to investigate the relationship between language toxicity, pessimism, and social polarization in debunking efforts.
We show that platform architecture affects informational complexity of user interactions, with Twitter promoting concentrated, uniform discourse and Reddit encouraging diverse, complex communication.
arXiv Detail & Related papers (2025-01-10T08:00:58Z)
- Multilingual and Explainable Text Detoxification with Parallel Corpora [58.83211571400692]
We extend the parallel text detoxification corpus to new languages.
We conduct a first-of-its-kind automated, explainable analysis of the descriptive features of both toxic and non-toxic sentences.
We then experiment with a novel text detoxification method inspired by the Chain-of-Thoughts reasoning approach.
arXiv Detail & Related papers (2024-12-16T12:08:59Z)
- Grounding Toxicity in Real-World Events across Languages [2.5398014196797605]
Events in the real world, like elections or conflicts, can initiate and escalate toxic behavior online.
We gathered Reddit data comprising 4.5 million comments from 31 thousand posts in six different languages.
We observe significant variations in toxicity, negative sentiment, and emotion expressions across different events and language communities.
arXiv Detail & Related papers (2024-05-22T15:38:53Z)
- Analyzing Toxicity in Deep Conversations: A Reddit Case Study [0.0]
This work employs a tree-based approach to understand how users behave concerning toxicity in public conversation settings.
We collect both the posts and the comment sections of the top 100 posts from 8 Reddit communities that allow profanity, totaling over 1 million responses.
We find that toxic comments increase the likelihood of subsequent toxic comments being produced in online conversations; a minimal sketch of this kind of parent-child measurement appears after this list.
arXiv Detail & Related papers (2024-04-11T16:10:44Z)
- Comprehensive Assessment of Toxicity in ChatGPT [49.71090497696024]
We evaluate the toxicity in ChatGPT by utilizing instruction-tuning datasets.
Prompts in creative writing tasks can be 2x more likely to elicit toxic responses.
Certain deliberately toxic prompts, designed in earlier studies, no longer yield harmful responses.
arXiv Detail & Related papers (2023-11-03T14:37:53Z)
- Twits, Toxic Tweets, and Tribal Tendencies: Trends in Politically Polarized Posts on Twitter [5.161088104035108]
We explore the role that partisanship and affective polarization play in contributing to toxicity on an individual level and a topic level on Twitter/X.
After collecting 89.6 million tweets from 43,151 Twitter/X users, we determine how several account-level characteristics, including partisanship, predict how often users post toxic content.
arXiv Detail & Related papers (2023-07-19T17:24:47Z)
- Analyzing Norm Violations in Live-Stream Chat [49.120561596550395]
We present the first NLP study dedicated to detecting norm violations in conversations on live-streaming platforms.
We define norm violation categories in live-stream chats and annotate 4,583 moderated comments from Twitch.
Our results show that appropriate contextual information can boost moderation performance by 35%.
arXiv Detail & Related papers (2023-05-18T05:58:27Z)
- Cross-Lingual Ability of Multilingual Masked Language Models: A Study of Language Structure [54.01613740115601]
We study three language properties: constituent order, composition and word co-occurrence.
Our main conclusion is that the contribution of constituent order and word co-occurrence is limited, while composition is more crucial to the success of cross-lingual transfer.
arXiv Detail & Related papers (2022-03-16T07:09:35Z)
- Beyond Plain Toxic: Detection of Inappropriate Statements on Flammable Topics for the Russian Language [76.58220021791955]
We present two text collections labelled according to a binary notion of inappropriateness and a multinomial notion of sensitive topic.
To objectivise the notion of inappropriateness, we define it in a data-driven way through crowdsourcing.
arXiv Detail & Related papers (2022-03-04T15:59:06Z)
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
- Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic Language Detection [75.54119209776894]
We investigate the effect of annotator identities (who) and beliefs (why) on toxic language annotations.
We consider posts with three characteristics: anti-Black language, African American English dialect, and vulgarity.
Our results show strong associations between annotator identity and beliefs and their ratings of toxicity.
arXiv Detail & Related papers (2021-11-15T18:58:20Z)
- Toxic Language Detection in Social Media for Brazilian Portuguese: New Dataset and Multilingual Analysis [4.251937086394346]
State-of-the-art BERT models were able to achieve 76% macro-F1 score using monolingual data in the binary case.
We show that large-scale monolingual data is still needed to create more accurate models.
arXiv Detail & Related papers (2020-10-09T13:05:19Z)
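The "Analyzing Toxicity in Deep Conversations" entry above reports that toxic comments raise the likelihood of subsequent toxic replies. Below is a minimal sketch of the kind of parent-child measurement behind such a claim, not the authors' actual pipeline; the toy toxicity scores, field names, and the 0.5 threshold are all assumptions.

```python
# Minimal sketch: conditional toxicity of replies given parent toxicity.
# Toy data; scores, field names, and the 0.5 threshold are assumptions.
from collections import defaultdict

# Each comment: parent id (None for top-level) and a toxicity score in [0, 1].
comments = {
    "c1": {"parent": None, "toxicity": 0.82},
    "c2": {"parent": "c1", "toxicity": 0.71},
    "c3": {"parent": "c1", "toxicity": 0.12},
    "c4": {"parent": None, "toxicity": 0.05},
    "c5": {"parent": "c4", "toxicity": 0.09},
}

THRESHOLD = 0.5  # assumed cut-off for labelling a comment "toxic"

# Count toxic replies, conditioned on whether the parent comment was toxic.
counts = defaultdict(lambda: {"toxic": 0, "total": 0})
for c in comments.values():
    parent_id = c["parent"]
    if parent_id is None:
        continue
    parent_toxic = comments[parent_id]["toxicity"] >= THRESHOLD
    counts[parent_toxic]["total"] += 1
    counts[parent_toxic]["toxic"] += int(c["toxicity"] >= THRESHOLD)

for parent_toxic, tally in counts.items():
    rate = tally["toxic"] / tally["total"]
    label = "toxic" if parent_toxic else "non-toxic"
    print(f"P(toxic reply | {label} parent) = {rate:.2f}")
```

If replies to toxic parents turn out toxic at a markedly higher rate than replies to non-toxic parents, that is the propagation effect the entry describes.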