Related papers: A Multi-Labeled Dataset for Indonesian Discourse: Examining Toxicity, Polarization, and Demographics Information

URL: http://arxiv.org/abs/2503.00417v1
Date: Sat, 01 Mar 2025 09:33:10 GMT
Title: A Multi-Labeled Dataset for Indonesian Discourse: Examining Toxicity, Polarization, and Demographics Information
Authors: Lucky Susanto, Musa Wijanarko, Prasetia Pratama, Zilu Tang, Fariz Akyas, Traci Hong, Ika Idris, Alham Aji, Derry Wijaya
Abstract summary: As the world's third-largest democracy, Indonesia faces growing concerns about the interplay between political polarization and online toxicity. Previous NLP research has not fully explored the relationship between toxicity and polarization. We present a novel multi-label Indonesian dataset that incorporates toxicity, polarization, and annotator demographic information.
Score: 2.8697660350772063
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Polarization is defined as divisive opinions held by two or more groups on substantive issues. As the world's third-largest democracy, Indonesia faces growing concerns about the interplay between political polarization and online toxicity, which is often directed at vulnerable minority groups. Despite the importance of this issue, previous NLP research has not fully explored the relationship between toxicity and polarization. To bridge this gap, we present a novel multi-label Indonesian dataset that incorporates toxicity, polarization, and annotator demographic information. Benchmarking this dataset using BERT-base models and large language models (LLMs) shows that polarization information enhances toxicity classification, and vice versa. Furthermore, providing demographic information significantly improves the performance of polarization classification.

Related papers

Measuring Social Media Polarization Using Large Language Models and Heuristic Rules [0.0]
This study systematically analyzes and quantifies affective polarization in discussions on divisive topics such as climate change and gun control. By combining AI-driven content annotation with domain-informed scoring, our framework offers a scalable and interpretable approach to measuring affective polarization.
arXiv Detail & Related papers (2026-01-02T01:11:58Z)
BIPOLAR: Polarization-based granular framework for LLM bias evaluation [0.0]
This study proposes a reusable, granular, and topic-agnostic framework to evaluate polarisation-related biases in large language models. Our approach combines polarisation-sensitive sentiment metrics with a synthetically generated balanced dataset of conflict-related statements. As a case study, we created a synthetic dataset that focusses on the Russia-Ukraine war, and we evaluated the bias in several LLMs.
arXiv Detail & Related papers (2025-08-14T20:44:19Z)
Geopolitical biases in LLMs: what are the "good" and the "bad" countries according to contemporary language models [52.00270888041742]
We introduce a novel dataset with neutral event descriptions and contrasting viewpoints from different countries. Our findings show significant geopolitical biases, with models favoring specific national narratives. Simple debiasing prompts had a limited effect on reducing these biases.
arXiv Detail & Related papers (2025-06-07T10:45:17Z)
Mitigating Subgroup Disparities in Multi-Label Speech Emotion Recognition: A Pseudo-Labeling and Unsupervised Learning Approach [53.824673312331626]
An Implicit Demography Inference (IDI) module uses k-means clustering to mitigate bias in Speech Emotion Recognition (SER). Experiments show that pseudo-labeling IDI reduces subgroup disparities, improving fairness metrics by over 28%. Unsupervised IDI yields more than a 4.6% improvement in fairness metrics with a drop of less than 3.6% in SER performance.
arXiv Detail & Related papers (2025-05-20T14:50:44Z)
Visual Polarization Measurement Using Counterfactual Image Generation [0.0]
We introduce the Polarization Measurement using Counterfactual Image Generation (PMCIG) method. We identify significant polarization in visual content, with notable variations across outlets and politicians. At the politician level, our results reveal substantial variation in polarized coverage, with Donald Trump and Barack Obama among the most polarizing figures.
arXiv Detail & Related papers (2025-03-13T16:32:07Z)
A More Advanced Group Polarization Measurement Approach Based on LLM-Based Agents and Graphs [5.285847977231642]
Measuring group polarization on social media presents several challenges that have not yet been addressed by existing solutions. We designed a solution based on a multi-agent system and used a graph-structured Community Sentiment Network (CSN) to represent polarization states. In summary, the proposed approach has significant value in terms of usability, accuracy, and interpretability.
arXiv Detail & Related papers (2024-11-19T03:29:17Z)
Robustness and Confounders in the Demographic Alignment of LLMs with Human Perceptions of Offensiveness [10.194622474615462]
Large language models (LLMs) are known to exhibit demographic biases, yet few studies systematically evaluate these biases across multiple datasets or account for confounding factors. Our findings reveal that while demographic traits, particularly race, influence alignment, these effects are inconsistent across datasets and often entangled with other factors.
arXiv Detail & Related papers (2024-11-13T19:08:23Z)
The Factuality Tax of Diversity-Intervened Text-to-Image Generation: Benchmark and Fact-Augmented Intervention [61.80236015147771]
We quantify the trade-off between using diversity interventions and preserving demographic factuality in T2I models. Experiments on DoFaiR reveal that diversity-oriented instructions increase the number of different gender and racial groups. We propose Fact-Augmented Intervention (FAI) to reflect on verbalized or retrieved factual information about gender and racial compositions of generation subjects in history.
arXiv Detail & Related papers (2024-06-29T09:09:42Z)
Leveraging Prototypical Representations for Mitigating Social Bias without Demographic Information [50.29934517930506]
DAFair is a novel approach to address social bias in language models. We leverage prototypical demographic texts and incorporate a regularization term during the fine-tuning process to mitigate bias.
arXiv Detail & Related papers (2024-03-14T15:58:36Z)
The Impact of Differential Feature Under-reporting on Algorithmic Fairness [86.275300739926]
We present an analytically tractable model of differential feature under-reporting. We then use it to characterize the impact of this kind of data bias on algorithmic fairness. Our results show that, in real world data settings, under-reporting typically leads to increasing disparities.
arXiv Detail & Related papers (2024-01-16T19:16:22Z)
Mitigating Framing Bias with Polarity Minimization Loss [56.24404488440295]
Framing bias plays a significant role in exacerbating political polarization by distorting the perception of actual events. We propose a new loss function that encourages the model to minimize the polarity difference between the polarized input articles to reduce framing bias.
arXiv Detail & Related papers (2023-11-03T09:50:23Z)
Stable Bias: Analyzing Societal Representations in Diffusion Models [72.27121528451528]
We propose a new method for exploring the social biases in Text-to-Image (TTI) systems. Our approach relies on characterizing the variation in generated images triggered by enumerating gender and ethnicity markers in the prompts. We leverage this method to analyze images generated by 3 popular TTI systems and find that while all of their outputs show correlations with US labor demographics, they also consistently under-represent marginalized identities to different extents.
arXiv Detail & Related papers (2023-03-20T19:32:49Z)
Unveiling the Hidden Agenda: Biases in News Reporting and Consumption [59.55900146668931]
We build a six-year dataset on the Italian vaccine debate and adopt a Bayesian latent space model to identify narrative and selection biases. We found a nonlinear relationship between biases and engagement, with higher engagement for extreme positions. Analysis of news consumption on Twitter reveals common audiences among news outlets with similar ideological positions.
arXiv Detail & Related papers (2023-01-14T18:58:42Z)
Exploring Polarization of Users Behavior on Twitter During the 2019 South American Protests [15.065938163384235]
We explore polarization on Twitter in a different context, namely the protest that paralyzed several countries in the South American region in 2019. By leveraging users' endorsement of politicians' tweets and hashtag campaigns with defined stances towards the protest (for or against), we construct a weakly labeled stance dataset with millions of users. We find empirical evidence of the "filter bubble" phenomenon during the event: not only are the user bases homogeneous in terms of stance, but the probability that a user transitions between media of different clusters is also low.
arXiv Detail & Related papers (2021-04-05T07:13:18Z)
Toxic Language Detection in Social Media for Brazilian Portuguese: New Dataset and Multilingual Analysis [4.251937086394346]
State-of-the-art BERT models were able to achieve 76% macro-F1 score using monolingual data in the binary case. We show that large-scale monolingual data is still needed to create more accurate models.
arXiv Detail & Related papers (2020-10-09T13:05:19Z)