A Multi-Labeled Dataset for Indonesian Discourse: Examining Toxicity, Polarization, and Demographics Information
- URL: http://arxiv.org/abs/2503.00417v1
- Date: Sat, 01 Mar 2025 09:33:10 GMT
- Title: A Multi-Labeled Dataset for Indonesian Discourse: Examining Toxicity, Polarization, and Demographics Information
- Authors: Lucky Susanto, Musa Wijanarko, Prasetia Pratama, Zilu Tang, Fariz Akyas, Traci Hong, Ika Idris, Alham Aji, Derry Wijaya
- Abstract summary: As the world's third-largest democracy, Indonesia faces growing concerns about the interplay between political polarization and online toxicity. Previous NLP research has not fully explored the relationship between toxicity and polarization. We present a novel multi-label Indonesian dataset that incorporates toxicity, polarization, and annotator demographic information.
- Score: 2.8697660350772063
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Polarization is defined as divisive opinions held by two or more groups on substantive issues. As the world's third-largest democracy, Indonesia faces growing concerns about the interplay between political polarization and online toxicity, which is often directed at vulnerable minority groups. Despite the importance of this issue, previous NLP research has not fully explored the relationship between toxicity and polarization. To bridge this gap, we present a novel multi-label Indonesian dataset that incorporates toxicity, polarization, and annotator demographic information. Benchmarking this dataset using BERT-base models and large language models (LLMs) shows that polarization information enhances toxicity classification, and vice versa. Furthermore, providing demographic information significantly improves the performance of polarization classification.
Related papers
- Visual Polarization Measurement Using Counterfactual Image Generation [0.0]
We introduce the Polarization Measurement using Counterfactual Image Generation (PMCIG) method.
We identify significant polarization in visual content, with notable variations across outlets and politicians.
At the politician level, our results reveal substantial variation in polarized coverage, with Donald Trump and Barack Obama among the most polarizing figures.
arXiv Detail & Related papers (2025-03-13T16:32:07Z)
- A More Advanced Group Polarization Measurement Approach Based on LLM-Based Agents and Graphs [5.285847977231642]
Measuring group polarization on social media presents several challenges that have not yet been addressed by existing solutions. We designed a solution based on a multi-agent system and used a graph-structured Community Sentiment Network (CSN) to represent polarization states. In summary, the proposed approach has significant value in terms of usability, accuracy, and interpretability.
arXiv Detail & Related papers (2024-11-19T03:29:17Z)
- Robustness and Confounders in the Demographic Alignment of LLMs with Human Perceptions of Offensiveness [10.194622474615462]
Large language models (LLMs) are known to exhibit demographic biases, yet few studies systematically evaluate these biases across multiple datasets or account for confounding factors.
Our findings reveal that while demographic traits, particularly race, influence alignment, these effects are inconsistent across datasets and often entangled with other factors.
arXiv Detail & Related papers (2024-11-13T19:08:23Z)
- The Factuality Tax of Diversity-Intervened Text-to-Image Generation: Benchmark and Fact-Augmented Intervention [61.80236015147771]
We quantify the trade-off between using diversity interventions and preserving demographic factuality in T2I models.
Experiments on DoFaiR reveal that diversity-oriented instructions increase the number of different gender and racial groups.
We propose Fact-Augmented Intervention (FAI) to reflect on verbalized or retrieved factual information about gender and racial compositions of generation subjects in history.
arXiv Detail & Related papers (2024-06-29T09:09:42Z)
- Leveraging Prototypical Representations for Mitigating Social Bias without Demographic Information [50.29934517930506]
DAFair is a novel approach to address social bias in language models.
We leverage prototypical demographic texts and incorporate a regularization term during the fine-tuning process to mitigate bias.
arXiv Detail & Related papers (2024-03-14T15:58:36Z)
- The Impact of Differential Feature Under-reporting on Algorithmic Fairness [86.275300739926]
We present an analytically tractable model of differential feature under-reporting.
We then use it to characterize the impact of this kind of data bias on algorithmic fairness.
Our results show that, in real world data settings, under-reporting typically leads to increasing disparities.
arXiv Detail & Related papers (2024-01-16T19:16:22Z)
- Mitigating Framing Bias with Polarity Minimization Loss [56.24404488440295]
Framing bias plays a significant role in exacerbating political polarization by distorting the perception of actual events.
We propose a new loss function that encourages the model to minimize the polarity difference between the polarized input articles to reduce framing bias.
arXiv Detail & Related papers (2023-11-03T09:50:23Z)
- Stable Bias: Analyzing Societal Representations in Diffusion Models [72.27121528451528]
We propose a new method for exploring the social biases in Text-to-Image (TTI) systems.
Our approach relies on characterizing the variation in generated images triggered by enumerating gender and ethnicity markers in the prompts.
We leverage this method to analyze images generated by 3 popular TTI systems and find that while all of their outputs show correlations with US labor demographics, they also consistently under-represent marginalized identities to different extents.
arXiv Detail & Related papers (2023-03-20T19:32:49Z)
- Unveiling the Hidden Agenda: Biases in News Reporting and Consumption [59.55900146668931]
We build a six-year dataset on the Italian vaccine debate and adopt a Bayesian latent space model to identify narrative and selection biases.
We found a nonlinear relationship between biases and engagement, with higher engagement for extreme positions.
Analysis of news consumption on Twitter reveals common audiences among news outlets with similar ideological positions.
arXiv Detail & Related papers (2023-01-14T18:58:42Z)
- Exploring Polarization of Users Behavior on Twitter During the 2019 South American Protests [15.065938163384235]
We explore polarization on Twitter in a different context, namely the protest that paralyzed several countries in the South American region in 2019.
By leveraging users' endorsement of politicians' tweets and hashtag campaigns with defined stances towards the protest (for or against), we construct a weakly labeled stance dataset with millions of users.
We find empirical evidence of the "filter bubble" phenomenon during the event: we show not only that the user bases are homogeneous in terms of stance, but also that the probability that a user transitions from media of different clusters is low.
arXiv Detail & Related papers (2021-04-05T07:13:18Z)
- Toxic Language Detection in Social Media for Brazilian Portuguese: New Dataset and Multilingual Analysis [4.251937086394346]
State-of-the-art BERT models were able to achieve 76% macro-F1 score using monolingual data in the binary case.
We show that large-scale monolingual data is still needed to create more accurate models.
arXiv Detail & Related papers (2020-10-09T13:05:19Z)