WEIRD ICWSM: How Western, Educated, Industrialized, Rich, and Democratic is Social Computing Research?
- URL: http://arxiv.org/abs/2406.02090v2
- Date: Tue, 11 Jun 2024 13:34:09 GMT
- Title: WEIRD ICWSM: How Western, Educated, Industrialized, Rich, and Democratic is Social Computing Research?
- Authors: Ali Akbar Septiandri, Marios Constantinides, Daniele Quercia
- Abstract summary: We evaluated the dependence on WEIRD populations in research presented at the AAAI ICWSM conference.
We found that 37% of these papers focused solely on data from Western countries.
The studies at ICWSM still predominantly examine populations from countries that are more Educated, Industrialized, and Rich.
- Score: 3.0829845709781725
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Much of the research in social computing analyzes data from social media platforms, which may inherently carry biases. An overlooked source of such bias is the over-representation of WEIRD (Western, Educated, Industrialized, Rich, and Democratic) populations, which might not accurately mirror the global demographic diversity. We evaluated the dependence on WEIRD populations in research presented at the AAAI ICWSM conference, the only venue whose proceedings are fully dedicated to social computing research. We did so by analyzing 494 papers published from 2018 to 2022, which included full research papers, dataset papers and posters. After filtering out papers that analyze synthetic datasets or that lack a clear country of origin, we were left with 420 papers, from which 188 participants in a crowdsourcing study, with full manual validation, extracted the data used to compute WEIRD scores. This data was then used to adapt existing WEIRD metrics so that they apply to social media data. We found that 37% of these papers focused solely on data from Western countries. This percentage is significantly lower than the percentages observed in research from the CHI (76%) and FAccT (84%) conferences, suggesting a greater diversity of dataset origins within ICWSM. However, the studies at ICWSM still predominantly examine populations from countries that are more Educated, Industrialized, and Rich in comparison to those in FAccT, with a special note on the 'Democratic' variable reflecting political freedoms and rights. This highlights the utility of social media data in shedding light on findings from countries with restricted political freedoms. Based on these insights, we recommend extensions of current "paper checklists" to include considerations about the WEIRD bias and call for the community to broaden research inclusivity by encouraging the use of diverse datasets from underrepresented regions.
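The headline figure in the abstract (the share of papers that rely solely on Western data, after dropping papers with no clear country of origin) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `WESTERN` set, the `western_only_share` helper, and the toy records are hypothetical and are not the authors' classification, code, or data.

```python
from __future__ import annotations

from typing import Iterable

# Hypothetical, deliberately tiny set of "Western" country codes.
# The paper's actual country classification is not given in this summary.
WESTERN = frozenset({"US", "GB", "DE", "FR", "CA", "AU"})


def western_only_share(
    papers: Iterable[set[str]], western: frozenset[str] = WESTERN
) -> float:
    """Return the fraction of usable papers whose countries are all in `western`.

    Papers with an empty country set are skipped, loosely mirroring the
    filtering step described in the abstract.
    """
    usable = [countries for countries in papers if countries]
    if not usable:
        raise ValueError("no papers with an identifiable country of origin")
    western_only = sum(1 for countries in usable if countries <= western)
    return western_only / len(usable)


# Toy records (made up): each set lists the countries a paper's data covers.
toy_papers = [{"US"}, {"US", "IN"}, {"BR"}, set(), {"GB", "DE"}]
share = western_only_share(toy_papers)
print(f"{share:.0%} of usable toy papers use only Western data")
```

A paper counts as Western-only here only if every country in its dataset is in the Western set, so a paper mixing US and Indian data does not count. This matches the "focused solely on data from Western countries" wording, though the paper's exact rule may differ.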
Related papers
- Fairness in LLM-Generated Surveys [0.5720786928479238]
Large Language Models (LLMs) excel in text generation and understanding, especially simulating socio-political and economic patterns.
This study examines how LLMs perform across diverse populations by analyzing public surveys from Chile and the United States.
In the United States, political identity and race significantly influence prediction accuracy, while in Chile, gender, education, and religious affiliation play more pronounced roles.
arXiv Detail & Related papers (2025-01-25T23:42:20Z)
- Transforming Social Science Research with Transfer Learning: Social Science Survey Data Integration with AI [0.4944564023471818]
Large-N nationally representative surveys, which have profoundly shaped American politics scholarship, represent related but distinct domains.
Our study introduces a novel application of transfer learning (TL) to address these gaps.
Models pre-trained on the Cooperative Election Study dataset are fine-tuned for use in the American National Election Studies dataset.
arXiv Detail & Related papers (2025-01-11T16:01:44Z)
- Bridging the Data Provenance Gap Across Text, Speech and Video [67.72097952282262]
We conduct the largest and first-of-its-kind longitudinal audit across modalities of popular text, speech, and video datasets.
Our manual analysis covers nearly 4000 public datasets between 1990-2024, spanning 608 languages, 798 sources, 659 organizations, and 67 countries.
We find that multimodal machine learning applications have overwhelmingly turned to web-crawled, synthetic, and social media platforms, such as YouTube, for their training sets.
arXiv Detail & Related papers (2024-12-19T01:30:19Z)
- Representation Bias in Political Sample Simulations with Large Language Models [54.48283690603358]
This study seeks to identify and quantify biases in simulating political samples with Large Language Models.
Using the GPT-3.5-Turbo model, we leverage data from the American National Election Studies, German Longitudinal Election Study, Zuobiao dataset, and China Family Panel Studies.
arXiv Detail & Related papers (2024-07-16T05:52:26Z)
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
- Position: AI/ML Influencers Have a Place in the Academic Process [82.2069685579588]
We investigate the role of social media influencers in enhancing the visibility of machine learning research.
We have compiled a comprehensive dataset of over 8,000 papers, spanning tweets from December 2018 to October 2023.
Our statistical and causal inference analysis reveals a significant increase in citations for papers endorsed by these influencers.
arXiv Detail & Related papers (2024-01-24T20:05:49Z)
- Challenges in Annotating Datasets to Quantify Bias in Under-represented Society [7.9342597513806865]
Benchmark bias datasets have been developed for binary gender classification and ethical/racial considerations.
Motivated by the lack of annotated datasets for quantifying bias in under-represented societies, we created benchmark datasets for the New Zealand (NZ) population.
This research outlines the manual annotation process, provides an overview of the challenges we encountered and lessons learnt, and presents recommendations for future research.
arXiv Detail & Related papers (2023-09-11T22:24:39Z)
- WEIRD FAccTs: How Western, Educated, Industrialized, Rich, and Democratic is FAccT? [8.12219922021227]
Studies conducted on Western, Educated, Industrialized, Rich, and Democratic (WEIRD) samples are considered atypical of the world's population.
This study aims to quantify the extent to which the ACM FAccT conference relies on WEIRD samples.
arXiv Detail & Related papers (2023-05-10T18:52:09Z)
- Fast Few shot Self-attentive Semi-supervised Political Inclination Prediction [12.472629584751509]
It is increasingly common now for policymakers/journalists to create online polls on social media to understand the political leanings of people in specific locations.
We introduce a self-attentive semi-supervised framework for political inclination detection to further that objective.
We found that the model is highly efficient even in resource-constrained settings.
arXiv Detail & Related papers (2022-09-21T12:07:16Z)
- Retiring Adult: New Datasets for Fair Machine Learning [47.27417042497261]
UCI Adult has served as the basis for the development and comparison of many algorithmic fairness interventions.
We reconstruct a superset of the UCI Adult data from available US Census sources and reveal idiosyncrasies of the UCI Adult dataset that limit its external validity.
Our primary contribution is a suite of new datasets that extend the existing data ecosystem for research on fair machine learning.
arXiv Detail & Related papers (2021-08-10T19:19:41Z)
- Leveraging Administrative Data for Bias Audits: Assessing Disparate Coverage with Mobility Data for COVID-19 Policy [61.60099467888073]
We show how linking administrative data can enable auditing mobility data for bias.
We show that older and non-white voters are less likely to be captured by mobility data.
We show that allocating public health resources based on such mobility data could disproportionately harm high-risk elderly and minority groups.
arXiv Detail & Related papers (2020-11-14T02:04:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.