On the Inference of Sociodemographics on Reddit
- URL: http://arxiv.org/abs/2502.05049v1
- Date: Fri, 07 Feb 2025 16:11:39 GMT
- Title: On the Inference of Sociodemographics on Reddit
- Authors: Federico Cinus, Corrado Monti, Paolo Bajardi, Gianmarco De Francisci Morales
- Abstract summary: We use a novel data set of more than 850k self-declarations on age, gender, and partisan affiliation from Reddit comments. We do so on two tasks: (i) predicting binary labels (classification); and (ii) predicting the prevalence of a demographic class among a set of users (quantification).
- Score: 5.524795406792588
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Inference of the sociodemographic attributes of social media users is an essential step for computational social science (CSS) research to link online and offline behavior. However, there is a lack of systematic evaluation and clear guidelines for optimal methodologies for this task on Reddit, one of today's largest social media platforms. In this study, we fill this gap by comparing state-of-the-art (SOTA) and probabilistic models. To this end, we first collect a novel data set of more than 850k self-declarations on age, gender, and partisan affiliation from Reddit comments. Then, we systematically compare alternatives to the widely used embedding-based model, as well as labeling techniques for the definition of the ground truth. We do so on two tasks: (i) predicting binary labels (classification); and (ii) predicting the prevalence of a demographic class among a set of users (quantification). Our findings reveal that Naive Bayes models not only offer transparency and interpretability by design, but also consistently outperform the SOTA. Specifically, they achieve an improvement in ROC AUC of up to 19% and maintain a mean absolute error (MAE) below 15% in quantification for large-scale data settings. Finally, we discuss best practices for researchers in CSS, emphasizing coverage, interpretability, reliability, and scalability. The code and model weights used for the experiments are publicly available at https://anonymous.4open.science/r/SDI-submission-5234
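The abstract names the model family (Naive Bayes) and the two evaluation tasks, but not the exact pipeline. The following is a minimal sketch, assuming a bag-of-words Multinomial Naive Bayes for task (i) and simple "classify and count" quantification for task (ii); the corpus, labels, and feature choices are illustrative placeholders, not the paper's setup.

```python
# Minimal sketch of the two tasks: (i) binary classification scored with
# ROC AUC, (ii) quantification (prevalence estimation) scored with MAE.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Placeholder corpus: one concatenated comment history per user, with a
# binary label (e.g., gender or partisan affiliation).
texts = ["example comment history ..."] * 100
labels = np.random.randint(0, 2, size=100)

X = CountVectorizer().fit_transform(texts)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.3, stratify=labels, random_state=0)

nb = MultinomialNB().fit(X_tr, y_tr)

# Task (i): classification, evaluated with ROC AUC.
scores = nb.predict_proba(X_te)[:, 1]
print("ROC AUC:", roc_auc_score(y_te, scores))

# Task (ii): quantification via "classify and count": the estimated
# prevalence is the mean predicted label over the user set.
prevalence_hat = nb.predict(X_te).mean()
print("prevalence MAE:", abs(prevalence_hat - y_te.mean()))
```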
Related papers
- Evaluating the Fairness of Discriminative Foundation Models in Computer
Vision [51.176061115977774]
We propose a novel taxonomy for bias evaluation of discriminative foundation models, such as Contrastive Language-Image Pretraining (CLIP).
We then systematically evaluate existing methods for mitigating bias in these models with respect to our taxonomy.
Specifically, we evaluate OpenAI's CLIP and OpenCLIP models for key applications, such as zero-shot classification, image retrieval and image captioning.
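For context, a minimal sketch of CLIP zero-shot classification, one of the applications audited here, using the OpenCLIP library; the model name, prompts, and image path are illustrative assumptions, not the paper's configuration.

```python
# Sketch of zero-shot classification with OpenCLIP: score an image against
# text prompts by cosine similarity of the embeddings.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.open("photo.jpg")).unsqueeze(0)  # hypothetical file
prompts = ["a photo of a doctor", "a photo of a nurse"]   # illustrative labels
text = tokenizer(prompts)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print(dict(zip(prompts, probs[0].tolist())))
```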
arXiv Detail & Related papers (2023-10-18T10:32:39Z) - Subjective Crowd Disagreements for Subjective Data: Uncovering
Meaningful CrowdOpinion with Population-level Learning [8.530934084017966]
We introduce CrowdOpinion, an unsupervised learning approach that uses language features and label distributions to pool similar items into larger samples of label distributions.
We use five publicly available benchmark datasets (with varying levels of annotator disagreements) from social media.
We also experiment in the wild using a dataset from Facebook, where annotations come from the platform itself by users reacting to posts.
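A rough sketch of the pooling idea: cluster items by language features, then average the annotators' label distributions within each cluster so that sparse per-item distributions borrow strength from similar items. The clustering method, cluster count, and data below are placeholders, not the paper's exact configuration.

```python
# Pool per-item annotator label distributions across similar items.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

texts = ["post one ...", "post two ...", "post three ..."]   # items
label_dists = np.array([[0.8, 0.2], [0.5, 0.5], [0.1, 0.9]])  # annotator label distributions

X = TfidfVectorizer().fit_transform(texts)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Each item's distribution becomes the mean over its cluster.
pooled = np.vstack([label_dists[clusters == clusters[i]].mean(axis=0)
                    for i in range(len(texts))])
print(pooled)
```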
arXiv Detail & Related papers (2023-07-07T22:09:46Z) - Using Imperfect Surrogates for Downstream Inference: Design-based
Supervised Learning for Social Science Applications of Large Language Models [0.2812395851874055]
Researchers in computational social science (CSS) analyze documents to explain social and political phenomena.
One increasingly common way to annotate documents cheaply at scale is through large language models.
We present a new algorithm for using imperfect annotation surrogates for downstream statistical analyses.
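To make the general idea concrete (this is a simplified difference estimator, not the paper's exact method): estimate a population quantity from cheap LLM surrogate labels, then correct the bias using a small, randomly sampled subset with expert ("gold") labels.

```python
# Bias-correct a prevalence estimate built from imperfect surrogate labels.
import numpy as np

rng = np.random.default_rng(0)
surrogate = rng.integers(0, 2, size=10_000).astype(float)  # LLM labels (noisy)
audit_idx = rng.choice(10_000, size=200, replace=False)    # random audit sample
gold = surrogate[audit_idx].copy()
gold[:50] = 1 - gold[:50]                                  # pretend some disagree

naive = surrogate.mean()                          # biased estimate
correction = (gold - surrogate[audit_idx]).mean() # bias estimated on the audit
print("corrected prevalence:", naive + correction)
```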
arXiv Detail & Related papers (2023-06-07T19:49:41Z) - ASPEST: Bridging the Gap Between Active Learning and Selective
Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain.
Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples.
In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
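A minimal sketch of the two ingredients combined here: abstain when the model is unsure (selective prediction), and spend the labeling budget on the most uncertain target-domain examples (active learning). The model, data, and threshold are placeholders, not ASPEST's actual components.

```python
# Selective prediction plus uncertainty-based querying under domain shift.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_src, y_src = rng.normal(size=(500, 5)), rng.integers(0, 2, 500)
X_tgt = rng.normal(loc=0.5, size=(300, 5))        # shifted target domain

model = LogisticRegression().fit(X_src, y_src)
conf = model.predict_proba(X_tgt).max(axis=1)

# Selective prediction: abstain below a confidence threshold.
tau = 0.8
preds = np.where(conf >= tau, model.predict(X_tgt), -1)  # -1 = abstain

# Active learning: query labels for the least confident target examples.
budget = 20
query_idx = np.argsort(conf)[:budget]
print("abstained on", (preds == -1).sum(), "examples; querying", query_idx[:5])
```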
arXiv Detail & Related papers (2023-04-07T23:51:07Z) - Non-Invasive Fairness in Learning through the Lens of Data Drift [88.37640805363317]
We show how to improve the fairness of Machine Learning models without altering the data or the learning algorithm.
We use a simple but key insight: the divergence of trends between different populations, and, consequently, between a learned model and minority populations, is analogous to data drift.
We explore two strategies (model-splitting and reweighing) to resolve this drift, aiming to improve the overall conformance of models to the underlying data.
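The abstract names reweighing as one strategy but not its exact form; a standard variant (a sketch, not necessarily the paper's) weights each (group, label) cell so that group and label become statistically independent in the reweighted training data.

```python
# Standard reweighing: w(g, y) = P(g) * P(y) / P(g, y).
import numpy as np

rng = np.random.default_rng(0)
group = rng.integers(0, 2, 1000)                              # protected attribute
label = (rng.random(1000) < 0.3 + 0.3 * group).astype(int)    # correlated label

weights = np.empty(1000)
for g in (0, 1):
    for y in (0, 1):
        cell = (group == g) & (label == y)
        expected = (group == g).mean() * (label == y).mean()
        weights[cell] = expected / cell.mean()  # >1 for under-represented cells

# These weights can be passed as sample_weight to most sklearn estimators.
print(weights[:10])
```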
arXiv Detail & Related papers (2023-03-30T17:30:42Z) - A soft nearest-neighbor framework for continual semi-supervised learning [35.957577587090604]
We propose an approach for continual semi-supervised learning where not all the data samples are labeled.
We leverage the power of nearest-neighbors to nonlinearly partition the feature space and flexibly model the underlying data distribution.
Our method works well on both low and high resolution images and scales seamlessly to more complex datasets.
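A minimal sketch of the soft nearest-neighbor idea: instead of a hard 1-NN vote, every labeled point contributes with a weight that decays smoothly with distance, yielding a soft, nonlinear partition of the feature space. The temperature and data are illustrative assumptions.

```python
# Soft nearest-neighbor prediction over a labeled memory.
import numpy as np

def soft_nn_predict(queries, memory_x, memory_y, n_classes, temperature=0.1):
    # Pairwise squared distances between queries and the labeled memory.
    d2 = ((queries[:, None, :] - memory_x[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / temperature)
    w /= w.sum(axis=1, keepdims=True)        # soft weights over neighbors
    onehot = np.eye(n_classes)[memory_y]
    return w @ onehot                        # soft class distribution per query

rng = np.random.default_rng(0)
memory_x, memory_y = rng.normal(size=(50, 2)), rng.integers(0, 3, 50)
print(soft_nn_predict(rng.normal(size=(4, 2)), memory_x, memory_y, 3))
```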
arXiv Detail & Related papers (2022-12-09T20:03:59Z) - Spuriosity Rankings: Sorting Data to Measure and Mitigate Biases [62.54519787811138]
We present a simple but effective method to measure and mitigate model biases caused by reliance on spurious cues.
We rank images within their classes based on spuriosity, proxied via deep neural features of an interpretable network.
Our results suggest that model bias due to spurious feature reliance is influenced far more by what the model is trained on than how it is trained.
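A sketch of the ranking step only: given per-image activations of feature units previously flagged as spurious (in the paper, via an interpretable network and human review), rank images within a class by their total spurious activation. The feature values and unit indices below are random placeholders.

```python
# Rank images in a class by a spuriosity score built from flagged units.
import numpy as np

rng = np.random.default_rng(0)
features = rng.random((100, 512))     # per-image deep features (placeholder)
spurious_units = [3, 17, 42]          # hypothetical units flagged as spurious cues

spuriosity = features[:, spurious_units].sum(axis=1)
ranking = np.argsort(-spuriosity)     # most spurious-reliant images first
low_spuriosity = ranking[-20:]        # e.g., measure bias on this split
print(ranking[:10])
```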
arXiv Detail & Related papers (2022-12-05T23:15:43Z) - D-BIAS: A Causality-Based Human-in-the-Loop System for Tackling
Algorithmic Bias [57.87117733071416]
We propose D-BIAS, a visual interactive tool that embodies a human-in-the-loop AI approach for auditing and mitigating social biases.
A user can detect the presence of bias against a group by identifying unfair causal relationships in the causal network.
For each interaction, say weakening/deleting a biased causal edge, the system uses a novel method to simulate a new (debiased) dataset.
arXiv Detail & Related papers (2022-08-10T03:41:48Z) - SimPLE: Similar Pseudo Label Exploitation for Semi-Supervised
Classification [24.386165255835063]
A common situation in classification tasks is having a large amount of data available for training, but only a small portion of it with class labels.
The goal of semi-supervised training, in this context, is to improve classification accuracy by leveraging information from the large amount of unlabeled data.
We propose a novel unsupervised objective that focuses on the less studied relationship between the high confidence unlabeled data that are similar to each other.
Our proposed SimPLE algorithm shows significant performance gains over previous algorithms on CIFAR-100 and Mini-ImageNet, and is on par with state-of-the-art methods.
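A simplified illustration of the pair-loss idea (not the exact SimPLE loss): among unlabeled examples whose pseudo-labels are high-confidence, pairs with similar pseudo-labels are pushed to agree. Thresholds and inputs are placeholders.

```python
# Pair agreement loss over high-confidence, mutually similar pseudo-labels.
import numpy as np

def pair_loss(probs, conf_thresh=0.95, sim_thresh=0.9):
    conf = probs.max(axis=1)
    p = probs[conf > conf_thresh]               # high-confidence subset
    sim = p @ p.T                               # pseudo-label similarity
    mask = (sim > sim_thresh) & ~np.eye(len(p), dtype=bool)
    if not mask.any():
        return 0.0
    # Cross entropy between each pair's predictions, over similar pairs only.
    i, j = np.nonzero(mask)
    return -(p[i] * np.log(p[j] + 1e-8)).sum(axis=1).mean()

probs = np.array([[0.97, 0.03], [0.96, 0.04], [0.5, 0.5], [0.02, 0.98]])
print(pair_loss(probs))
```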
arXiv Detail & Related papers (2021-03-30T23:48:06Z) - Automatic Face Understanding: Recognizing Families in Photos [6.131589026706621]
We build the largest database for kinship recognition.
Video dynamics, audio, and text captions can be used in the decision making of kinship recognition systems.
arXiv Detail & Related papers (2021-01-10T22:37:25Z) - Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
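A sketch of the transductive prototype refinement this builds on: prototypes start as support-set means and are updated with confidence-weighted query embeddings. In the paper the confidence measure is meta-learned; here a fixed softmax over negative distances stands in for it, and all embeddings are random placeholders.

```python
# Confidence-weighted prototype refinement for few-shot classification.
import numpy as np

rng = np.random.default_rng(0)
support = rng.normal(size=(2, 5, 8))     # (classes, shots, dim) embeddings
queries = rng.normal(size=(20, 8))       # unlabeled query embeddings

protos = support.mean(axis=1)            # initial prototypes: support means
for _ in range(3):                       # a few refinement steps
    d2 = ((queries[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    logits = -d2 - (-d2).max(axis=1, keepdims=True)  # stable softmax
    conf = np.exp(logits)
    conf /= conf.sum(axis=1, keepdims=True)          # per-query confidences
    # Blend support means with the confidence-weighted query mean.
    protos = (support.sum(axis=1) + conf.T @ queries) / (
        support.shape[1] + conf.sum(axis=0)[:, None])
print(protos)
```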
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.