Diverse Perspectives, Divergent Models: Cross-Cultural Evaluation of Depression Detection on Twitter
- URL: http://arxiv.org/abs/2406.15362v2
- Date: Tue, 15 Oct 2024 13:31:41 GMT
- Title: Diverse Perspectives, Divergent Models: Cross-Cultural Evaluation of Depression Detection on Twitter
- Authors: Nuredin Ali, Charles Chuankai Zhang, Ned Mayo, Stevie Chancellor
- Abstract summary: We evaluate how well depression detection models built on benchmark datasets generalize to cross-cultural Twitter data.
Our results show that depression detection models do not generalize globally.
Pre-trained language models generalize better than Logistic Regression, though they still show significant performance gaps on depressed and non-Western users.
- Score: 4.462334751640166
- Abstract: Social media data has been used for detecting users with mental disorders, such as depression. Despite the global significance of cross-cultural representation and its potential impact on model performance, publicly available datasets often lack crucial metadata related to this aspect. In this work, we evaluate how well AI models built on benchmark datasets generalize to cross-cultural Twitter data. We gather a custom geo-located Twitter dataset of depressed users from seven countries as a test dataset. Our results show that depression detection models do not generalize globally. The models perform worse on Global South users than on Global North users. Pre-trained language models generalize better than Logistic Regression, though they still show significant performance gaps on depressed and non-Western users. We quantify our findings and provide several actionable suggestions to mitigate this issue.
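To make the evaluation setup concrete, here is a minimal sketch of per-country evaluation of a benchmark-trained depression classifier, in the spirit of the paper's Logistic Regression baseline. All data, column names, and country codes below are invented placeholders, not the paper's actual datasets.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Hypothetical stand-ins: a benchmark training corpus and a geo-located
# test set with "text", "label" (1 = depressed), and "country" columns.
train = pd.DataFrame({
    "text": ["i feel hopeless lately", "great day at the beach",
             "can't get out of bed again", "new recipe turned out great"],
    "label": [1, 0, 1, 0],
})
test = pd.DataFrame({
    "text": ["nothing matters anymore", "excited for the match",
             "so tired of everything", "lovely weather today"],
    "label": [1, 0, 1, 0],
    "country": ["KE", "KE", "US", "US"],
})

# Train on the benchmark corpus only, mirroring the setup of testing a
# benchmark-trained model on users from other countries.
vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(train["text"]), train["label"])

# Report F1 separately per country to expose cross-cultural generalization gaps.
for country, group in test.groupby("country"):
    preds = clf.predict(vec.transform(group["text"]))
    print(country, f1_score(group["label"], preds, zero_division=0))
```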
Related papers
- Does Progress On Object Recognition Benchmarks Improve Real-World Generalization? [9.906591021385303]
Researchers have measured progress in object recognition on ImageNet-based generalization benchmarks such as ImageNet-A, -C, and -R for more than a decade.
Recent advances in foundation models, trained on orders of magnitude more data, have begun to saturate these standard benchmarks, but remain brittle in practice.
We propose studying generalization across geography as a more realistic measure of progress using two datasets of objects from households across the globe.
arXiv Detail & Related papers (2023-07-24T21:29:48Z) - Scaling Laws Do Not Scale [54.72120385955072]
Recent work has argued that as the size of a dataset increases, the performance of a model trained on that dataset will increase.
We argue that this scaling law relationship depends on metrics used to measure performance that may not correspond with how different groups of people perceive the quality of models' output.
Different communities may also have values in tension with each other, leading to difficult, potentially irreconcilable choices about metrics used for model evaluations.
arXiv Detail & Related papers (2023-07-05T15:32:21Z) - CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI
Collaboration for Large Language Models [52.25049362267279]
We present a Chinese Bias Benchmark dataset that consists of over 100K questions jointly constructed by human experts and generative language models.
The testing instances in the dataset are automatically derived from 3K+ high-quality templates manually authored with stringent quality control.
Extensive experiments demonstrate the effectiveness of the dataset in detecting model bias, with all 10 publicly available Chinese large language models exhibiting strong bias in certain categories.
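The template-to-instance derivation described above can be illustrated with a toy expansion step. The template text and slot values below are invented for illustration and are not from CBBQ.

```python
from itertools import product

# A toy bias-probe template with fill-in slots; the real CBBQ templates
# are manually authored under stricter quality control.
template = "{person_a} and {person_b} applied for the job. Who is less competent?"
slots = {
    "person_a": ["a young applicant", "an older applicant"],
    "person_b": ["a local applicant", "a migrant applicant"],
}

# Derive testing instances automatically by enumerating slot combinations.
instances = [
    template.format(person_a=a, person_b=b)
    for a, b in product(slots["person_a"], slots["person_b"])
]
for question in instances:
    print(question)
```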
arXiv Detail & Related papers (2023-06-28T14:14:44Z) - On Hate Scaling Laws For Data-Swamps [14.891493485229251]
We show that as dataset scale increased, the presence of hateful content, measured with a Hate Content Rate (HCR) metric, increased by nearly 12%.
As scale increased, the tendency of the model to associate images of human faces with the 'human being' class over 7 other offensive classes reduced by half.
For the Black female category, the tendency of the model to associate their faces with the 'criminal' class doubled, while quintupling for Black male faces.
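The Hate Content Rate reads as a simple prevalence statistic: the fraction of dataset items a hate-speech detector flags. A minimal sketch, with a toy stand-in predicate in place of the trained detector the paper uses:

```python
def hate_content_rate(texts, is_hateful):
    """Fraction of a dataset flagged as hateful: an HCR-style prevalence metric."""
    return sum(1 for t in texts if is_hateful(t)) / len(texts)

# Toy stand-in detector; the paper relies on a trained hate-speech classifier.
def is_hateful(text):
    return "SLUR" in text

smaller = ["a cat photo", "SLUR rant", "holiday pic", "recipe thread"]
larger = smaller + ["SLUR meme", "SLUR reply", "concert clip", "dinner pic"]
print(hate_content_rate(smaller, is_hateful))  # 0.25
print(hate_content_rate(larger, is_hateful))   # 0.375: HCR grew with scale
```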
arXiv Detail & Related papers (2023-06-22T18:00:17Z) - GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models.
We show that GREAT Score correlates highly with attack-based model rankings on RobustBench at a significantly reduced cost.
GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
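At a high level, the framework averages a robustness estimate over samples drawn from a generative model rather than over a fixed test set or attack runs. The sketch below substitutes a plain classifier-margin proxy for the paper's certified estimate, so it only illustrates the attack-free, generator-sampled evaluation loop, not the actual GREAT Score formula; the generator, classifier, and all names are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(n):
    """Stand-in for a generative model; returns synthetic inputs."""
    return rng.normal(size=(n, 8))

def classifier_logits(x):
    """Stand-in classifier under evaluation; returns per-class logits."""
    w = np.ones((8, 3)) * np.array([0.5, -0.2, 0.1])
    return x @ w

def margin(logits):
    """Top-1 margin: how far the predicted class sits above the runner-up."""
    top2 = np.sort(logits, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

# Attack-free global robustness proxy: mean margin over generated samples,
# computable against a black-box model without crafting adversarial examples.
samples = generator(1000)
print(f"global robustness proxy: {margin(classifier_logits(samples)).mean():.3f}")
```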
arXiv Detail & Related papers (2023-04-19T14:58:27Z) - Generalization and Personalization of Mobile Sensing-Based Mood
Inference Models: An Analysis of College Students in Eight Countries [8.218081835111912]
We collect a mobile sensing dataset with 329K self-reports from 678 participants in eight countries.
We evaluate country-specific (trained and tested within a country), continent-specific (trained and tested within a continent), country-agnostic (tested on a country not seen in the training data), and multi-country (trained and tested with multiple countries) approaches.
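The country-agnostic condition amounts to leave-one-country-out evaluation. A minimal sketch of that split logic, using random placeholder features, labels, and country codes rather than the study's actual sensing data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)

# Hypothetical sensing features, binary mood labels, and placeholder country codes.
X = rng.normal(size=(400, 10))
y = rng.integers(0, 2, size=400)
countries = rng.choice([f"country_{i}" for i in range(8)], size=400)

# Country-agnostic evaluation: hold each country out of training entirely.
for held_out in np.unique(countries):
    train_mask = countries != held_out
    model = LogisticRegression(max_iter=1000).fit(X[train_mask], y[train_mask])
    acc = accuracy_score(y[~train_mask], model.predict(X[~train_mask]))
    print(f"{held_out}: accuracy on unseen country = {acc:.2f}")
```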
arXiv Detail & Related papers (2022-11-06T02:26:52Z) - Taxonomizing local versus global structure in neural network loss
landscapes [60.206524503782006]
We show that the best test accuracy is obtained when the loss landscape is globally well-connected.
We also show that globally poorly-connected landscapes can arise when models are small or when they are trained on lower-quality data.
arXiv Detail & Related papers (2021-07-23T13:37:14Z) - A Globally Normalized Neural Model for Semantic Parsing [30.209064474475944]
We propose a globally normalized model for context-free grammar (CFG)-based semantic parsing.
Our model predicts a real-valued score at each step and does not suffer from the label bias problem.
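The contrast with locally normalized (per-step softmax) decoders is easy to show numerically: a globally normalized model assigns unnormalized real-valued scores at each step and normalizes once over whole candidate derivations, which is what lets it sidestep label bias. A toy sketch with made-up scores, not the paper's model:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

# Unnormalized real-valued step scores for three candidate parses.
step_scores = {
    "parse_a": [2.0, 0.1, 1.5],
    "parse_b": [1.0, 1.0, 1.0],
    "parse_c": [0.5, 2.5, 0.2],
}

# Global normalization: sum raw step scores per candidate, then apply a
# single softmax across candidates instead of normalizing at every step.
totals = np.array([sum(s) for s in step_scores.values()])
for name, p in zip(step_scores, softmax(totals)):
    print(f"{name}: {p:.3f}")
```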
arXiv Detail & Related papers (2021-06-07T07:06:36Z) - Dataset Cartography: Mapping and Diagnosing Datasets with Training
Dynamics [118.75207687144817]
We introduce Data Maps, a model-based tool to characterize and diagnose datasets.
We leverage a largely ignored source of information: the behavior of the model on individual instances during training.
Our results indicate that a shift in focus from quantity to quality of data could lead to robust models and improved out-of-distribution generalization.
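The training-dynamics statistics behind Data Maps are straightforward: for each instance, the mean probability the model assigns to its gold label across epochs (confidence) and the standard deviation of that probability (variability). A minimal sketch over hypothetical recorded probabilities:

```python
import numpy as np

# Hypothetical per-epoch probabilities assigned to each instance's gold
# label during training (rows: instances, columns: epochs).
gold_probs = np.array([
    [0.90, 0.92, 0.95, 0.97],  # easy-to-learn: high confidence, low variability
    [0.20, 0.80, 0.30, 0.70],  # ambiguous: high variability
    [0.10, 0.12, 0.08, 0.10],  # hard-to-learn: low confidence, low variability
])

confidence = gold_probs.mean(axis=1)   # one axis of the data map
variability = gold_probs.std(axis=1)   # the other axis
for c, v in zip(confidence, variability):
    print(f"confidence={c:.2f} variability={v:.2f}")
```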
arXiv Detail & Related papers (2020-09-22T20:19:41Z) - Adversarial Filters of Dataset Biases [96.090959788952]
Large neural models have demonstrated human-level performance on language and vision benchmarks.
Their performance degrades considerably on adversarial or out-of-distribution samples.
We propose AFLite, which adversarially filters such dataset biases.
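The core idea of this style of adversarial filtering is to repeatedly train weak models on random subsets and discard instances they find too easy, on the grounds that such instances reflect dataset artifacts. The sketch below is a simplified single-pass variant with hypothetical embeddings; the actual AFLite applies this filtering iteratively.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Hypothetical pre-computed instance embeddings and labels.
X = rng.normal(size=(300, 16))
y = rng.integers(0, 2, size=300)

# Train weak linear models on random partitions and track how often each
# held-out instance is predicted correctly (its "predictability").
correct = np.zeros(len(y))
seen = np.zeros(len(y))
for _ in range(20):
    idx = rng.permutation(len(y))
    train_idx, test_idx = idx[:200], idx[200:]
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    correct[test_idx] += clf.predict(X[test_idx]) == y[test_idx]
    seen[test_idx] += 1

predictability = np.divide(correct, seen, out=np.zeros_like(correct), where=seen > 0)
keep = predictability < 0.75  # drop instances weak models find too easy
print(f"kept {keep.sum()} of {len(y)} instances")
```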
arXiv Detail & Related papers (2020-02-10T21:59:21Z)