Towards Algorithmic Fidelity: Mental Health Representation across Demographics in Synthetic vs. Human-generated Data
- URL: http://arxiv.org/abs/2403.16909v1
- Date: Mon, 25 Mar 2024 16:21:25 GMT
- Title: Towards Algorithmic Fidelity: Mental Health Representation across Demographics in Synthetic vs. Human-generated Data
- Authors: Shinka Mori, Oana Ignat, Andrew Lee, Rada Mihalcea
- Abstract summary: We develop HEADROOM, a synthetic dataset of 3,120 posts about depression-triggering stressors.
We conduct semantic and lexical analyses to identify the predominant stressors for each demographic group.
We present the procedures to generate queries to develop depression data using GPT-3, and conduct analyses to uncover the types of stressors it assigns to demographic groups.
- Score: 27.13970925299262
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Synthetic data generation has the potential to impact applications and domains with scarce data. However, before such data is used for sensitive tasks such as mental health, we need an understanding of how different demographics are represented in it. In our paper, we analyze the potential of producing synthetic data using GPT-3 by exploring the various stressors it attributes to different race and gender combinations, to provide insight for future researchers looking into using LLMs for data generation. Using GPT-3, we develop HEADROOM, a synthetic dataset of 3,120 posts about depression-triggering stressors, by controlling for race, gender, and time frame (before and after COVID-19). Using this dataset, we conduct semantic and lexical analyses to (1) identify the predominant stressors for each demographic group; and (2) compare our synthetic data to a human-generated dataset. We present the procedures to generate queries to develop depression data using GPT-3, and conduct analyses to uncover the types of stressors it assigns to demographic groups, which could be used to test the limitations of LLMs for synthetic data generation for depression data. Our findings show that synthetic data mimics some of the human-generated data distribution for the predominant depression stressors across diverse demographics.
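As a rough illustration of what "controlling for race, gender, and time frame" could look like when building queries, the sketch below enumerates every combination of the three attributes and fills a prompt template for each. The category values, the template wording, and the names build_prompts and samples_per_cell are hypothetical placeholders chosen for this example; the abstract does not give the authors' actual categories or query text, and no model call is made here.

```python
from itertools import product

# Hypothetical category values, for illustration only;
# the abstract does not list the groups the authors used.
RACES = ["Group A", "Group B"]
GENDERS = ["woman", "man"]
TIME_FRAMES = ["before COVID-19", "after COVID-19"]

# Hypothetical prompt wording; this is not the authors' actual query.
TEMPLATE = (
    "Write a short post by a {race} {gender}, {time_frame}, "
    "describing a stressor that triggered their depression."
)


def build_prompts(races, genders, time_frames, samples_per_cell=1):
    """Return one record per (race, gender, time frame, sample index)."""
    prompts = []
    for race, gender, time_frame in product(races, genders, time_frames):
        for i in range(samples_per_cell):
            prompts.append(
                {
                    "race": race,
                    "gender": gender,
                    "time_frame": time_frame,
                    "sample": i,
                    "prompt": TEMPLATE.format(
                        race=race, gender=gender, time_frame=time_frame
                    ),
                }
            )
    return prompts


if __name__ == "__main__":
    grid = build_prompts(RACES, GENDERS, TIME_FRAMES, samples_per_cell=2)
    print(len(grid))          # number of prompts in the grid
    print(grid[0]["prompt"])  # first filled-in prompt
```

Keeping the demographic attributes alongside each prompt makes it straightforward to group the generated posts by demographic cell later, which is the kind of per-group comparison the abstract describes.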
Related papers
- Exploring the Impact of Synthetic Data for Aerial-view Human Detection [17.41001388151408]
Aerial-view human detection has a large demand for large-scale data to capture more diverse human appearances.
Synthetic data can be a good resource to expand data, but the domain gap with real-world data is the biggest obstacle to its use in training.
arXiv Detail & Related papers (2024-05-24T04:19:48Z)
- A Demographic-Conditioned Variational Autoencoder for fMRI Distribution Sampling and Removal of Confounds [49.34500499203579]
We create a variational autoencoder (VAE)-based model, DemoVAE, to decorrelate fMRI features from demographics.
We generate high-quality synthetic fMRI data based on user-supplied demographics.
arXiv Detail & Related papers (2024-05-13T17:49:20Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Synthetic is all you need: removing the auxiliary data assumption for membership inference attacks against synthetic data [9.061271587514215]
We show how this assumption can be removed, allowing for MIAs to be performed using only the synthetic data.
Our results show that MIAs are still successful, across two real-world datasets and two synthetic data generators.
arXiv Detail & Related papers (2023-07-04T13:16:03Z)
- Exploring the Potential of AI-Generated Synthetic Datasets: A Case Study on Telematics Data with ChatGPT [0.0]
This research delves into the construction and utilization of synthetic datasets, specifically within the telematics sphere, leveraging OpenAI's powerful language model, ChatGPT.
To illustrate this data creation process, a hands-on case study is conducted, focusing on the generation of a synthetic telematics dataset.
arXiv Detail & Related papers (2023-06-23T15:15:13Z)
- Generating Faithful Synthetic Data with Large Language Models: A Case Study in Computational Social Science [13.854807858791652]
We tackle a pervasive problem in synthetic data generation: its generative distribution often differs from the distribution of real-world data researchers care about.
We study three strategies to increase the faithfulness of synthetic data: grounding, filtering, and taxonomy-based generation.
We conclude this paper with some recommendations on how to generate high(er)-fidelity synthetic data for specific tasks.
arXiv Detail & Related papers (2023-05-24T11:27:59Z)
- Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z)
- SynBody: Synthetic Dataset with Layered Human Models for 3D Human Perception and Modeling [93.60731530276911]
We introduce a new synthetic dataset, SynBody, with three appealing features.
The dataset comprises 1.2M images with corresponding accurate 3D annotations, covering 10,000 human body models, 1,187 actions, and various viewpoints.
arXiv Detail & Related papers (2023-03-30T13:30:12Z)
- Label scarcity in biomedicine: Data-rich latent factor discovery enhances phenotype prediction [102.23901690661916]
Low-dimensional embedding spaces can be derived from the UK Biobank population dataset to enhance data-scarce prediction of health indicators, lifestyle and demographic characteristics.
Performance gains from semi-supervision approaches will probably become an important ingredient for various medical data science applications.
arXiv Detail & Related papers (2021-10-12T16:25:50Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- Measuring Utility and Privacy of Synthetic Genomic Data [3.635321290763711]
We provide the first evaluation of the utility and the privacy protection of five state-of-the-art models for generating synthetic genomic data.
Overall, there is no single approach for generating synthetic genomic data that performs well across the board.
arXiv Detail & Related papers (2021-02-05T17:41:01Z)