Socially Aware Synthetic Data Generation for Suicidal Ideation Detection
Using Large Language Models
- URL: http://arxiv.org/abs/2402.01712v1
- Date: Thu, 25 Jan 2024 18:25:05 GMT
- Title: Socially Aware Synthetic Data Generation for Suicidal Ideation Detection
Using Large Language Models
- Authors: Hamideh Ghanadian, Isar Nejadgholi, Hussein Al Osman
- Abstract summary: We introduce an innovative strategy that leverages the capabilities of generative AI models to create synthetic data for suicidal ideation detection.
We benchmarked against state-of-the-art NLP classification models, specifically, those centered around the BERT family structures.
Our synthetic data-driven method, informed by social factors, yields a consistent F1-score of 0.82 for both BERT-family models.
- Score: 8.832297887534445
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Suicidal ideation detection is a vital research area that holds great
potential for improving mental health support systems. However, the sensitivity
surrounding suicide-related data poses challenges in accessing large-scale,
annotated datasets necessary for training effective machine learning models. To
address this limitation, we introduce an innovative strategy that leverages the
capabilities of generative AI models, such as ChatGPT, Flan-T5, and Llama, to
create synthetic data for suicidal ideation detection. Our data generation
approach is grounded in social factors extracted from psychology literature and
aims to ensure coverage of essential information related to suicidal ideation.
In our study, we benchmarked against state-of-the-art NLP classification
models, specifically, those centered around the BERT family structures. When
trained on the real-world dataset, UMD, these conventional models tend to yield
F1-scores ranging from 0.75 to 0.87. Our synthetic data-driven method, informed
by social factors, offers consistent F1-scores of 0.82 for both models,
suggesting that the richness of topics in synthetic data can bridge the
performance gap across different model complexities. Most impressively, when we
combined a mere 30% of the UMD dataset with our synthetic data, we witnessed a
substantial increase in performance, achieving an F1-score of 0.88 on the UMD
test set. Such results underscore the cost-effectiveness and potential of our
approach in confronting major challenges in the field, such as data scarcity
and the quest for diversity in data representation.
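The mixing experiment described in the abstract (a 30% subset of the real data combined with synthetic data, evaluated by F1-score) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names `mix_real_and_synthetic` and `f1_score`, the `(text, label)` tuple layout, and the placeholder examples are all assumptions made for illustration, and the abstract does not say which F1 averaging the paper uses, so a plain binary F1 is shown.

```python
import random


def mix_real_and_synthetic(real, synthetic, real_fraction=0.3, seed=0):
    """Sample a fraction of the real examples and combine them with all synthetic ones.

    `real` and `synthetic` are lists of (text, label) tuples (hypothetical layout).
    """
    rng = random.Random(seed)
    k = round(len(real) * real_fraction)   # number of real examples to keep
    sampled_real = rng.sample(real, k)     # sampling without replacement
    combined = sampled_real + list(synthetic)
    rng.shuffle(combined)                  # avoid ordering effects during training
    return combined


def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the `positive` class (assumed metric variant)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Placeholder data only; no real dataset content is reproduced here.
    real = [(f"real-{i}", i % 2) for i in range(10)]
    synthetic = [(f"synthetic-{i}", i % 2) for i in range(5)]
    train = mix_real_and_synthetic(real, synthetic)
    print(len(train), "training examples")
    print("F1:", f1_score([1, 1, 0, 0], [1, 0, 0, 1]))
```

In a real pipeline the combined list would be passed to a classifier fine-tuning routine, and the F1-score would be computed on a held-out real test set rather than on toy labels.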
Related papers
- Zero-shot generation of synthetic neurosurgical data with large language models [0.7373617024876725]
This study aims to evaluate the capability of zero-shot generation of synthetic neurosurgical data with a large language model (LLM), GPT-4o.
Data synthesized with GPT-4o can effectively augment clinical data with small sample sizes, and train ML models for prediction of neurosurgical outcomes.
arXiv Detail & Related papers (2025-02-13T18:21:15Z)
- Creating Artificial Students that Never Existed: Leveraging Large Language Models and CTGANs for Synthetic Data Generation [2.4374097382908477]
We investigate whether synthetic data can be leveraged to create artificial students for serving learning analytics models.
Our results demonstrate the strong potential of these methods to produce high-quality synthetic datasets that resemble real student data.
arXiv Detail & Related papers (2025-01-03T12:52:51Z)
- Evaluating Language Models as Synthetic Data Generators [74.80905172696366]
AgoraBench is a benchmark that provides standardized settings and metrics to evaluate LMs' data generation abilities.
Through synthesizing 1.26 million training instances using 6 LMs and training 99 student models, we uncover key insights about LMs' data generation capabilities.
arXiv Detail & Related papers (2024-12-04T19:20:32Z)
- Second FRCSyn-onGoing: Winning Solutions and Post-Challenge Analysis to Improve Face Recognition with Synthetic Data [104.30479583607918]
The 2nd FRCSyn-onGoing challenge is based on the 2nd Face Recognition Challenge in the Era of Synthetic Data (FRCSyn), originally launched at CVPR 2024.
We focus on exploring the use of synthetic data both individually and in combination with real data to solve current challenges in face recognition.
arXiv Detail & Related papers (2024-12-02T11:12:01Z)
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
- Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z)
- FLIGAN: Enhancing Federated Learning with Incomplete Data using GAN [1.5749416770494706]
Federated Learning (FL) provides a privacy-preserving mechanism for distributed training of machine learning models on networked devices.
We propose FLIGAN, a novel approach to address the issue of data incompleteness in FL.
Our methodology adheres to FL's privacy requirements by generating synthetic data in a federated manner without sharing the actual data in the process.
arXiv Detail & Related papers (2024-03-25T16:49:38Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations [21.583825474908334]
We study how the performance of models trained on synthetic data may vary with the subjectivity of classification.
Our results indicate that subjectivity, at both the task level and instance level, is negatively associated with the performance of the model trained on synthetic data.
arXiv Detail & Related papers (2023-10-11T19:51:13Z)
- Does Synthetic Data Make Large Language Models More Efficient? [0.0]
This paper explores the nuances of synthetic data generation in NLP.
We highlight its advantages, including data augmentation potential and the introduction of structured variety.
We demonstrate the impact of template-based synthetic data on the performance of modern transformer models.
arXiv Detail & Related papers (2023-10-11T19:16:09Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.