Related papers: Large Language Models Perform on Par with Experts Identifying Mental Health Factors in Adolescent Online Forums

Large Language Models Perform on Par with Experts Identifying Mental Health Factors in Adolescent Online Forums

URL: http://arxiv.org/abs/2404.16461v2
Date: Fri, 26 Apr 2024 11:36:28 GMT
Title: Large Language Models Perform on Par with Experts Identifying Mental Health Factors in Adolescent Online Forums
Authors: Isabelle Lorge, Dan W. Joyce, Andrey Kormilitzin,
Abstract summary: Mental health in children and adolescents has been steadily deteriorating over the past few years. We create a new dataset of Reddit posts from adolescents aged 12-19 annotated by expert psychiatrists for the following categories: TRAUMA, PRECARITY, CONDITION, SYMPTOMS, SUICIDALITY and TREATMENT. We create two synthetic datasets to assess whether LLMs perform better when annotating data as they generate it. We find GPT4 to be on par with human inter-annotator agreement and performance on synthetic data to be substantially higher.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Mental health in children and adolescents has been steadily deteriorating over the past few years. The recent advent of Large Language Models (LLMs) offers much hope for cost and time efficient scaling of monitoring and intervention, yet despite specifically prevalent issues such as school bullying and eating disorders, previous studies on have not investigated performance in this domain or for open information extraction where the set of answers is not predetermined. We create a new dataset of Reddit posts from adolescents aged 12-19 annotated by expert psychiatrists for the following categories: TRAUMA, PRECARITY, CONDITION, SYMPTOMS, SUICIDALITY and TREATMENT and compare expert labels to annotations from two top performing LLMs (GPT3.5 and GPT4). In addition, we create two synthetic datasets to assess whether LLMs perform better when annotating data as they generate it. We find GPT4 to be on par with human inter-annotator agreement and performance on synthetic data to be substantially higher, however we find the model still occasionally errs on issues of negation and factuality and higher performance on synthetic data is driven by greater complexity of real data rather than inherent advantage.

Related papers

LSM-2: Learning from Incomplete Wearable Sensor Data [65.58595667477505]
This paper introduces the second generation of Large Sensor Model (LSM-2) with Adaptive and Inherited Masking (AIM)<n>AIM learns robust representations directly from incomplete data without requiring explicit imputation.<n>Our LSM-2 with AIM achieves the best performance across a diverse range of tasks, including classification, regression and generative modeling.
arXiv Detail & Related papers (2025-06-05T17:57:11Z)
LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment. We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews. Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
arXiv Detail & Related papers (2025-01-07T08:49:04Z)
Evaluating Language Models as Synthetic Data Generators [74.80905172696366]
AgoraBench is a benchmark that provides standardized settings and metrics to evaluate LMs' data generation abilities. Through synthesizing 1.26 million training instances using 6 LMs and training 99 student models, we uncover key insights about LMs' data generation capabilities.
arXiv Detail & Related papers (2024-12-04T19:20:32Z)
Synthetic Data Generation with LLM for Improved Depression Prediction [5.508617844957542]
We propose a pipeline for Large Language Models to generate synthetic data to improve the performance of depression prediction models. Not only was the synthetic data satisfactory in terms of fidelity and privacy-preserving metrics, it also balanced the distribution of severity in the training dataset.
arXiv Detail & Related papers (2024-11-26T18:31:14Z)
Chatting Up Attachment: Using LLMs to Predict Adult Bonds [0.0]
We use GPT-4 and Claude 3 Opus to create agents that simulate adults with varying profiles, childhood memories, and attachment styles. We evaluate our models using a transcript dataset from 9 humans who underwent the same interview protocol, analyzed and labeled by mental health professionals. Our findings indicate that training the models using only synthetic data achieves performance comparable to training the models on human data.
arXiv Detail & Related papers (2024-08-31T04:29:19Z)
Utilizing Large Language Models to Generate Synthetic Data to Increase the Performance of BERT-Based Neural Networks [0.7071166713283337]
We created datasets large enough to train machine learning models. Our goal is to label behaviors corresponding to autism criteria. Augmenting data increased recall by 13% but decreased precision by 16%.
arXiv Detail & Related papers (2024-05-08T03:18:12Z)
Groundedness in Retrieval-augmented Long-form Generation: An Empirical Study [61.74571814707054]
We evaluate whether every generated sentence is grounded in retrieved documents or the model's pre-training data. Across 3 datasets and 4 model families, our findings reveal that a significant fraction of generated sentences are consistently ungrounded. Our results show that while larger models tend to ground their outputs more effectively, a significant portion of correct answers remains compromised by hallucinations.
arXiv Detail & Related papers (2024-04-10T14:50:10Z)
ChatGPT Based Data Augmentation for Improved Parameter-Efficient Debiasing of LLMs [65.9625653425636]
Large Language models (LLMs) exhibit harmful social biases. This work introduces a novel approach utilizing ChatGPT to generate synthetic training data.
arXiv Detail & Related papers (2024-02-19T01:28:48Z)
Detecting the Clinical Features of Difficult-to-Treat Depression using Synthetic Data from Large Language Models [0.20971479389679337]
We seek to develop a Large Language Model (LLM)-based tool capable of interrogating routinely-collected, narrative (free-text) electronic health record data. We use LLM-generated synthetic data (GPT3.5) and a Non-Maximum Suppression (NMS) algorithm to train a BERT-based span extraction model. We show it is possible to obtain good overall performance (0.70 F1 across polarity) on real clinical data on a set of as many as 20 different factors, and high performance (0.85 F1 with 0.95 precision) on a subset of important DTD
arXiv Detail & Related papers (2024-02-12T13:34:33Z)
"Knowing When You Don't Know": A Multilingual Relevance Assessment Dataset for Robust Retrieval-Augmented Generation [90.09260023184932]
Retrieval-Augmented Generation (RAG) grounds Large Language Model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations. NoMIRACL is a human-annotated dataset for evaluating LLM robustness in RAG across 18 typologically diverse languages. We measure relevance assessment using: (i) hallucination rate, measuring model tendency to hallucinate, when the answer is not present in passages in the non-relevant subset, and (ii) error rate, measuring model inaccuracy to recognize relevant passages in the relevant subset.
arXiv Detail & Related papers (2023-12-18T17:18:04Z)
The Parrot Dilemma: Human-Labeled vs. LLM-augmented Data in Classification Tasks [0.0]
We compare the use of human-labeled data with synthetically generated data from GPT-4 and Llama-2 in ten distinct CSS classification tasks. Our findings reveal that models trained on human-labeled data consistently exhibit superior or comparable performance compared to their synthetically augmented counterparts.
arXiv Detail & Related papers (2023-04-26T23:09:02Z)
Large Language Models for Healthcare Data Augmentation: An Example on Patient-Trial Matching [49.78442796596806]
We propose an innovative privacy-aware data augmentation approach for patient-trial matching (LLM-PTM) Our experiments demonstrate a 7.32% average improvement in performance using the proposed LLM-PTM method, and the generalizability to new data is improved by 12.12%.
arXiv Detail & Related papers (2023-03-24T03:14:00Z)
Does Synthetic Data Generation of LLMs Help Clinical Text Mining? [51.205078179427645]
We investigate the potential of OpenAI's ChatGPT to aid in clinical text mining. We propose a new training paradigm that involves generating a vast quantity of high-quality synthetic data. Our method has resulted in significant improvements in the performance of downstream tasks.
arXiv Detail & Related papers (2023-03-08T03:56:31Z)
DeepRite: Deep Recurrent Inverse TreatmEnt Weighting for Adjusting Time-varying Confounding in Modern Longitudinal Observational Data [68.29870617697532]
We propose Deep Recurrent Inverse TreatmEnt weighting (DeepRite) for time-varying confounding in longitudinal data. DeepRite is shown to recover the ground truth from synthetic data, and estimate unbiased treatment effects from real data.
arXiv Detail & Related papers (2020-10-28T15:05:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.