Related papers: Reddit-Impacts: A Named Entity Recognition Dataset for Analyzing Clinical and Social Effects of Substance Use Derived from Social Media

URL: http://arxiv.org/abs/2405.06145v1
Date: Thu, 9 May 2024 23:43:57 GMT
Title: Reddit-Impacts: A Named Entity Recognition Dataset for Analyzing Clinical and Social Effects of Substance Use Derived from Social Media
Authors: Yao Ge, Sudeshna Das, Karen O'Connor, Mohammed Ali Al-Garadi, Graciela Gonzalez-Hernandez, Abeed Sarker
Score: 6.138126219622993
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Substance use disorders (SUDs) are a growing concern globally, necessitating enhanced understanding of the problem and its trends through data-driven research. Social media are unique and important sources of information about SUDs, particularly since the data in such sources are often generated by people with lived experiences. In this paper, we introduce Reddit-Impacts, a challenging Named Entity Recognition (NER) dataset curated from subreddits dedicated to discussions on prescription and illicit opioids, as well as medications for opioid use disorder. The dataset specifically concentrates on the lesser-studied, yet critically important, aspects of substance use--its clinical and social impacts. We collected data from chosen subreddits using the publicly available Application Programming Interface for Reddit. We manually annotated text spans representing clinical and social impacts reported by people who also reported personal nonmedical use of substances including but not limited to opioids, stimulants and benzodiazepines. Our objective is to create a resource that can enable the development of systems that can automatically detect clinical and social impacts of substance use from text-based social media data. The successful development of such systems may enable us to better understand how nonmedical use of substances affects individual health and societal dynamics, aiding the development of effective public health strategies. In addition to creating the annotated data set, we applied several machine learning models to establish baseline performances. Specifically, we experimented with transformer models like BERT, and RoBERTa, one few-shot learning model DANN by leveraging the full training dataset, and GPT-3.5 by using one-shot learning, for automatic NER of clinical and social impacts. The dataset has been made available through the 2024 SMM4H shared tasks.

Related papers

A Comprehensive Review of Datasets for Clinical Mental Health AI Systems [55.67299586253951]
We present the first comprehensive survey of clinical mental health datasets relevant to the training and development of AI-powered clinical assistants. Our survey identifies critical gaps such as a lack of longitudinal data, limited cultural and linguistic representation, inconsistent collection and annotation standards, and a lack of modalities in synthetic data.
arXiv Detail & Related papers (2025-08-13T13:42:35Z)
SABIA: An AI-Powered Tool for Detecting Opioid-Related Behaviors on Social Media [5.191923980821674]
Social media platforms have become valuable tools for understanding public health challenges by offering insights into patient behaviors, medication use, and mental health issues. This study addresses the issue of opioid-related user behavior on social media, including informal expressions, slang terms, and misspelled or coded language. A new dataset was constructed from Reddit posts, identifying opioid user behaviors across five classes: Dealers, Active Opioid Users, Recovered Users, Prescription Users, and Non-Users, supported by detailed annotation guidelines. Results show that SABIA achieved benchmark performance, outperforming the baseline (Logistic Regression, LR = 0.86) and improving accuracy by
arXiv Detail & Related papers (2025-08-12T06:52:41Z)
RSD-15K: A Large-Scale User-Level Annotated Dataset for Suicide Risk Detection on Social Media [0.0]
Social media is an important platform for individuals to express emotions and seek help. This paper introduces a large-scale dataset containing 15,000 user-level posts. Compared with existing datasets, this dataset retains complete user posting time sequence information.
arXiv Detail & Related papers (2025-07-14T09:26:26Z)
Crowdsourcing-Based Knowledge Graph Construction for Drug Side Effects Using Large Language Models with an Application on Semaglutide [8.621699521476371]
Social media is a rich source of real-world data that captures valuable patient experience information for pharmacovigilance. We present a framework that leverages large language models (LLMs) to extract medication side effects from social media and organize them into a knowledge graph (KG). Using the constructed knowledge graph, we perform comprehensive analyses to investigate reported side effects across different semaglutide brands over time.
arXiv Detail & Related papers (2025-04-06T03:47:44Z)
Opioid Named Entity Recognition (ONER-2025) from Reddit [5.641312824886231]
Social media platforms like Reddit provide vast amounts of unstructured data that offer insights into public perceptions, discussions, and experiences related to opioid use. This study leverages Natural Language Processing (NLP), specifically Opioid Named Entity Recognition (ONER-2025), to extract actionable information from these platforms. First, we created a unique, manually annotated dataset sourced from Reddit, where users share self-reported experiences of opioid use via different administration routes. Second, we detail our annotation process and guidelines while discussing the challenges of labeling the ONER-2025 dataset. Third, we analyze key linguistic challenges, including slang, ambiguity, fragmented
arXiv Detail & Related papers (2025-03-28T20:51:06Z)
myAURA: Personalized health library for epilepsy management via knowledge graph sparsification and visualization [4.25313339005458]
myAURA is an application designed to aid epilepsy patients, caregivers, and researchers in making decisions about care and self-management. MyAURA rests on the federation of heterogeneous data resources relevant to epilepsy, such as biomedical databases, social media, and electronic health records.
arXiv Detail & Related papers (2024-05-08T17:24:24Z)
"Hey..! This medicine made me sick": Sentiment Analysis of User-Generated Drug Reviews using Machine Learning Techniques [2.2874754079405535]
This project proposes a drug review classification system that classifies user reviews of a particular drug into classes such as positive, negative, and neutral. The collected data is manually labeled and verified to ensure that the labels are correct.
arXiv Detail & Related papers (2024-04-09T08:42:34Z)
Learning to Describe for Predicting Zero-shot Drug-Drug Interactions [54.172575323610175]
Adverse drug-drug interactions (DDIs) can compromise the effectiveness of concurrent drug administration. Traditional computational methods for DDI prediction may fail to capture interactions for new drugs due to the lack of knowledge. We propose TextDDI with a language model-based DDI predictor and a reinforcement learning (RL)-based information selector.
arXiv Detail & Related papers (2024-03-13T09:42:46Z)
An Annotated Dataset for Explainable Interpersonal Risk Factors of Mental Disturbance in Social Media Posts [0.0]
We construct and release a new annotated dataset with human-labelled explanations and classification of Interpersonal Risk Factors (IRF) affecting mental disturbance on social media. We establish baseline models on our dataset, facilitating future research directions to develop real-time personalized AI models by detecting patterns of TBe and PBu in the emotional spectrum of a user's historical social media profile.
arXiv Detail & Related papers (2023-05-30T04:08:40Z)
EBOCA: Evidences for BiOmedical Concepts Association Ontology [55.41644538483948]
This paper proposes EBOCA, an ontology that describes (i) biomedical domain concepts and associations between them, and (ii) evidences supporting these associations. Test data coming from a subset of DISNET and automatic association extractions from texts has been transformed to create a Knowledge Graph that can be used in real scenarios.
arXiv Detail & Related papers (2022-08-01T18:47:03Z)
DrugOOD: Out-of-Distribution (OOD) Dataset Curator and Benchmark for AI-aided Drug Discovery -- A Focus on Affinity Prediction Problems with Noise Annotations [90.27736364704108]
We present DrugOOD, a systematic OOD dataset curator and benchmark for AI-aided drug discovery. DrugOOD comes with an open-source Python package that fully automates benchmarking processes. We focus on one of the most crucial problems in AIDD: drug target binding affinity prediction.
arXiv Detail & Related papers (2022-01-24T12:32:48Z)
Data Augmentation for Mental Health Classification on Social Media [0.0]
The mental disorder of online users is determined using social media posts. The major challenge in this domain is obtaining ethical clearance for using user-generated text on social media platforms. We have studied the effect of data augmentation techniques on domain-specific user-generated text for mental health classification.
arXiv Detail & Related papers (2021-12-19T05:09:01Z)
Heterogeneous Treatment Effect Estimation using machine learning for Healthcare application: tutorial and benchmark [8.869515663374248]
Many studies have shown that drug effects are heterogeneous among the population. Many advanced machine learning models for estimating heterogeneous treatment effects (HTE) have emerged in recent years. We aim to introduce the HTE methodology to the healthcare area and provide feasibility considerations.
arXiv Detail & Related papers (2021-09-27T02:34:44Z)
Two-Faced Humans on Twitter and Facebook: Harvesting Social Multimedia for Human Personality Profiling [74.83957286553924]
We infer the Myers-Briggs Personality Type indicators by applying a novel multi-view fusion framework, called "PERS". Our experimental results demonstrate PERS's ability to learn from multi-view data for personality profiling by efficiently leveraging the significantly different data arriving from diverse social multimedia sources.
arXiv Detail & Related papers (2021-06-20T10:48:49Z)
Assessing the Severity of Health States based on Social Media Posts [62.52087340582502]
We propose a multiview learning framework that models both the textual content and contextual information to assess the severity of the user's health state. The diverse NLU views demonstrate its effectiveness on both tasks as well as on individual diseases when assessing a user's health.
arXiv Detail & Related papers (2020-09-21T03:45:14Z)
I Know Where You Are Coming From: On the Impact of Social Media Sources on AI Model Performance [79.05613148641018]
We study the performance of different machine learning models when trained on multi-modal data from different social networks. Our initial experimental results reveal that social network choice impacts the performance.
arXiv Detail & Related papers (2020-02-05T11:10:44Z)
