RedHOT: A Corpus of Annotated Medical Questions, Experiences, and Claims
on Social Media
- URL: http://arxiv.org/abs/2210.06331v1
- Date: Wed, 12 Oct 2022 15:50:32 GMT
- Title: RedHOT: A Corpus of Annotated Medical Questions, Experiences, and Claims
on Social Media
- Authors: Somin Wadhwa, Vivek Khetan, Silvio Amir, Byron Wallace
- Abstract summary: We present Reddit Health Online Talk (RedHOT), a corpus of 22,000 richly annotated social media posts from Reddit spanning 24 health conditions.
We mark snippets that describe patient Populations, Interventions, and Outcomes (PIO elements) within these claims.
We propose a new method to automatically derive (noisy) supervision for this task which we use to train a dense retrieval model.
- Score: 1.5293427903448022
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Reddit Health Online Talk (RedHOT), a corpus of 22,000 richly
annotated social media posts from Reddit spanning 24 health conditions.
Annotations include demarcations of spans corresponding to medical claims,
personal experiences, and questions. We collect additional granular annotations
on identified claims. Specifically, we mark snippets that describe patient
Populations, Interventions, and Outcomes (PIO elements) within these. Using
this corpus, we introduce the task of retrieving trustworthy evidence relevant
to a given claim made on social media. We propose a new method to automatically
derive (noisy) supervision for this task which we use to train a dense
retrieval model; this outperforms baseline models. Manual evaluation of
retrieval results performed by medical doctors indicates that while our system
performance is promising, there is considerable room for improvement. Collected
annotations (and scripts to assemble the dataset) are available at
https://github.com/sominw/redhot.
Related papers
- FedIA: Federated Medical Image Segmentation with Heterogeneous Annotation Completeness [30.780654470392125]
Federated learning has emerged as a compelling paradigm for medical image segmentation.
This paper highlights a prevalent challenge in medical practice: incomplete annotations.
We introduce a novel solution, named FedIA, to tackle this issue.
arXiv Detail & Related papers (2024-07-02T14:08:55Z)
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences and the result reveals that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- Streamlining Social Media Information Retrieval for COVID-19 Research with Deep Learning [19.675191059975877]
Social media-based public health research is crucial for epidemic surveillance.
Most studies identify relevant corpora with keyword-matching.
This study develops a system to streamline the process of curating colloquial medical dictionaries.
arXiv Detail & Related papers (2023-06-28T08:20:35Z)
- Semantic Similarity Models for Depression Severity Estimation [53.72188878602294]
This paper presents an efficient semantic pipeline to study depression severity in individuals based on their social media writings.
We use test user sentences for producing semantic rankings over an index of representative training sentences corresponding to depressive symptoms and severity levels.
We evaluate our methods on two Reddit-based benchmarks, achieving 30% improvement over state of the art in terms of measuring depression severity.
arXiv Detail & Related papers (2022-11-14T18:47:26Z)
- MedJEx: A Medical Jargon Extraction Model with Wiki's Hyperlink Span and Contextualized Masked Language Model Score [6.208127495081593]
We present a novel and publicly available dataset with expert-annotated medical jargon terms from 18K+ EHR note sentences.
We then introduce a novel medical jargon extraction ($MedJEx$) model which has been shown to outperform existing state-of-the-art NLP models.
arXiv Detail & Related papers (2022-10-12T02:27:32Z)
- Medical Question Understanding and Answering with Knowledge Grounding and Semantic Self-Supervision [53.692793122749414]
We introduce a medical question understanding and answering system with knowledge grounding and semantic self-supervision.
Our system is a pipeline that first summarizes a long, medical, user-written question, using a supervised summarization loss.
The system then matches the summarized user question with an FAQ from a trusted medical knowledge base and retrieves a fixed number of relevant sentences from the corresponding answer document.
arXiv Detail & Related papers (2022-09-30T08:20:32Z)
- Text Mining to Identify and Extract Novel Disease Treatments From Unstructured Datasets [56.38623317907416]
We use Google Cloud to transcribe podcast episodes of an NPR radio show.
We then build a pipeline for systematically pre-processing the text.
Our model successfully identified that Omeprazole can help treat heartburn.
arXiv Detail & Related papers (2020-10-22T19:52:49Z)
- COMETA: A Corpus for Medical Entity Linking in the Social Media [27.13349965075764]
We introduce a new corpus called COMETA, consisting of 20k English biomedical entity mentions from Reddit expert-annotated with links to SNOMED CT.
Our corpus satisfies a combination of desirable properties, from scale and coverage to diversity and quality.
We shed light on the ability of these systems to perform complex inference on entities and concepts under 2 challenging evaluation scenarios.
arXiv Detail & Related papers (2020-10-07T09:16:45Z)
- BiteNet: Bidirectional Temporal Encoder Network to Predict Medical Outcomes [53.163089893876645]
We propose a novel self-attention mechanism that captures the contextual dependency and temporal relationships within a patient's healthcare journey.
An end-to-end bidirectional temporal encoder network (BiteNet) then learns representations of the patient's journeys.
We have evaluated the effectiveness of our methods on two supervised prediction and two unsupervised clustering tasks with a real-world EHR dataset.
arXiv Detail & Related papers (2020-09-24T00:42:36Z)
- Extracting Structured Data from Physician-Patient Conversations By Predicting Noteworthy Utterances [39.888619005843246]
We describe a new dataset consisting of conversation transcripts, post-visit summaries, corresponding supporting evidence (in the transcript), and structured labels.
One methodological challenge is that the conversations are long (around 1500 words) making it difficult for modern deep-learning models to use them as input.
We find that by first filtering for (predicted) noteworthy utterances, we can significantly boost predictive performance for recognizing both diagnoses and RoS abnormalities.
arXiv Detail & Related papers (2020-07-14T16:10:37Z)
- Learning Contextualized Document Representations for Healthcare Answer Retrieval [68.02029435111193]
Contextual Discourse Vectors (CDV) is a distributed document representation for efficient answer retrieval from long documents.
Our model leverages a dual encoder architecture with hierarchical LSTM layers and multi-task training to encode the position of clinical entities and aspects alongside the document discourse.
We show that our generalized model significantly outperforms several state-of-the-art baselines for healthcare passage ranking.
arXiv Detail & Related papers (2020-02-03T15:47:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.