A PubMedBERT-based Classifier with Data Augmentation Strategy for
Detecting Medication Mentions in Tweets
- URL: http://arxiv.org/abs/2112.02998v1
- Date: Wed, 3 Nov 2021 14:29:24 GMT
- Title: A PubMedBERT-based Classifier with Data Augmentation Strategy for
Detecting Medication Mentions in Tweets
- Authors: Qing Han, Shubo Tian, Jinfeng Zhang
- Abstract summary: Twitter publishes a large volume of user-generated text (tweets) on a daily basis.
Named entity recognition (NER) presents some special challenges for tweet data.
In this paper, we explore a PubMedBERT-based classifier trained with a combination of multiple data augmentation approaches.
Our method achieved an F1 score of 0.762, substantially higher than the mean of all submissions (0.696).
- Score: 2.539568419434224
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: As a major social media platform, Twitter publishes a large volume of
user-generated text (tweets) on a daily basis. Mining such data can address
important social, public health, and emergency management issues that are
infeasible to address through other means. An essential step in many text mining
pipelines is named entity recognition (NER), which presents some special
challenges for tweet data. Among them are nonstandard expressions, extremely
imbalanced classes, and a lack of context information. Track 3 of the
BioCreative VII challenge (BC7) was organized to evaluate methods for detecting
medication mentions in tweets. In this paper, we report our work on BC7 track
3, where we explored a PubMedBERT-based classifier trained with a combination
of multiple data augmentation approaches. Our method achieved an F1 score of
0.762, which is substantially higher than the mean of all submissions (0.696).
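The abstract does not detail which augmentation techniques were combined. As an illustration only (not the authors' actual method), two common text-augmentation operations for short, noisy tweet text, random token deletion and random token swap, could be sketched as follows; the helper names and parameters are hypothetical:

```python
import random

def random_deletion(tokens, p=0.1, rng=None):
    """Drop each token independently with probability p, keeping at least one."""
    rng = rng or random.Random(0)
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else [rng.choice(tokens)]

def random_swap(tokens, n_swaps=1, rng=None):
    """Swap n_swaps randomly chosen pairs of token positions."""
    rng = rng or random.Random(0)
    tokens = list(tokens)
    for _ in range(n_swaps):
        if len(tokens) < 2:
            break
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

# Each original tweet yields extra noisy training variants,
# which can help with the extreme class imbalance noted above.
tweet = "took two advil for my headache this morning".split()
augmented = [random_deletion(tweet, p=0.2), random_swap(tweet, n_swaps=2)]
```

In practice such surface-level perturbations would be one ingredient among several; label-preserving augmentation is particularly delicate for mention detection, since deleting or moving the medication token itself changes the label.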
Related papers
- ThangDLU at #SMM4H 2024: Encoder-decoder models for classifying text data on social disorders in children and adolescents [49.00494558898933]
This paper describes our participation in Task 3 and Task 5 of the #SMM4H (Social Media Mining for Health) 2024 Workshop.
Task 3 is a multi-class classification task centered on tweets discussing the impact of outdoor environments on symptoms of social anxiety.
Task 5 involves a binary classification task focusing on tweets reporting medical disorders in children.
We applied transfer learning from pre-trained encoder-decoder models such as BART-base and T5-small to identify the labels of a set of given tweets.
arXiv Detail & Related papers (2024-04-30T17:06:20Z) - Unsupervised Sentiment Analysis of Plastic Surgery Social Media Posts [91.3755431537592]
The massive collection of user posts across social media platforms is primarily untapped for artificial intelligence (AI) use cases.
Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding.
This study demonstrates that the applied results of unsupervised analysis allow a computer to predict either negative, positive, or neutral user sentiment towards plastic surgery.
arXiv Detail & Related papers (2023-07-05T20:16:20Z) - ManiTweet: A New Benchmark for Identifying Manipulation of News on Social Media [74.93847489218008]
We present a novel task, identifying manipulation of news on social media, which aims to detect manipulation in social media posts and identify manipulated or inserted information.
To study this task, we have proposed a data collection schema and curated a dataset called ManiTweet, consisting of 3.6K pairs of tweets and corresponding articles.
Our analysis demonstrates that this task is highly challenging, with large language models (LLMs) yielding unsatisfactory performance.
arXiv Detail & Related papers (2023-05-23T16:40:07Z) - ViralBERT: A User Focused BERT-Based Approach to Virality Prediction [11.992815669875924]
We propose ViralBERT, which can be used to predict the virality of tweets using content- and user-based features.
We employ a method of concatenating numerical features such as hashtags and follower numbers to tweet text, and utilise two BERT modules.
We collect a dataset of 330k tweets to train ViralBERT and validate the efficacy of our model using baselines from current studies in this field.
arXiv Detail & Related papers (2022-05-17T21:40:24Z) - Twitter-COMMs: Detecting Climate, COVID, and Military Multimodal
Misinformation [83.2079454464572]
This paper describes our approach to the Image-Text Inconsistency Detection challenge of the DARPA Semantic Forensics (SemaFor) Program.
We collect Twitter-COMMs, a large-scale multimodal dataset with 884k tweets relevant to the topics of Climate Change, COVID-19, and Military Vehicles.
We train our approach, based on the state-of-the-art CLIP model, leveraging automatically generated random and hard negatives.
arXiv Detail & Related papers (2021-12-16T03:37:20Z) - Automatic Extraction of Medication Names in Tweets as Named Entity
Recognition [3.7462395049372894]
Biocreative VII Task 3 focuses on mining this information by recognizing mentions of medications and dietary supplements in tweets.
We approach this task by fine-tuning multiple BERT-style language models to perform token-level classification.
Our best system consists of five Megatron-BERT-345M models and achieves a strict F1 score of 0.764 on unseen test data.
arXiv Detail & Related papers (2021-11-30T18:25:32Z) - Extraction of Medication Names from Twitter Using Augmentation and an
Ensemble of Language Models [55.44979919361194]
The BioCreative VII Track 3 challenge focused on the identification of medication names in Twitter user timelines.
For our submission to this challenge, we expanded the available training data by using several data augmentation techniques.
The augmented data was then used to fine-tune an ensemble of language models that had been pre-trained on general-domain Twitter content.
arXiv Detail & Related papers (2021-11-12T11:18:46Z) - I-AID: Identifying Actionable Information from Disaster-related Tweets [0.0]
Social media plays a significant role in disaster management by providing valuable data about affected people, donations and help requests.
We propose I-AID, a multimodel approach to automatically categorize tweets into multi-label information types.
Our results indicate that I-AID outperforms state-of-the-art approaches in terms of weighted average F1 score by +6% and +4% on the TREC-IS dataset and COVID-19 Tweets, respectively.
arXiv Detail & Related papers (2020-08-04T19:07:50Z) - Students Need More Attention: BERT-based Attention Model for Small Data
with Application to Automatic Patient Message Triage [65.7062363323781]
We propose a novel framework based on BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining).
We (i) introduce Label Embeddings for Self-Attention in each layer of BERT, which we call LESA-BERT, and (ii) distill LESA-BERT to smaller variants to reduce overfitting and model size when working on small datasets.
As an application, our framework is utilized to build a model for patient portal message triage that classifies the urgency of a message into three categories: non-urgent, medium and urgent.
arXiv Detail & Related papers (2020-06-22T03:39:00Z) - Utilizing Deep Learning to Identify Drug Use on Twitter Data [0.0]
The classification power of multiple methods was compared including support vector machines (SVM), XGBoost, and convolutional neural network (CNN) based classifiers.
The accuracy scores were 76.35% and 82.31%, with AUCs of 0.90 and 0.91.
The synthetically generated set provided increased scores, improving the classification capability and proving the worth of this methodology.
arXiv Detail & Related papers (2020-03-08T07:52:40Z)
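Several of the systems above report a strict F1 score for mention extraction (e.g. 0.762 and 0.764 on the BC7 track 3 test data). As a generic sketch, not any particular team's evaluation code, strict span-level F1 counts a prediction as correct only when its (start, end, type) triple exactly matches a gold annotation:

```python
def strict_f1(gold_spans, pred_spans):
    """Strict span-level F1: a predicted span counts as a true positive
    only if its (start, end, type) triple exactly matches a gold span."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: two gold medication mentions; one prediction matches exactly,
# the other has an off-by-one boundary and counts as a miss under strict matching.
gold = [(0, 5, "DRUG"), (10, 16, "DRUG")]
pred = [(0, 5, "DRUG"), (10, 15, "DRUG")]
score = strict_f1(gold, pred)  # tp=1, precision=0.5, recall=0.5 -> F1=0.5
```

This is what makes strict evaluation demanding on tweets: nonstandard spellings and truncations make exact boundary agreement hard even when the mention is found.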
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.