UCE-FID: Using Large Unlabeled, Medium Crowdsourced-Labeled, and Small
Expert-Labeled Tweets for Foodborne Illness Detection
- URL: http://arxiv.org/abs/2312.01225v1
- Date: Sat, 2 Dec 2023 21:03:23 GMT
- Authors: Ruofan Hu, Dongyu Zhang, Dandan Tao, Huayi Zhang, Hao Feng, and Elke
Rundensteiner
- Abstract summary: We propose EGAL, a deep learning framework for foodborne illness detection.
EGAL uses a small set of expert-labeled tweets augmented by crowdsourced-labeled and massive unlabeled data.
EGAL has the potential to be deployed for real-time analysis of tweet streams, contributing to foodborne illness outbreak surveillance efforts.
- Score: 8.934980946374367
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Foodborne illnesses significantly impact public health. Deep learning
surveillance applications using social media data aim to detect early warning
signals. However, labeling foodborne illness-related tweets for model training
requires extensive human resources, making it challenging to collect a
sufficient number of high-quality labels for tweets within a limited budget.
The severe class imbalance resulting from the scarcity of foodborne
illness-related tweets among the vast volume of social media further
exacerbates the problem. Classifiers trained on a class-imbalanced dataset are
biased towards the majority class, making accurate detection difficult. To
overcome these challenges, we propose EGAL, a deep learning framework for
foodborne illness detection that uses a small set of expert-labeled tweets
augmented by crowdsourced-labeled and massive unlabeled data. Specifically, by
leveraging tweets labeled by experts as a reward set, EGAL learns to assign a
weight of zero to incorrectly labeled tweets to mitigate their negative
influence. Other tweets receive proportionate weights to counterbalance the
imbalanced class distribution. Extensive experiments on real-world TWEET-FID
data show
that EGAL outperforms strong baseline models across different settings,
including varying expert-labeled set sizes and class imbalance ratios. A case
study on a multistate outbreak of Salmonella Typhimurium infection linked to
packaged salad greens demonstrates how the trained model captures relevant
tweets offering valuable outbreak insights. EGAL, funded by the U.S. Department
of Agriculture (USDA), has the potential to be deployed for real-time analysis
of tweet streams, contributing to foodborne illness outbreak surveillance
efforts.
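The weighting idea in the abstract (zero weight for mislabeled tweets, class-balancing weights for the rest) can be illustrated with a minimal sketch. This is not the authors' implementation: EGAL learns its weights using the expert-labeled reward set, whereas here the `is_trusted` mask is simply given as input, and inverse class-frequency weighting is a common stand-in for the "proportionate weights" the abstract mentions. All function and variable names are hypothetical.

```python
import math
from collections import Counter


def sample_weights(labels, is_trusted):
    """Return one weight per sample.

    Samples flagged as untrusted (assumed mislabeled) get weight 0.
    Trusted samples get an inverse class-frequency weight, so the
    minority class contributes as much to the loss as the majority.
    """
    kept = [y for y, ok in zip(labels, is_trusted) if ok]
    counts = Counter(kept)
    n_classes = len(counts)
    total = len(kept)
    return [
        total / (n_classes * counts[y]) if ok else 0.0
        for y, ok in zip(labels, is_trusted)
    ]


def weighted_cross_entropy(probs, labels, weights):
    """Weighted binary cross-entropy.

    probs holds the predicted probability of the positive class (label 1).
    """
    eps = 1e-12
    total_loss = 0.0
    for p, y, w in zip(probs, labels, weights):
        p_true = p if y == 1 else 1.0 - p
        total_loss += -w * math.log(max(p_true, eps))
    weight_sum = sum(weights)
    return total_loss / weight_sum if weight_sum > 0 else 0.0


# Toy data: four "not relevant" tweets (0), two "relevant" tweets (1).
# The fourth tweet is assumed to carry an incorrect crowdsourced label.
labels = [0, 0, 0, 0, 1, 1]
is_trusted = [True, True, True, False, True, True]
probs = [0.1, 0.2, 0.3, 0.9, 0.8, 0.6]

weights = sample_weights(labels, is_trusted)
loss = weighted_cross_entropy(probs, labels, weights)
```

In this toy example the untrusted tweet drops out of the loss entirely, and each minority-class tweet is weighted more heavily than each majority-class tweet, so both classes carry equal total weight.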
Related papers
- Epidemiology-informed Network for Robust Rumor Detection [59.89351792706995]
We propose a novel Epidemiology-informed Network (EIN) that integrates epidemiological knowledge to enhance performance.
To adapt epidemiology theory to rumor detection, it is expected that each user's stance toward the source information will be annotated.
Our experimental results demonstrate that the proposed EIN not only outperforms state-of-the-art methods on real-world datasets but also exhibits enhanced robustness across varying tree depths.
arXiv Detail & Related papers (2024-11-20T00:43:32Z)
- CrisisMatch: Semi-Supervised Few-Shot Learning for Fine-Grained Disaster
Tweet Classification [51.58605842457186]
We present a fine-grained disaster tweet classification model under the semi-supervised, few-shot learning setting.
Our model, CrisisMatch, effectively classifies tweets into fine-grained classes of interest using few labeled data and large amounts of unlabeled data.
arXiv Detail & Related papers (2023-10-23T07:01:09Z)
- Named Entity Recognition for Monitoring Plant Health Threats in Tweets:
a ChouBERT Approach [0.0]
ChouBERT is a pre-trained language model that can identify Tweets concerning observations of plant health issues with generalizability on unseen natural hazards.
This paper tackles the lack of labelled data by further studying ChouBERT's know-how on token-level annotation tasks over small labeled sets.
arXiv Detail & Related papers (2023-10-19T06:54:55Z)
- A Novel Site-Agnostic Multimodal Deep Learning Model to Identify
Pro-Eating Disorder Content on Social Media [0.0]
This study aimed to create a multimodal deep learning model that can determine if a social media post promotes eating disorders.
A labeled dataset of Tweets was collected from Twitter, recently rebranded as X, upon which twelve deep learning models were trained and evaluated.
The RoBERTa and MaxViT fusion model, deployed to classify an unlabeled dataset of posts from the social media sites Tumblr and Reddit, generated results akin to those of previous research studies.
arXiv Detail & Related papers (2023-07-06T16:04:46Z)
- Exploring Model Dynamics for Accumulative Poisoning Discovery [62.08553134316483]
We propose a novel information measure, namely, Memorization Discrepancy, to explore the defense via the model-level information.
By implicitly transferring the changes in the data manipulation to that in the model outputs, Memorization Discrepancy can discover the imperceptible poison samples.
We thoroughly explore its properties and propose Discrepancy-aware Sample Correction (DSC) to defend against accumulative poisoning attacks.
arXiv Detail & Related papers (2023-06-06T14:45:24Z)
- RevealED: Uncovering Pro-Eating Disorder Content on Twitter Using Deep
Learning [0.0]
This study aimed to create a deep learning model capable of determining whether a social media post promotes eating disorders based solely on image data.
Several deep-learning models were trained on the scraped dataset and were evaluated based on their accuracy, F1 score, precision, and recall.
The model, which was applied to unlabeled Twitter image data scraped from "#selfie", uncovered seasonal fluctuations in the relative abundance of pro-eating disorder content.
arXiv Detail & Related papers (2022-12-28T16:50:49Z)
- Attend Who is Weak: Pruning-assisted Medical Image Localization under
Sophisticated and Implicit Imbalances [102.68466217374655]
Deep neural networks (DNNs) have rapidly become a de facto choice for medical image understanding tasks.
In this paper, we propose to use pruning to automatically and adaptively identify hard-to-learn (HTL) training samples.
We also present an interesting demographic analysis which illustrates HTLs' ability to capture complex demographic imbalances.
arXiv Detail & Related papers (2022-12-06T00:32:03Z)
- TWEET-FID: An Annotated Dataset for Multiple Foodborne Illness Detection
Tasks [14.523433519237607]
Foodborne illness is a serious but preventable public health problem.
There is a dearth of labeled datasets for developing effective outbreak detection models.
We present TWEET-FID, the first publicly available annotated dataset for foodborne illness incident detection tasks.
arXiv Detail & Related papers (2022-05-22T03:47:18Z)
- Robust Deep Semi-Supervised Learning: A Brief Introduction [63.09703308309176]
Semi-supervised learning (SSL) aims to improve learning performance by leveraging unlabeled data when labels are insufficient.
SSL with deep models has proven to be successful on standard benchmark tasks.
However, they are still vulnerable to various robustness threats in real-world applications.
arXiv Detail & Related papers (2022-02-12T04:16:41Z)
- Combining exogenous and endogenous signals with a semi-supervised
co-attention network for early detection of COVID-19 fake tweets [14.771202995527315]
During COVID-19, tweets with misinformation should be flagged and neutralized in their early stages to mitigate the damages.
Most existing methods for early detection of fake news assume that enough propagation information is available for a large set of labeled tweets.
We present ENDEMIC, a novel early detection model which leverages exogenous and endogenous signals related to tweets.
arXiv Detail & Related papers (2021-04-12T10:01:44Z)
- Leveraging Multi-Source Weak Social Supervision for Early Detection of
Fake News [67.53424807783414]
Social media has greatly enabled people to participate in online activities at an unprecedented rate.
This unrestricted access also exacerbates the spread of misinformation and fake news online, which might cause confusion and chaos unless detected early and mitigated.
We jointly leverage the limited amount of clean data along with weak signals from social engagements to train deep neural networks in a meta-learning framework to estimate the quality of different weak instances.
Experiments on real-world datasets demonstrate that the proposed framework outperforms state-of-the-art baselines for early detection of fake news without using any user engagements at prediction time.
arXiv Detail & Related papers (2020-04-03T18:26:33Z)