EPIC30M: An Epidemics Corpus Of Over 30 Million Relevant Tweets
- URL: http://arxiv.org/abs/2006.08369v2
- Date: Mon, 22 Jun 2020 17:08:45 GMT
- Title: EPIC30M: An Epidemics Corpus Of Over 30 Million Relevant Tweets
- Authors: Junhua Liu, Trisha Singhal, Lucienne T.M. Blessing, Kristin L. Wood and Kwan Hui Lim
- Abstract summary: EPIC30M is a large-scale epidemic corpus that contains 30 million micro-blog posts crawled from Twitter.
EPIC30M contains a subset of 26.2 million tweets related to three general diseases, namely Ebola, Cholera and Swine Flu, and another subset of 4.7 million tweets on six global epidemic outbreaks: 2009 H1N1 Swine Flu, 2010 Haiti Cholera, 2012 Middle-East Respiratory Syndrome (MERS), 2013 West African Ebola, 2016 Yemen Cholera and 2018 Kivu Ebola.
- Score: 2.7718973516070684
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since the start of COVID-19, several relevant corpora from various sources
have been presented in the literature that contain millions of data points. While
these corpora are valuable in supporting many analyses of this specific
pandemic, researchers require additional benchmark corpora that cover other
epidemics to facilitate cross-epidemic pattern recognition and trend analysis
tasks. During our other efforts on COVID-19-related work, we found very
few disease-related corpora in the literature that are sizable and rich
enough to support such cross-epidemic analysis tasks. In this paper, we present
EPIC30M, a large-scale epidemic corpus that contains 30 million micro-blog
posts, i.e., tweets crawled from Twitter, from 2006 to 2020. EPIC30M
contains a subset of 26.2 million tweets related to three general diseases,
namely Ebola, Cholera and Swine Flu, and another subset of 4.7 million tweets
on six global epidemic outbreaks: 2009 H1N1 Swine Flu, 2010 Haiti
Cholera, 2012 Middle-East Respiratory Syndrome (MERS), 2013 West African Ebola,
2016 Yemen Cholera and 2018 Kivu Ebola. Furthermore, we explore and discuss the
properties of the corpus with statistics of key terms and hashtags and trend
analysis for each subset. Finally, we demonstrate the value and impact that
EPIC30M could create through a discussion of multiple use cases of
cross-epidemic research topics that have attracted growing interest in recent years.
These use cases span multiple research areas, such as epidemiological modeling,
pattern recognition, natural language understanding and economic modeling.
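The abstract mentions hashtag statistics and trend analysis for each subset. The following is a minimal sketch of what such statistics could look like, not the authors' pipeline: the function names, the `(date, text)` record layout, and the sample tweets are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Assumption: a hashtag is '#' followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def hashtag_counts(tweets):
    """Count hashtag occurrences across tweet texts, ignoring case."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts


def yearly_volume(records):
    """Count tweets per year from (iso_date, text) pairs."""
    return Counter(date[:4] for date, _ in records)


# Hypothetical sample records; the real corpus format may differ.
sample = [
    ("2014-08-01", "Stay safe #Ebola #health"),
    ("2014-09-12", "New #ebola cases reported"),
    ("2016-10-20", "#Cholera outbreak update"),
]

counts = hashtag_counts(text for _, text in sample)
volume = yearly_volume(sample)
print(counts.most_common(3))
print(sorted(volume.items()))
```

Lowercasing the tags merges variants such as `#Ebola` and `#ebola`, which is one plausible normalization choice; a real analysis might keep the original casing.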
Related papers
- SPEED++: A Multilingual Event Extraction Framework for Epidemic Prediction and Preparedness [73.73883111570458]
We introduce the first multilingual Event Extraction framework for extracting epidemic event information for a wide range of diseases and languages.
Annotating data in every language is infeasible; thus we develop zero-shot cross-lingual cross-disease models.
Our framework can provide epidemic warnings for COVID-19 in its earliest stages in Dec 2019 from Chinese Weibo posts without any training in Chinese.
arXiv Detail & Related papers (2024-10-24T03:03:54Z)
- Sentiment Analysis and Text Analysis of the Public Discourse on Twitter about COVID-19 and MPox [0.0]
The recent outbreaks of COVID-19 and MPox have served as catalysts for Twitter usage related to seeking and sharing information, views, opinions, and sentiments involving both of these viruses.
None of the prior works in this field analyzed tweets focusing on both COVID-19 and MPox simultaneously.
To address this research gap, a total of 61,862 tweets that focused on MPox and COVID-19 simultaneously, posted between 7 May 2022 and 3 March 2023, were studied.
arXiv Detail & Related papers (2023-12-17T01:50:27Z)
- COVID-19 Vaccine Misinformation in Middle Income Countries [5.891662430960944]
This paper introduces a multilingual dataset of COVID-19 vaccine misinformation, consisting of annotated tweets from three middle-income countries: Brazil, Indonesia, and Nigeria.
The dataset includes annotations for 5,952 tweets, assessing their relevance to COVID-19 vaccines, presence of misinformation, and the themes of the misinformation.
arXiv Detail & Related papers (2023-11-30T02:27:34Z)
- Human Behavior in the Time of COVID-19: Learning from Big Data [71.26355067309193]
Since March 2020, there have been over 600 million confirmed cases of COVID-19 and more than six million deaths.
The pandemic has impacted and even changed human behavior in almost every aspect.
Researchers have been employing big data techniques such as natural language processing, computer vision, audio signal processing, frequent pattern mining, and machine learning.
arXiv Detail & Related papers (2023-03-23T17:19:26Z)
- Understanding COVID-19 News Coverage using Medical NLP [5.161531917413708]
The dataset includes more than 36,000 articles, analyzed using the clinical and biomedical Natural Language Processing (NLP) models from the Spark NLP for Healthcare library.
The analysis covers key entities and phrases, observed biases, and change over time in news coverage.
A further analysis covers extracted Adverse Drug Events involving drug and vaccine manufacturers, which affect vaccine hesitancy when reported by major news outlets.
arXiv Detail & Related papers (2022-03-19T15:07:46Z)
- COVIDx-US -- An open-access benchmark dataset of ultrasound imaging data for AI-driven COVID-19 analytics [116.6248556979572]
COVIDx-US is an open-access benchmark dataset of COVID-19 related ultrasound imaging data.
It consists of 93 lung ultrasound videos and 10,774 processed images of patients infected with SARS-CoV-2 pneumonia, non-SARS-CoV-2 pneumonia, as well as healthy control cases.
arXiv Detail & Related papers (2021-03-18T03:31:33Z)
- Understanding the temporal evolution of COVID-19 research through machine learning and natural language processing [66.63200823918429]
The outbreak of the novel coronavirus disease 2019 (COVID-19), caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has been continuously affecting human lives and communities around the world.
We used multiple data sources, i.e., PubMed and ArXiv, and built several machine learning models to characterize the landscape of current COVID-19 research.
Our findings confirm the types of research available in PubMed and ArXiv differ significantly, with the former exhibiting greater diversity in terms of COVID-19 related issues.
arXiv Detail & Related papers (2020-07-22T18:02:39Z)
- Pandemic Pulse: Unraveling and Modeling Social Signals during the COVID-19 Pandemic [12.050597862123313]
We present and begin to explore a collection of social data that represents part of the COVID-19 pandemic's effects on the United States.
This data is collected from a range of sources and includes longitudinal trends of news topics, social distancing behaviors, community mobility changes, web searches, and more.
arXiv Detail & Related papers (2020-06-10T17:55:44Z)
- Cross-lingual Transfer Learning for COVID-19 Outbreak Alignment [90.12602012910465]
We train on Italy's early COVID-19 outbreak through Twitter and transfer to several other countries.
Our experiments show strong results with up to 0.85 Spearman correlation in cross-country predictions.
arXiv Detail & Related papers (2020-06-05T02:04:25Z)
- Mapping the Landscape of Artificial Intelligence Applications against COVID-19 [59.30734371401316]
COVID-19, the disease caused by the SARS-CoV-2 virus, has been declared a pandemic by the World Health Organization.
We present an overview of recent studies using Machine Learning and, more broadly, Artificial Intelligence to tackle many aspects of the COVID-19 crisis.
arXiv Detail & Related papers (2020-03-25T12:30:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.