COVID-19 Literature Mining and Retrieval using Text Mining Approaches
- URL: http://arxiv.org/abs/2205.14781v1
- Date: Sun, 29 May 2022 22:34:19 GMT
- Title: COVID-19 Literature Mining and Retrieval using Text Mining Approaches
- Authors: Sanku Satya Uday, Satti Thanuja Pavani, T. Jaya Lakshmi, Rohit Chivukula
- Abstract summary: The novel coronavirus disease (COVID-19) began in Wuhan, China, in late 2019 and to date has infected over 148M people worldwide.
Many academics and researchers began publishing papers describing the latest discoveries on COVID-19.
The proposed model attempts to extract relevant titles from the large corpus of research publications.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The novel coronavirus disease (COVID-19) began in Wuhan, China, in late 2019
and to date has infected over 148M people worldwide, resulting in 3.12M deaths.
On March 11, 2020, the World Health Organization (WHO) declared it a global
pandemic. Many academics and researchers began publishing papers describing the
latest discoveries on COVID-19. The large influx of publications made it hard
for other researchers to sift through the literature and find the papers
relevant to their own work. The proposed model therefore attempts to extract
relevant titles from the large corpus of research publications, which eases
this task for researchers. The Allen Institute for AI released the CORD-19
dataset, which consists of 200,000 journal articles on coronavirus-related
research drawn from PubMed's PMC, the WHO, bioRxiv, and medRxiv pre-prints.
Along with this document corpus, they also provided a topics dataset named
topics-rnd3 consisting of a list of topics. Each topic has three
representations: query, question, and narrative. These datasets are open for
research, and they also released a TREC-COVID competition on Kaggle. Using
these topics as queries, our goal is to find the relevant documents in the
CORD-19 dataset for each topic posed in topics-rnd3. The proposed model uses
Natural Language Processing (NLP) techniques, namely Bag-of-Words, average
Word2Vec, average BERT Base, and TF-IDF weighted Word2Vec, to construct vectors
for the query, question, narrative, and combinations of them. Vectors are
constructed in the same way for the titles in the CORD-19 dataset. Cosine
similarity is then used to measure the similarity between each topic vector
and each title vector, which helps us find the relevant documents for the
given topic.
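The retrieval step described in the abstract can be sketched as follows. This is a minimal illustration of the Bag-of-Words variant only, not the authors' implementation: the tokenizer, the function names (`tokenize`, `bow_vector`, `cosine_similarity`, `rank_titles`), and the sample titles are all assumptions made for this example, and the Word2Vec and BERT variants would replace `bow_vector` with an averaged embedding.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def bow_vector(text):
    """Build a sparse Bag-of-Words vector: token -> raw count."""
    return Counter(tokenize(text))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[tok] * b[tok] for tok in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def rank_titles(topic_text, titles, top_k=3):
    """Return the top_k (score, title) pairs, most similar first."""
    topic_vec = bow_vector(topic_text)
    scored = [(cosine_similarity(topic_vec, bow_vector(t)), t) for t in titles]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Hypothetical topic in the query/question style of topics-rnd3.
    query = "coronavirus origin"
    question = "what is the origin of COVID-19"
    # Hypothetical titles standing in for CORD-19 entries.
    titles = [
        "Deep learning for chest CT segmentation",
        "Origin and evolution of the novel coronavirus",
        "Vaccine efficacy in older adults",
    ]
    # Combining representations here means concatenating their text.
    for score, title in rank_titles(query + " " + question, titles):
        print(f"{score:.3f}  {title}")
```

Raw counts give common words such as "the" and "of" as much weight as content words, which is one reason a TF-IDF weighting of the kind the abstract mentions may be preferable in practice.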
Related papers
- Constructing the CORD-19 Vaccine Dataset [1.986689544042807]
We introduce a new dataset, 'CORD-19-Vaccination', to cater to scientists specifically looking into COVID-19 vaccine-related research.
This dataset is extracted from the CORD-19 dataset and augmented with new columns for language detail, author demography, keywords, and topic per paper.
arXiv Detail & Related papers (2024-07-26T02:44:55Z)
- RadGenome-Chest CT: A Grounded Vision-Language Dataset for Chest CT Analysis [56.57177181778517]
RadGenome-Chest CT is a large-scale, region-guided 3D chest CT interpretation dataset based on CT-RATE.
We leverage the latest powerful universal segmentation and large language models to extend the original datasets.
arXiv Detail & Related papers (2024-04-25T17:11:37Z)
- PMC-LLaMA: Towards Building Open-source Language Models for Medicine [62.39105735933138]
Large Language Models (LLMs) have showcased remarkable capabilities in natural language understanding.
LLMs struggle in domains that require precision, such as medical applications, due to their lack of domain-specific knowledge.
We describe the procedure for building a powerful, open-source language model specifically designed for medical applications, termed PMC-LLaMA.
arXiv Detail & Related papers (2023-04-27T18:29:05Z)
- The Semantic Scholar Open Data Platform [79.4493235243312]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature.
We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction.
The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
arXiv Detail & Related papers (2023-01-24T17:13:08Z)
- Multi-label classification for biomedical literature: an overview of the BioCreative VII LitCovid Track for COVID-19 literature topic annotations [13.043042862575192]
The BioCreative LitCovid track calls for a community effort to tackle automated topic annotation for COVID-19 literature.
The dataset consists of over 30,000 articles with manually reviewed topics.
The highest performing submissions achieved 0.8875, 0.9181, and 0.9394 for macro F1-score, micro F1-score, and instance-based F1-score.
arXiv Detail & Related papers (2022-04-20T20:47:55Z)
- What's New? Summarizing Contributions in Scientific Literature [85.95906677964815]
We introduce a new task of disentangled paper summarization, which seeks to generate separate summaries for the paper contributions and the context of the work.
We extend the S2ORC corpus of academic articles by adding disentangled "contribution" and "context" reference labels.
We propose a comprehensive automatic evaluation protocol which reports the relevance, novelty, and disentanglement of generated outputs.
arXiv Detail & Related papers (2020-11-06T02:23:01Z)
- Repurposing TREC-COVID Annotations to Answer the Key Questions of CORD-19 [4.847073702809032]
Coronavirus disease 2019 (COVID-19) began in Wuhan, China, in late 2019 and to date has infected over 14M people worldwide.
The White House aggregated over 200,000 journal articles related to a variety of coronaviruses and tasked the community with answering key questions related to the corpus.
We set out to repurpose the relevancy annotations for TREC-COVID tasks to identify journal articles in CORD-19 which are relevant to the key questions posed by CORD-19.
arXiv Detail & Related papers (2020-08-27T19:51:07Z)
- Navigating the landscape of COVID-19 research through literature analysis: A bird's eye view [11.362549790802483]
We analyze the LitCovid collection, 13,369 COVID-19 related articles found in PubMed as of May 15th, 2020.
We do that by applying state-of-the-art named entity recognition, classification, clustering and other NLP techniques.
Our clustering algorithm identifies topics represented by groups of related terms, and computes clusters corresponding to documents associated with the topic terms.
arXiv Detail & Related papers (2020-08-07T23:39:29Z)
- CO-Search: COVID-19 Information Retrieval with Semantic Search, Question Answering, and Abstractive Summarization [53.67205506042232]
CO-Search is a retriever-ranker semantic search engine designed to handle complex queries over the COVID-19 literature.
To account for the domain-specific and relatively limited dataset, we generate a bipartite graph of document paragraphs and citations.
We evaluate our system on the data of the TREC-COVID information retrieval challenge.
arXiv Detail & Related papers (2020-06-17T01:32:48Z)
- Learning Contextualized Document Representations for Healthcare Answer Retrieval [68.02029435111193]
Contextual Discourse Vectors (CDV) is a distributed document representation for efficient answer retrieval from long documents.
Our model leverages a dual encoder architecture with hierarchical LSTM layers and multi-task training to encode the position of clinical entities and aspects alongside the document discourse.
We show that our generalized model significantly outperforms several state-of-the-art baselines for healthcare passage ranking.
arXiv Detail & Related papers (2020-02-03T15:47:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.