Multi-label classification for biomedical literature: an overview of the
BioCreative VII LitCovid Track for COVID-19 literature topic annotations
- URL: http://arxiv.org/abs/2204.09781v1
- Date: Wed, 20 Apr 2022 20:47:55 GMT
- Authors: Qingyu Chen, Alexis Allot, Robert Leaman, Rezarta Islamaj Doğan,
Jingcheng Du, Li Fang, Wang Kai, Shuo Xu, Yuefu Zhang, Parsa Bagherzadeh,
Sabine Bergler, Aakash Bhatnagar, Nidhir Bhavsar, Yung-Chun Chang, Sheng-Jie
Lin, Wentai Tang, Hongtong Zhang, Ilija Tavchioski, Shubo Tian, Jinfeng
Zhang, Yulia Otmakhova, Antonio Jimeno Yepes, Hang Dong, Honghan Wu, Richard
Dufour, Yanis Labrak, Niladri Chatterjee, Kushagri Tandon, Fréjus Laleye,
Loïc Rakotoson, Emmanuele Chersoni, Jinghang Gu, Annemarie Friedrich,
Subhash Chandra Pujari, Mariia Chizhikova, Naveen Sivadasan, Zhiyong Lu
- Abstract summary: The BioCreative LitCovid track calls for a community effort to tackle automated topic annotation for COVID-19 literature.
The dataset consists of over 30,000 articles with manually reviewed topics.
The highest performing submissions achieved 0.8875, 0.9181, and 0.9394 for macro F1-score, micro F1-score, and instance-based F1-score.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The COVID-19 pandemic has been severely impacting global society since
December 2019. Massive research has been undertaken to understand the
characteristics of the virus and design vaccines and drugs. The related
findings have been reported in biomedical literature at a rate of about 10,000
articles on COVID-19 per month. Such rapid growth significantly challenges
manual curation and interpretation. For instance, LitCovid is a literature
database of COVID-19-related articles in PubMed, which has accumulated more
than 200,000 articles with millions of accesses each month by users worldwide.
One primary curation task is to assign up to eight topics (e.g., Diagnosis and
Treatment) to the articles in LitCovid. Despite the continuing advances in
biomedical text mining methods, few have been dedicated to topic annotations in
COVID-19 literature. To close the gap, we organized the BioCreative LitCovid
track to call for a community effort to tackle automated topic annotation for
COVID-19 literature. The BioCreative LitCovid dataset, consisting of over
30,000 articles with manually reviewed topics, was created for training and
testing. It is one of the largest multilabel classification datasets in
biomedical scientific literature. 19 teams worldwide participated and made 80
submissions in total. Most teams used hybrid systems based on transformers. The
highest performing submissions achieved 0.8875, 0.9181, and 0.9394 for macro
F1-score, micro F1-score, and instance-based F1-score, respectively. The level
of participation and results demonstrate a successful track and help close the
gap between dataset curation and method development. The dataset is publicly
available via https://ftp.ncbi.nlm.nih.gov/pub/lu/LitCovid/biocreative/ for
benchmarking and further development.
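The abstract reports results under three multi-label F1 variants: macro F1 (mean of per-topic F1), micro F1 (pooled over all topic decisions), and instance-based F1 (mean of per-article F1). A minimal sketch of how these three scores are computed, using scikit-learn on toy label matrices (the labels below are illustrative, not LitCovid data):

```python
# Computing macro, micro, and instance-based F1 for a multi-label task.
import numpy as np
from sklearn.metrics import f1_score

# Binary indicator matrices: rows = articles, columns = topics
# (e.g., LitCovid topics such as Diagnosis or Treatment).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 1, 1]])

macro = f1_score(y_true, y_pred, average="macro")       # mean of per-topic F1
micro = f1_score(y_true, y_pred, average="micro")       # pooled over all topic decisions
instance = f1_score(y_true, y_pred, average="samples")  # mean of per-article F1
print(macro, micro, instance)
```

Macro F1 weights every topic equally, so it is sensitive to rare topics; micro F1 is dominated by frequent ones; instance-based F1 reflects how well each article's full topic set is recovered, which is why the track reports all three.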
Related papers
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
- BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining [140.61707108174247]
We propose BioGPT, a domain-specific generative Transformer language model pre-trained on large scale biomedical literature.
We get 44.98%, 38.42% and 40.76% F1 score on BC5CDR, KD-DTI and DDI end-to-end relation extraction tasks respectively, and 78.2% accuracy on PubMedQA.
arXiv Detail & Related papers (2022-10-19T07:17:39Z)
- LitMC-BERT: transformer-based multi-label classification of biomedical literature with an application on COVID-19 literature curation [6.998726118579193]
This study proposes LITMC-BERT, a transformer-based multi-label classification method in biomedical literature.
It uses a shared transformer backbone for all the labels while also capturing label-specific features and the correlations between label pairs.
Its micro-F1 and instance-based F1 are 5% and 4% higher than the current best results, respectively.
arXiv Detail & Related papers (2022-04-19T04:03:45Z)
- Domain-Specific Pretraining for Vertical Search: Case Study on Biomedical Literature [67.4680600632232]
Self-supervised learning has emerged as a promising direction to overcome the annotation bottleneck.
We propose a general approach for vertical search based on domain-specific pretraining.
Our system can scale to tens of millions of articles on PubMed and has been deployed as Microsoft Biomedical Search.
arXiv Detail & Related papers (2021-06-25T01:02:55Z)
- Navigating the landscape of COVID-19 research through literature analysis: A bird's eye view [11.362549790802483]
We analyze the LitCovid collection, 13,369 COVID-19 related articles found in PubMed as of May 15th, 2020.
We do that by applying state-of-the-art named entity recognition, classification, clustering and other NLP techniques.
Our clustering algorithm identifies topics represented by groups of related terms, and computes clusters corresponding to documents associated with the topic terms.
arXiv Detail & Related papers (2020-08-07T23:39:29Z)
- A System for Worldwide COVID-19 Information Aggregation [92.60866520230803]
We build a system for worldwide COVID-19 information aggregation containing reliable articles from 10 regions in 7 languages sorted by topics.
A neural machine translation module translates articles in other languages into Japanese and English.
A BERT-based topic-classifier trained on our article-topic pair dataset helps users find their interested information efficiently.
arXiv Detail & Related papers (2020-07-28T01:33:54Z)
- Coronavirus Knowledge Graph: A Case Study [4.646516629534201]
We use several Machine Learning, Deep Learning, and Knowledge Graph construction and mining techniques to identify COVID-19 related experts and bio-entities.
We suggest possible techniques to predict related diseases, drug candidates, gene, gene mutations, and related compounds.
arXiv Detail & Related papers (2020-07-04T03:55:31Z)
- COVID-19 Literature Knowledge Graph Construction and Drug Repurposing Report Generation [79.33545724934714]
We have developed a novel and comprehensive knowledge discovery framework, COVID-KG, to extract fine-grained multimedia knowledge elements from scientific literature.
Our framework also provides detailed contextual sentences, subfigures, and knowledge subgraphs as evidence.
arXiv Detail & Related papers (2020-07-01T16:03:20Z)
- CO-Search: COVID-19 Information Retrieval with Semantic Search, Question Answering, and Abstractive Summarization [53.67205506042232]
CO-Search is a retriever-ranker semantic search engine designed to handle complex queries over the COVID-19 literature.
To account for the domain-specific and relatively limited dataset, we generate a bipartite graph of document paragraphs and citations.
We evaluate our system on the data of the TREC-COVID information retrieval challenge.
arXiv Detail & Related papers (2020-06-17T01:32:48Z)
- Document Classification for COVID-19 Literature [15.458071120159307]
We provide an analysis of several multi-label document classification models on the LitCovid dataset.
We find that pre-trained language models fine-tuned on this dataset outperform all other baselines.
We also explore 50 errors made by the best-performing models on LitCovid documents.
arXiv Detail & Related papers (2020-06-15T20:03:28Z)
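The LitMC-BERT summary above describes a shared transformer backbone whose output feeds independent per-label classifiers. A minimal NumPy sketch of that shared-representation, per-label-sigmoid idea (dimensions, weights, and inputs are purely illustrative stand-ins for an encoder's [CLS] embeddings; the paper's label-pair correlation module is not reproduced here):

```python
# Sketch: multi-label head over a shared encoder representation.
# Each label gets its own linear classifier and sigmoid, so labels
# are predicted independently (unlike softmax single-label setups).
import numpy as np

rng = np.random.default_rng(0)

d_model, n_labels = 8, 3              # hypothetical sizes
x = rng.normal(size=(4, d_model))     # stand-in for shared encoder outputs, 4 articles

W = rng.normal(size=(n_labels, d_model))  # one weight row per label
b = np.zeros(n_labels)

logits = x @ W.T + b
probs = 1.0 / (1.0 + np.exp(-logits))   # per-label sigmoid, not softmax
preds = (probs >= 0.5).astype(int)      # independent 0.5 threshold per label
```

The per-label sigmoid is what makes this multi-label: an article can receive zero, one, or several topics, matching the LitCovid annotation setup.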
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed and is not responsible for any consequences of its use.