LitMC-BERT: transformer-based multi-label classification of biomedical
literature with an application on COVID-19 literature curation
- URL: http://arxiv.org/abs/2204.08649v1
- Date: Tue, 19 Apr 2022 04:03:45 GMT
- Title: LitMC-BERT: transformer-based multi-label classification of biomedical
literature with an application on COVID-19 literature curation
- Authors: Qingyu Chen, Jingcheng Du, Alexis Allot, and Zhiyong Lu
- Abstract summary: This study proposes LITMC-BERT, a transformer-based multi-label classification method in biomedical literature.
It uses a shared transformer backbone for all the labels while also capturing label-specific features and the correlations between label pairs.
Its micro-F1 and instance-based F1 are 5% and 4% higher than the current best results, respectively.
- Score: 6.998726118579193
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid growth of biomedical literature poses a significant challenge for
curation and interpretation. This has become more evident during the COVID-19
pandemic. LitCovid, a literature database of COVID-19 related papers in PubMed,
has accumulated over 180,000 articles with millions of accesses. Approximately
10,000 new articles are added to LitCovid every month. A main curation task in
LitCovid is topic annotation, where an article is assigned up to eight
topics, e.g., Treatment and Diagnosis. The annotated topics have been widely
used both in LitCovid (e.g., accounting for ~18% of total uses) and downstream
studies such as network generation. However, it has been a primary curation
bottleneck due to the nature of the task and the rapid literature growth. This
study proposes LITMC-BERT, a transformer-based multi-label classification
method in biomedical literature. It uses a shared transformer backbone for all
the labels while also capturing label-specific features and the correlations
between label pairs. We compare LITMC-BERT with three baseline models on two
datasets. Its micro-F1 and instance-based F1 are 5% and 4% higher than the
current best results, respectively, and it requires only ~18% of the inference
time of the Binary BERT baseline. The related datasets and models are
available via https://github.com/ncbi/ml-transformer.
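The abstract describes the architecture only at a high level: a transformer backbone shared across all labels, label-specific feature extraction, and modeling of correlations between label pairs. The following is a minimal PyTorch sketch of that general idea, not the released LITMC-BERT code (see the GitHub link above); the backbone checkpoint, head designs, and class names are illustrative assumptions.

```python
# Sketch of a shared-backbone multi-label classifier with label-specific
# heads and a pairwise label-correlation head. NOT the released LITMC-BERT
# implementation; layer sizes, head design, and backbone name are assumptions.
import itertools
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultiLabelSharedBackbone(nn.Module):
    def __init__(self, backbone_name: str = "bert-base-uncased", num_labels: int = 8):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)  # shared by all labels
        hidden = self.backbone.config.hidden_size
        self.num_labels = num_labels
        # One small head per label to capture label-specific features.
        self.label_heads = nn.ModuleList(
            [nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(), nn.Linear(hidden, 1))
             for _ in range(num_labels)]
        )
        # One binary head per label pair to model pairwise co-occurrence.
        self.pairs = list(itertools.combinations(range(num_labels), 2))
        self.pair_heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in self.pairs])

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] representation from the shared backbone
        label_logits = torch.cat([head(cls) for head in self.label_heads], dim=-1)
        pair_logits = torch.cat([head(cls) for head in self.pair_heads], dim=-1)
        return label_logits, pair_logits   # shapes: (batch, 8), (batch, 28)

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = MultiLabelSharedBackbone()
    batch = tok(["Remdesivir in hospitalized COVID-19 patients."],
                return_tensors="pt", truncation=True, padding=True)
    label_logits, pair_logits = model(batch["input_ids"], batch["attention_mask"])
    print(torch.sigmoid(label_logits))  # per-label topic probabilities
```

In a setup like this, training would typically apply a binary cross-entropy loss to the per-label logits and an auxiliary binary loss to the pairwise logits, and prediction would threshold the per-label sigmoid outputs (e.g., at 0.5).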
Related papers
- Improving Extraction of Clinical Event Contextual Properties from Electronic Health Records: A Comparative Study [2.0884301753594334]
This study performs a comparative analysis of various natural language models for medical text classification.
BERT outperforms Bi-LSTM models by up to 28% and the baseline BERT model by up to 16% for recall of the minority classes.
arXiv Detail & Related papers (2024-08-30T10:28:49Z) - Med-Flamingo: a Multimodal Medical Few-shot Learner [58.85676013818811]
We propose Med-Flamingo, a multimodal few-shot learner adapted to the medical domain.
Based on OpenFlamingo-9B, we continue pre-training on paired and interleaved medical image-text data from publications and textbooks.
We conduct the first human evaluation for generative medical VQA where physicians review the problems and blinded generations in an interactive app.
arXiv Detail & Related papers (2023-07-27T20:36:02Z) - PGB: A PubMed Graph Benchmark for Heterogeneous Network Representation
Learning [5.747361083768407]
We introduce PubMed Graph Benchmark (PGB), a new benchmark for evaluating heterogeneous graph embeddings for biomedical literature.
The benchmark contains rich metadata including abstracts, authors, citations, the MeSH hierarchy, and other information.
arXiv Detail & Related papers (2023-05-04T10:09:08Z) - BiomedCLIP: a multimodal biomedical foundation model pretrained from
fifteen million scientific image-text pairs [48.376109878173956]
We present PMC-15M, a novel dataset that is two orders of magnitude larger than existing biomedical multimodal datasets.
PMC-15M contains 15 million biomedical image-text pairs collected from 4.4 million scientific articles.
Based on PMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with domain-specific adaptations tailored to biomedical vision-language processing.
arXiv Detail & Related papers (2023-03-02T02:20:04Z) - Lightweight Transformers for Clinical Natural Language Processing [9.532776962985828]
This study focuses on development of compact language models for processing clinical texts.
We developed a number of efficient lightweight clinical transformers using knowledge distillation and continual learning.
Our evaluation was done across several standard datasets and covered a wide range of clinical text-mining tasks.
arXiv Detail & Related papers (2023-02-09T16:07:31Z) - Multi-label classification for biomedical literature: an overview of the
BioCreative VII LitCovid Track for COVID-19 literature topic annotations [13.043042862575192]
The BioCreative LitCovid track calls for a community effort to tackle automated topic annotation for COVID-19 literature.
The dataset consists of over 30,000 articles with manually reviewed topics.
The highest performing submissions achieved 0.8875, 0.9181, and 0.9394 for macro F1-score, micro F1-score, and instance-based F1-score, respectively; a short code sketch of these three metrics appears after this list.
arXiv Detail & Related papers (2022-04-20T20:47:55Z) - Discovering Drug-Target Interaction Knowledge from Biomedical Literature [107.98712673387031]
The interaction between Drugs and Targets (DTI) in the human body plays a crucial role in biomedical science and applications.
As millions of papers come out every year in the biomedical domain, automatically discovering DTI knowledge from literature becomes an urgent demand in the industry.
We explore the first end-to-end solution for this task by using generative approaches.
We regard the DTI triplets as a sequence and use a Transformer-based model to directly generate them without using the detailed annotations of entities and relations.
arXiv Detail & Related papers (2021-09-27T17:00:14Z) - Domain-Specific Pretraining for Vertical Search: Case Study on
Biomedical Literature [67.4680600632232]
Self-supervised learning has emerged as a promising direction to overcome the annotation bottleneck.
We propose a general approach for vertical search based on domain-specific pretraining.
Our system can scale to tens of millions of articles on PubMed and has been deployed as Microsoft Biomedical Search.
arXiv Detail & Related papers (2021-06-25T01:02:55Z) - Students Need More Attention: BERT-based AttentionModel for Small Data
with Application to AutomaticPatient Message Triage [65.7062363323781]
We propose a novel framework based on BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining): (i) we introduce Label Embeddings for Self-Attention in each layer of BERT, which we call LESA-BERT, and (ii) by distilling LESA-BERT into smaller variants, we aim to reduce overfitting and model size when working on small datasets.
As an application, our framework is utilized to build a model for patient portal message triage that classifies the urgency of a message into three categories: non-urgent, medium and urgent.
arXiv Detail & Related papers (2020-06-22T03:39:00Z) - CO-Search: COVID-19 Information Retrieval with Semantic Search, Question
Answering, and Abstractive Summarization [53.67205506042232]
CO-Search is a retriever-ranker semantic search engine designed to handle complex queries over the COVID-19 literature.
To account for the domain-specific and relatively limited dataset, we generate a bipartite graph of document paragraphs and citations.
We evaluate our system on the data of the TREC-COVID information retrieval challenge.
arXiv Detail & Related papers (2020-06-17T01:32:48Z) - Document Classification for COVID-19 Literature [15.458071120159307]
We provide an analysis of several multi-label document classification models on the LitCovid dataset.
We find that pre-trained language models fine-tuned on this dataset outperform all other baselines.
We also explore 50 errors made by the best performing models on LitCovid documents.
arXiv Detail & Related papers (2020-06-15T20:03:28Z)
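Both the LITMC-BERT abstract above and the BioCreative VII LitCovid entry report results as macro, micro, and instance-based F1 scores. As a reference for how these three multi-label metrics are commonly computed, here is a small scikit-learn sketch; the toy label matrices are made up for illustration and are not drawn from either dataset.

```python
# Sketch of the three F1 variants cited above for multi-label topic annotation.
# The toy true/predicted label matrices are illustrative only.
import numpy as np
from sklearn.metrics import f1_score

# Rows = articles, columns = topics (e.g., the eight LitCovid topics).
y_true = np.array([[1, 0, 1, 0, 0, 0, 0, 0],
                   [0, 1, 0, 0, 1, 0, 0, 0],
                   [1, 1, 0, 0, 0, 0, 1, 0]])
y_pred = np.array([[1, 0, 0, 0, 0, 0, 0, 0],
                   [0, 1, 0, 0, 1, 0, 0, 0],
                   [1, 0, 0, 0, 0, 0, 1, 0]])

# Macro-F1: F1 per topic, then an unweighted mean over topics.
print("macro F1:", f1_score(y_true, y_pred, average="macro", zero_division=0))
# Micro-F1: pool all (article, topic) decisions, then compute a single F1.
print("micro F1:", f1_score(y_true, y_pred, average="micro", zero_division=0))
# Instance-based (sample-averaged) F1: F1 per article, then a mean over articles.
print("instance F1:", f1_score(y_true, y_pred, average="samples", zero_division=0))
```

Because micro-F1 pools decisions, macro-F1 averages over topics, and instance-based F1 averages over articles, the three reported numbers can differ noticeably on imbalanced topic distributions.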
This list is automatically generated from the titles and abstracts of the papers on this site.