LitMC-BERT: transformer-based multi-label classification of biomedical
literature with an application on COVID-19 literature curation
- URL: http://arxiv.org/abs/2204.08649v1
- Date: Tue, 19 Apr 2022 04:03:45 GMT
- Title: LitMC-BERT: transformer-based multi-label classification of biomedical
literature with an application on COVID-19 literature curation
- Authors: Qingyu Chen, Jingcheng Du, Alexis Allot, and Zhiyong Lu
- Abstract summary: This study proposes LITMC-BERT, a transformer-based multi-label classification method in biomedical literature.
It uses a shared transformer backbone for all the labels while also capturing label-specific features and the correlations between label pairs.
Its micro-F1 and instance-based F1 are 5% and 4% higher than the current best results, respectively.
- Score: 6.998726118579193
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid growth of biomedical literature poses a significant challenge for
curation and interpretation. This has become more evident during the COVID-19
pandemic. LitCovid, a literature database of COVID-19 related papers in PubMed,
has accumulated over 180,000 articles with millions of accesses. Approximately
10,000 new articles are added to LitCovid every month. A main curation task in
LitCovid is topic annotation, where an article is assigned up to eight
topics, e.g., Treatment and Diagnosis. The annotated topics have been widely
used both in LitCovid (e.g., accounting for ~18% of total uses) and downstream
studies such as network generation. However, it has been a primary curation
bottleneck due to the nature of the task and the rapid literature growth. This
study proposes LITMC-BERT, a transformer-based multi-label classification
method in biomedical literature. It uses a shared transformer backbone for all
the labels while also capturing label-specific features and the correlations
between label pairs. We compare LITMC-BERT with three baseline models on two
datasets. Its micro-F1 and instance-based F1 are 5% and 4% higher than the
current best results, respectively, and it requires only ~18% of the inference
time of the Binary BERT baseline. The related datasets and models are
available via https://github.com/ncbi/ml-transformer.
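The abstract describes the architecture only at a high level: a transformer backbone shared across all labels, label-specific feature extraction, and modeling of correlations between label pairs. The following is a minimal PyTorch sketch of that general idea, not the released LITMC-BERT code (see the GitHub link above); the backbone checkpoint, head designs, and class names are illustrative assumptions.

```python
# Sketch of a shared-backbone multi-label classifier with label-specific
# heads and a pairwise label-correlation head. NOT the released LITMC-BERT
# implementation; layer sizes, head design, and backbone name are assumptions.
import itertools
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultiLabelSharedBackbone(nn.Module):
    def __init__(self, backbone_name: str = "bert-base-uncased", num_labels: int = 8):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)  # shared by all labels
        hidden = self.backbone.config.hidden_size
        self.num_labels = num_labels
        # One small head per label to capture label-specific features.
        self.label_heads = nn.ModuleList(
            [nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(), nn.Linear(hidden, 1))
             for _ in range(num_labels)]
        )
        # One binary head per label pair to model pairwise co-occurrence.
        self.pairs = list(itertools.combinations(range(num_labels), 2))
        self.pair_heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in self.pairs])

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] representation from the shared backbone
        label_logits = torch.cat([head(cls) for head in self.label_heads], dim=-1)
        pair_logits = torch.cat([head(cls) for head in self.pair_heads], dim=-1)
        return label_logits, pair_logits   # shapes: (batch, 8), (batch, 28)

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = MultiLabelSharedBackbone()
    batch = tok(["Remdesivir in hospitalized COVID-19 patients."],
                return_tensors="pt", truncation=True, padding=True)
    label_logits, pair_logits = model(batch["input_ids"], batch["attention_mask"])
    print(torch.sigmoid(label_logits))  # per-label topic probabilities
```

In a setup like this, training would typically apply a binary cross-entropy loss to the per-label logits and an auxiliary binary loss to the pairwise logits, and prediction would threshold the per-label sigmoid outputs (e.g., at 0.5).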
Related papers
- Improving Extraction of Clinical Event Contextual Properties from Electronic Health Records: A Comparative Study [2.0884301753594334]
This study performs a comparative analysis of various natural language models for medical text classification.
BERT outperforms Bi-LSTM models by up to 28% and the baseline BERT model by up to 16% for recall of the minority classes.
arXiv Detail & Related papers (2024-08-30T10:28:49Z) - Med-Flamingo: a Multimodal Medical Few-shot Learner [58.85676013818811]
We propose Med-Flamingo, a multimodal few-shot learner adapted to the medical domain.
Based on OpenFlamingo-9B, we continue pre-training on paired and interleaved medical image-text data from publications and textbooks.
We conduct the first human evaluation for generative medical VQA where physicians review the problems and blinded generations in an interactive app.
arXiv Detail & Related papers (2023-07-27T20:36:02Z) - PGB: A PubMed Graph Benchmark for Heterogeneous Network Representation
Learning [5.747361083768407]
We introduce PubMed Graph Benchmark (PGB), a new benchmark for evaluating heterogeneous graph embeddings for biomedical literature.
The benchmark contains rich metadata including abstracts, authors, citations, the MeSH hierarchy, and other information.
arXiv Detail & Related papers (2023-05-04T10:09:08Z) - BiomedCLIP: a multimodal biomedical foundation model pretrained from
fifteen million scientific image-text pairs [48.376109878173956]
We present PMC-15M, a novel dataset that is two orders of magnitude larger than existing biomedical multimodal datasets.
PMC-15M contains 15 million biomedical image-text pairs collected from 4.4 million scientific articles.
Based on PMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with domain-specific adaptations tailored to biomedical vision-language processing.
arXiv Detail & Related papers (2023-03-02T02:20:04Z) - Lightweight Transformers for Clinical Natural Language Processing [9.532776962985828]
This study focuses on development of compact language models for processing clinical texts.
We developed a number of efficient lightweight clinical transformers using knowledge distillation and continual learning.
Our evaluation was done across several standard datasets and covered a wide range of clinical text-mining tasks.
arXiv Detail & Related papers (2023-02-09T16:07:31Z) - Multi-label classification for biomedical literature: an overview of the
BioCreative VII LitCovid Track for COVID-19 literature topic annotations [13.043042862575192]
The BioCreative LitCovid track calls for a community effort to tackle automated topic annotation for COVID-19 literature.
The dataset consists of over 30,000 articles with manually reviewed topics.
The highest performing submissions achieved 0.8875, 0.9181, and 0.9394 for macro F1-score, micro F1-score, and instance-based F1-score, respectively; a short code sketch of these three metrics appears after this list.
arXiv Detail & Related papers (2022-04-20T20:47:55Z) - Discovering Drug-Target Interaction Knowledge from Biomedical Literature [107.98712673387031]
The interaction between Drugs and Targets (DTI) in the human body plays a crucial role in biomedical science and applications.
As millions of papers come out every year in the biomedical domain, automatically discovering DTI knowledge from literature becomes an urgent demand in the industry.
We explore the first end-to-end solution for this task by using generative approaches.
We regard the DTI triplets as a sequence and use a Transformer-based model to directly generate them without using the detailed annotations of entities and relations.
arXiv Detail & Related papers (2021-09-27T17:00:14Z) - Domain-Specific Pretraining for Vertical Search: Case Study on
Biomedical Literature [67.4680600632232]
Self-supervised learning has emerged as a promising direction to overcome the annotation bottleneck.
We propose a general approach for vertical search based on domain-specific pretraining.
Our system can scale to tens of millions of articles on PubMed and has been deployed as Microsoft Biomedical Search.
arXiv Detail & Related papers (2021-06-25T01:02:55Z) - Students Need More Attention: BERT-based AttentionModel for Small Data
with Application to AutomaticPatient Message Triage [65.7062363323781]
We propose a novel framework based on BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining): (i) we introduce Label Embeddings for Self-Attention in each layer of BERT, which we call LESA-BERT, and (ii) by distilling LESA-BERT into smaller variants, we aim to reduce overfitting and model size when working on small datasets.
As an application, our framework is utilized to build a model for patient portal message triage that classifies the urgency of a message into three categories: non-urgent, medium and urgent.
arXiv Detail & Related papers (2020-06-22T03:39:00Z) - CO-Search: COVID-19 Information Retrieval with Semantic Search, Question
Answering, and Abstractive Summarization [53.67205506042232]
CO-Search is a retriever-ranker semantic search engine designed to handle complex queries over the COVID-19 literature.
To account for the domain-specific and relatively limited dataset, we generate a bipartite graph of document paragraphs and citations.
We evaluate our system on the data of the TREC-COVID information retrieval challenge.
arXiv Detail & Related papers (2020-06-17T01:32:48Z) - Document Classification for COVID-19 Literature [15.458071120159307]
We provide an analysis of several multi-label document classification models on the LitCovid dataset.
We find that pre-trained language models fine-tuned on this dataset outperform all other baselines.
We also explore 50 errors made by the best performing models on LitCovid documents.
arXiv Detail & Related papers (2020-06-15T20:03:28Z)
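Both the LITMC-BERT abstract above and the BioCreative VII LitCovid entry report results as macro, micro, and instance-based F1 scores. As a reference for how these three multi-label metrics are commonly computed, here is a small scikit-learn sketch; the toy label matrices are made up for illustration and are not drawn from either dataset.

```python
# Sketch of the three F1 variants cited above for multi-label topic annotation.
# The toy true/predicted label matrices are illustrative only.
import numpy as np
from sklearn.metrics import f1_score

# Rows = articles, columns = topics (e.g., the eight LitCovid topics).
y_true = np.array([[1, 0, 1, 0, 0, 0, 0, 0],
                   [0, 1, 0, 0, 1, 0, 0, 0],
                   [1, 1, 0, 0, 0, 0, 1, 0]])
y_pred = np.array([[1, 0, 0, 0, 0, 0, 0, 0],
                   [0, 1, 0, 0, 1, 0, 0, 0],
                   [1, 0, 0, 0, 0, 0, 1, 0]])

# Macro-F1: F1 per topic, then an unweighted mean over topics.
print("macro F1:", f1_score(y_true, y_pred, average="macro", zero_division=0))
# Micro-F1: pool all (article, topic) decisions, then compute a single F1.
print("micro F1:", f1_score(y_true, y_pred, average="micro", zero_division=0))
# Instance-based (sample-averaged) F1: F1 per article, then a mean over articles.
print("instance F1:", f1_score(y_true, y_pred, average="samples", zero_division=0))
```

Because micro-F1 pools decisions, macro-F1 averages over topics, and instance-based F1 averages over articles, the three reported numbers can differ noticeably on imbalanced topic distributions.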
This list is automatically generated from the titles and abstracts of the papers on this site.