Domain-Specific Pretraining for Vertical Search: Case Study on
Biomedical Literature
- URL: http://arxiv.org/abs/2106.13375v1
- Date: Fri, 25 Jun 2021 01:02:55 GMT
- Title: Domain-Specific Pretraining for Vertical Search: Case Study on
Biomedical Literature
- Authors: Yu Wang, Jinchao Li, Tristan Naumann, Chenyan Xiong, Hao Cheng, Robert
Tinn, Cliff Wong, Naoto Usuyama, Richard Rogahn, Zhihong Shen, Yang Qin, Eric
Horvitz, Paul N. Bennett, Jianfeng Gao, Hoifung Poon
- Abstract summary: Self-supervised learning has emerged as a promising direction to overcome the annotation bottleneck.
We propose a general approach for vertical search based on domain-specific pretraining.
Our system can scale to tens of millions of articles on PubMed and has been deployed as Microsoft Biomedical Search.
- Score: 67.4680600632232
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Information overload is a prevalent challenge in many high-value domains. A
prominent case in point is the explosion of the biomedical literature on
COVID-19, which swelled to hundreds of thousands of papers in a matter of
months. In general, biomedical literature expands by two papers every minute,
totalling over a million new papers every year. Search in the biomedical realm,
and in many other vertical domains, is challenging due to the scarcity of direct
supervision from click logs. Self-supervised learning has emerged as a
promising direction to overcome the annotation bottleneck. We propose a general
approach for vertical search based on domain-specific pretraining and present a
case study for the biomedical domain. Despite being substantially simpler and
not using any relevance labels for training or development, our method performs
comparably or better than the best systems in the official TREC-COVID
evaluation, a COVID-related biomedical search competition. Using distributed
computing in modern cloud infrastructure, our system can scale to tens of
millions of articles on PubMed and has been deployed as Microsoft Biomedical
Search, a new search experience for biomedical literature:
https://aka.ms/biomedsearch.
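The abstract's central claim, that a domain-specific pretrained encoder can rank documents without any click-log supervision, can be made concrete with a short sketch. The snippet below scores candidate abstracts against a query by cosine similarity of mean-pooled embeddings; it is a minimal illustration, not the deployed Microsoft Biomedical Search pipeline, and the PubMedBERT checkpoint name is an assumed stand-in for whichever biomedical encoder one prefers.

```python
# Minimal sketch: zero-shot relevance scoring with a domain-specific
# pretrained encoder. Illustrative only; not the deployed system.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Assumed PubMedBERT-style checkpoint; any biomedical BERT variant works.
MODEL = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL)

def embed(texts):
    """Mean-pool token embeddings over non-padding positions."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state       # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (B, H)

query = embed(["ACE2 receptor binding of the SARS-CoV-2 spike protein"])
docs = embed([
    "Structural basis for the recognition of SARS-CoV-2 by human ACE2.",
    "A survey of topic models for scientific literature.",
])
print(F.cosine_similarity(query, docs))  # higher score = more relevant
```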
Related papers
- A survey of recent methods for addressing AI fairness and bias in
biomedicine [48.46929081146017]
Artificial intelligence systems may perpetuate social inequities or demonstrate biases, such as those based on race or gender.
We surveyed recent publications on different debiasing methods in the fields of biomedical natural language processing (NLP) or computer vision (CV).
We performed a literature search on PubMed, ACM digital library, and IEEE Xplore of relevant articles published between January 2018 and December 2023 using multiple combinations of keywords.
We reviewed other potential methods from the general domain that could be applied to biomedicine to address bias and improve fairness.
arXiv Detail & Related papers (2024-02-13T06:38:46Z)
- MedCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval [5.330363334603656]
We introduce MedCPT, a first-of-its-kind Contrastive Pre-trained Transformer model for zero-shot semantic IR in biomedicine.
To train MedCPT, we collected an unprecedented scale of 255 million user click logs from PubMed.
We show that MedCPT sets new state-of-the-art performance on six biomedical IR tasks.
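The click-log supervision behind MedCPT is typically distilled into an in-batch-negative contrastive (InfoNCE) objective, where each query's clicked article is its positive and the other articles in the batch serve as negatives. The sketch below shows that objective in generic form; it is not the authors' code, and the embeddings are toy stand-ins for encoder outputs.

```python
# Generic in-batch-negative InfoNCE loss over (query, clicked-doc) pairs;
# an illustration of contrastive pretraining, not MedCPT's actual code.
import torch
import torch.nn.functional as F

def info_nce(q_emb, d_emb, temperature=0.05):
    """Row i treats doc i as the positive; all other in-batch
    docs act as negatives."""
    logits = q_emb @ d_emb.t() / temperature   # (B, B) similarity matrix
    labels = torch.arange(q_emb.size(0))       # positives on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage: random unit vectors stand in for encoder outputs.
q = F.normalize(torch.randn(8, 768), dim=-1)
d = F.normalize(torch.randn(8, 768), dim=-1)
print(info_nce(q, d))
```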
arXiv Detail & Related papers (2023-07-02T15:11:59Z)
- LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day [85.19963303642427]
We propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions of biomedical images.
The model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics.
This enables us to train a Large Language and Vision Assistant for BioMedicine in less than 15 hours (with eight A100s).
arXiv Detail & Related papers (2023-06-01T16:50:07Z)
- BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs [48.376109878173956]
We present PMC-15M, a novel dataset that is two orders of magnitude larger than existing biomedical multimodal datasets.
PMC-15M contains 15 million biomedical image-text pairs collected from 4.4 million scientific articles.
Based on PMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with domain-specific adaptations tailored to biomedical vision-language processing.
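Pretraining on image-text pairs at this scale generally relies on the symmetric CLIP objective: cross-entropy over the image-to-text and text-to-image similarity matrices. The sketch below is that standard loss in generic form, not BiomedCLIP's actual training code.

```python
# Standard symmetric CLIP-style contrastive loss; illustrative only.
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, logit_scale=100.0):
    """Matched image-text pairs sit on the diagonal of the logits matrix."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = logit_scale * img_emb @ txt_emb.t()      # (B, B)
    labels = torch.arange(img_emb.size(0))
    return (F.cross_entropy(logits, labels) +         # image -> text
            F.cross_entropy(logits.t(), labels)) / 2  # text -> image

# Toy usage with random features standing in for the two encoders.
print(clip_loss(torch.randn(4, 512), torch.randn(4, 512)))
```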
arXiv Detail & Related papers (2023-03-02T02:20:04Z)
- BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining [140.61707108174247]
We propose BioGPT, a domain-specific generative Transformer language model pre-trained on large scale biomedical literature.
We achieve F1 scores of 44.98%, 38.42%, and 40.76% on the BC5CDR, KD-DTI, and DDI end-to-end relation extraction tasks, respectively, and 78.2% accuracy on PubMedQA.
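For readers who want to try a BioGPT-style model, the Hugging Face transformers library exposes it through the standard generation pipeline; the checkpoint id below is an assumption based on the public release.

```python
# Hedged example: biomedical text generation with an assumed public
# BioGPT checkpoint via the transformers pipeline API.
from transformers import pipeline

generator = pipeline("text-generation", model="microsoft/biogpt")
print(generator("COVID-19 is", max_new_tokens=30)[0]["generated_text"])
```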
arXiv Detail & Related papers (2022-10-19T07:17:39Z)
- Multi-label classification for biomedical literature: an overview of the BioCreative VII LitCovid Track for COVID-19 literature topic annotations [13.043042862575192]
The BioCreative LitCovid track calls for a community effort to tackle automated topic annotation for COVID-19 literature.
The dataset consists of over 30,000 articles with manually reviewed topics.
The highest-performing submissions achieved 0.8875, 0.9181, and 0.9394 for macro F1-score, micro F1-score, and instance-based F1-score, respectively.
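These three numbers correspond to standard multi-label F1 variants, which scikit-learn computes directly; the labels in the snippet below are toy values, not LitCovid data.

```python
# Computing the three F1 variants reported for multi-label topic
# annotation; toy labels, not the LitCovid dataset.
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])  # gold topics
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 1]])  # predicted topics

print("macro F1:   ", f1_score(y_true, y_pred, average="macro"))
print("micro F1:   ", f1_score(y_true, y_pred, average="micro"))
print("instance F1:", f1_score(y_true, y_pred, average="samples"))
```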
arXiv Detail & Related papers (2022-04-20T20:47:55Z)
- Anomaly Detection in Medical Imaging -- A Mini Review [0.8122270502556374]
This paper presents a semi-exhaustive literature review of anomaly detection papers in medical imaging, clustering them by application.
The main results showed that the current research is mostly motivated by reducing the need for labelled data.
Also, the successful and substantial amount of research in the brain MRI domain shows the potential for applications in further domains like OCT and chest X-ray.
arXiv Detail & Related papers (2021-08-25T11:45:40Z)
- CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, single-sentence/sentence-pair classification.
We report empirical results for 11 current pre-trained Chinese models; the experiments show that state-of-the-art neural models still perform far below the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z)
- Low Resource Recognition and Linking of Biomedical Concepts from a Large Ontology [30.324906836652367]
PubMed, the most well-known database of biomedical papers, relies on human curators to annotate each paper with concepts from a large ontology.
Our approach achieves new state-of-the-art results for the UMLS in both traditional recognition/linking and semantic indexing-based evaluation.
arXiv Detail & Related papers (2021-01-26T06:41:12Z)
- Medical Deep Learning -- A systematic Meta-Review [0.4256574128156698]
Deep learning (DL) has impacted several different scientific disciplines over the last few years.
DL has delivered state-of-the-art results in tasks like autonomous driving, outclassing previous attempts.
With the collection of large quantities of patient records and data, there is a great need for automated and reliable processing and analysis of health information.
arXiv Detail & Related papers (2020-10-28T11:01:40Z)