MedCPT: Contrastive Pre-trained Transformers with Large-scale PubMed
Search Logs for Zero-shot Biomedical Information Retrieval
- URL: http://arxiv.org/abs/2307.00589v2
- Date: Wed, 4 Oct 2023 01:43:15 GMT
- Title: MedCPT: Contrastive Pre-trained Transformers with Large-scale PubMed
Search Logs for Zero-shot Biomedical Information Retrieval
- Authors: Qiao Jin, Won Kim, Qingyu Chen, Donald C. Comeau, Lana Yeganova, W.
John Wilbur, Zhiyong Lu
- Abstract summary: We introduce MedCPT, a first-of-its-kind Contrastively Pre-trained Transformer model for zero-shot semantic IR in biomedicine.
To train MedCPT, we collected an unprecedented scale of 255 million user click logs from PubMed.
We show that MedCPT sets new state-of-the-art performance on six biomedical IR tasks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Information retrieval (IR) is essential in biomedical knowledge acquisition
and clinical decision support. While recent progress has shown that language
model encoders perform better semantic retrieval, training such models requires
abundant query-article annotations that are difficult to obtain in biomedicine.
As a result, most biomedical IR systems only conduct lexical matching. In
response, we introduce MedCPT, a first-of-its-kind Contrastively Pre-trained
Transformer model for zero-shot semantic IR in biomedicine. To train MedCPT, we
collected an unprecedented scale of 255 million user click logs from PubMed.
With this data, we use contrastive learning to train a
closely integrated retriever and re-ranker pair. Experimental results show that
MedCPT sets new state-of-the-art performance on six biomedical IR tasks,
outperforming various baselines including much larger models such as
GPT-3-sized cpt-text-XL. In addition, MedCPT also generates better biomedical
article and sentence representations for semantic evaluations. As such, MedCPT
can be readily applied to various real-world biomedical IR tasks.
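The abstract describes training a retriever with contrastive learning on query-article click pairs, but does not spell out the objective. As a rough illustration only, here is a minimal NumPy sketch of the standard in-batch contrastive (InfoNCE) loss commonly used for such dense retrievers; the random toy vectors stand in for encoder outputs, and nothing here is MedCPT's actual code.

```python
import numpy as np

def info_nce_loss(q_emb, a_emb, temperature=0.07):
    """In-batch contrastive (InfoNCE) loss: each query's clicked article
    is its positive; the other articles in the batch act as negatives.
    q_emb and a_emb are (batch, dim) arrays of L2-normalized query and
    article embeddings produced by the two encoders."""
    logits = (q_emb @ a_emb.T) / temperature       # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # subtract row max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives lie on the diagonal: query i was clicked with article i.
    return -np.mean(np.diag(log_probs))

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Toy example: random unit vectors standing in for encoder outputs,
# with each "clicked" article perturbed to sit near its query.
rng = np.random.default_rng(0)
q = l2_normalize(rng.normal(size=(4, 8)))
a = l2_normalize(q + 0.1 * rng.normal(size=(4, 8)))
loss = info_nce_loss(q, a)
print(float(loss))
```

With correctly paired positives the diagonal similarities dominate and the loss is small; shuffling the articles so queries no longer face their clicked articles drives the loss up, which is what pushes the encoders to embed clicked query-article pairs close together.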
Related papers
- BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers [48.21255861863282]
BMRetriever is a series of dense retrievers for enhancing biomedical retrieval.
BMRetriever exhibits strong parameter efficiency, with the 410M variant outperforming baselines up to 11.7 times larger.
arXiv Detail & Related papers (2024-04-29T05:40:08Z)
- Towards Generalist Biomedical AI [28.68106423175678]
We introduce Med-PaLM Multimodal (Med-PaLM M), our proof of concept for a generalist biomedical AI system.
Med-PaLM M is a large multimodal generative model that flexibly encodes and interprets biomedical data.
We conduct a radiologist evaluation of model-generated (and human) chest X-ray reports and observe encouraging performance across model scales.
arXiv Detail & Related papers (2023-07-26T17:52:22Z)
- Biomedical Language Models are Robust to Sub-optimal Tokenization [30.175714262031253]
Most modern biomedical language models (LMs) are pre-trained using standard domain-specific tokenizers.
We find that pre-training a biomedical LM using a more accurate biomedical tokenizer does not improve the entity representation quality of a language model.
arXiv Detail & Related papers (2023-06-30T13:35:24Z)
- Customizing General-Purpose Foundation Models for Medical Report Generation [64.31265734687182]
The scarcity of labelled medical image-report pairs presents great challenges in the development of deep and large-scale neural networks.
We propose customizing off-the-shelf general-purpose large-scale pre-trained models, i.e., foundation models (FMs) in computer vision and natural language processing.
arXiv Detail & Related papers (2023-06-09T03:02:36Z)
- LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day [85.19963303642427]
We propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions of biomedical images.
The model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics.
This enables us to train a Large Language and Vision Assistant for BioMedicine in less than 15 hours (on eight A100s).
arXiv Detail & Related papers (2023-06-01T16:50:07Z)
- BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks [68.39821375903591]
Generalist AI holds the potential to address limitations due to its versatility in interpreting different data types.
Here, we propose BiomedGPT, the first open-source and lightweight vision-language foundation model.
arXiv Detail & Related papers (2023-05-26T17:14:43Z)
- BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs [48.376109878173956]
We present PMC-15M, a novel dataset that is two orders of magnitude larger than existing biomedical multimodal datasets.
PMC-15M contains 15 million biomedical image-text pairs collected from 4.4 million scientific articles.
Based on PMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with domain-specific adaptations tailored to biomedical vision-language processing.
arXiv Detail & Related papers (2023-03-02T02:20:04Z)
- Domain-Specific Pretraining for Vertical Search: Case Study on Biomedical Literature [67.4680600632232]
Self-supervised learning has emerged as a promising direction to overcome the annotation bottleneck.
We propose a general approach for vertical search based on domain-specific pretraining.
Our system can scale to tens of millions of articles on PubMed and has been deployed as Microsoft Biomedical Search.
arXiv Detail & Related papers (2021-06-25T01:02:55Z)
- Multi-Perspective Semantic Information Retrieval in the Biomedical Domain [0.0]
Information Retrieval (IR) is the task of obtaining pieces of data (such as documents) that are relevant to a particular query or need.
Modern neural approaches pose certain advantages compared to their classical counterparts.
This work presents contributions to several aspects of the Biomedical Semantic Information Retrieval domain.
arXiv Detail & Related papers (2020-07-17T21:05:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.