A Comprehensive Evaluation of Large Language Models on Benchmark
Biomedical Text Processing Tasks
- URL: http://arxiv.org/abs/2310.04270v3
- Date: Mon, 19 Feb 2024 22:58:39 GMT
- Title: A Comprehensive Evaluation of Large Language Models on Benchmark
Biomedical Text Processing Tasks
- Authors: Israt Jahan, Md Tahmid Rahman Laskar, Chun Peng, Jimmy Huang
- Abstract summary: This paper aims to evaluate the performance of Large Language Models (LLM) on benchmark biomedical tasks.
To the best of our knowledge, this is the first work that conducts an extensive evaluation and comparison of various LLMs in the biomedical domain.
- Score: 2.5027382653219155
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently, Large Language Models (LLM) have demonstrated impressive capability
to solve a wide range of tasks. However, despite their success across various
tasks, no prior work has investigated their capability in the biomedical domain
yet. To this end, this paper aims to evaluate the performance of LLMs on
benchmark biomedical tasks. For this purpose, we conduct a comprehensive
evaluation of 4 popular LLMs in 6 diverse biomedical tasks across 26 datasets.
To the best of our knowledge, this is the first work that conducts an extensive
evaluation and comparison of various LLMs in the biomedical domain.
Interestingly, we find based on our evaluation that in biomedical datasets that
have smaller training sets, zero-shot LLMs even outperform the current
state-of-the-art fine-tuned biomedical models. This suggests that pretraining
on large text corpora makes LLMs quite specialized even in the biomedical
domain. We also find that not a single LLM can outperform other LLMs in all
tasks, with the performance of different LLMs may vary depending on the task.
While their performance is still quite poor in comparison to the biomedical
models that were fine-tuned on large training sets, our findings demonstrate
that LLMs have the potential to be a valuable tool for various biomedical tasks
that lack large annotated data.
Related papers
- JMedBench: A Benchmark for Evaluating Japanese Biomedical Large Language Models [29.92429306565324]
We propose a new benchmark for evaluating Japanese biomedical large language models (LLMs)
Experimental results indicate that:.
LLMs with a better understanding of Japanese and richer biomedical knowledge achieve better performance in Japanese biomedical tasks.
arXiv Detail & Related papers (2024-09-20T08:25:16Z) - A Survey for Large Language Models in Biomedicine [31.719451674137844]
This review is based on an analysis of 484 publications sourced from databases including PubMed, Web of Science, and arXiv.
We explore the capabilities of LLMs in zero-shot learning across a broad spectrum of biomedical tasks, including diagnostic assistance, drug discovery, and personalized medicine.
We discuss the challenges that LLMs face in the biomedicine domain including data privacy concerns, limited model interpretability, issues with dataset quality, and ethics.
arXiv Detail & Related papers (2024-08-29T12:39:16Z) - Biomedical Large Languages Models Seem not to be Superior to Generalist Models on Unseen Medical Data [3.469567586411153]
Large language models (LLMs) have shown potential in biomedical applications, leading to efforts to fine-tune them on domain-specific data.
This study evaluates the performance of biomedically fine-tuned LLMs against their general-purpose counterparts on a variety of clinical tasks.
arXiv Detail & Related papers (2024-08-25T13:36:22Z) - An Evaluation of Large Language Models in Bioinformatics Research [52.100233156012756]
We study the performance of large language models (LLMs) on a wide spectrum of crucial bioinformatics tasks.
These tasks include the identification of potential coding regions, extraction of named entities for genes and proteins, detection of antimicrobial and anti-cancer peptides, molecular optimization, and resolution of educational bioinformatics problems.
Our findings indicate that, given appropriate prompts, LLMs like GPT variants can successfully handle most of these tasks.
arXiv Detail & Related papers (2024-02-21T11:27:31Z) - Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks.
LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model's parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z) - Taiyi: A Bilingual Fine-Tuned Large Language Model for Diverse
Biomedical Tasks [19.091278630792615]
Most existing biomedical large language models (LLMs) focus on enhancing performance in monolingual biomedical question answering and conversation tasks.
We present Taiyi, a bilingual fine-tuned LLM for diverse biomedical tasks.
arXiv Detail & Related papers (2023-11-20T08:51:30Z) - A Survey of Large Language Models in Medicine: Progress, Application, and Challenge [85.09998659355038]
Large language models (LLMs) have received substantial attention due to their capabilities for understanding and generating human language.
This review aims to provide a detailed overview of the development and deployment of LLMs in medicine.
arXiv Detail & Related papers (2023-11-09T02:55:58Z) - Large Language Models Illuminate a Progressive Pathway to Artificial
Healthcare Assistant: A Review [16.008511195589925]
Large language models (LLMs) have shown promising capabilities in mimicking human-level language comprehension and reasoning.
This paper provides a comprehensive review on the applications and implications of LLMs in medicine.
arXiv Detail & Related papers (2023-11-03T13:51:36Z) - BLESS: Benchmarking Large Language Models on Sentence Simplification [55.461555829492866]
We present BLESS, a performance benchmark of the most recent state-of-the-art large language models (LLMs) on the task of text simplification (TS)
We assess a total of 44 models, differing in size, architecture, pre-training methods, and accessibility, on three test sets from different domains (Wikipedia, news, and medical) under a few-shot setting.
Our evaluation indicates that the best LLMs, despite not being trained on TS, perform comparably with state-of-the-art TS baselines.
arXiv Detail & Related papers (2023-10-24T12:18:17Z) - Augmenting Black-box LLMs with Medical Textbooks for Biomedical Question Answering (Published in Findings of EMNLP 2024) [48.17095875619711]
We present a system called LLMs Augmented with Medical Textbooks (LLM-AMT)
LLM-AMT integrates authoritative medical textbooks into the LLMs' framework using plug-and-play modules.
We found that medical textbooks as a retrieval corpus is proven to be a more effective knowledge database than Wikipedia in the medical domain.
arXiv Detail & Related papers (2023-09-05T13:39:38Z) - Sentiment Analysis in the Era of Large Language Models: A Reality Check [69.97942065617664]
This paper investigates the capabilities of large language models (LLMs) in performing various sentiment analysis tasks.
We evaluate performance across 13 tasks on 26 datasets and compare the results against small language models (SLMs) trained on domain-specific datasets.
arXiv Detail & Related papers (2023-05-24T10:45:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.