Benchmarking Large Language Models on Multiple Tasks in Bioinformatics NLP with Prompting
- URL: http://arxiv.org/abs/2503.04013v1
- Date: Thu, 06 Mar 2025 02:01:59 GMT
- Title: Benchmarking Large Language Models on Multiple Tasks in Bioinformatics NLP with Prompting
- Authors: Jiyue Jiang, Pengan Chen, Jiuming Wang, Dongchen He, Ziqin Wei, Liang Hong, Licheng Zong, Sheng Wang, Qinze Yu, Zixian Ma, Yanyu Chen, Yimin Fan, Xiangyu Shi, Jiawei Sun, Chuan Wu, Yu Li
- Abstract summary: Large language models (LLMs) have become important tools in solving biological problems. We introduce a comprehensive prompting-based benchmarking framework, termed Bio-benchmark. We evaluate six mainstream LLMs, including GPT-4o and Llama-3.1-70b, using 0-shot and few-shot Chain-of-Thought settings.
- Score: 17.973195066083797
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have become important tools in solving biological problems, offering improvements in accuracy and adaptability over conventional methods. Several benchmarks have been proposed to evaluate the performance of these LLMs. However, current benchmarks can hardly evaluate the performance of these models across diverse tasks effectively. In this paper, we introduce a comprehensive prompting-based benchmarking framework, termed Bio-benchmark, which includes 30 key bioinformatics tasks covering areas such as proteins, RNA, drugs, electronic health records, and traditional Chinese medicine. Using this benchmark, we evaluate six mainstream LLMs, including GPT-4o and Llama-3.1-70b, under 0-shot and few-shot Chain-of-Thought (CoT) settings without fine-tuning to reveal their intrinsic capabilities. To improve the efficiency of our evaluations, we demonstrate BioFinder, a new tool for extracting answers from LLM responses, which increases extraction accuracy by around 30% compared to existing methods. Our benchmark results show which biological tasks are well suited to current LLMs and identify specific areas requiring enhancement. Furthermore, we propose targeted prompt engineering strategies for optimizing LLM performance in these contexts. Based on these findings, we provide recommendations for the development of more robust LLMs tailored for various biological applications. This work offers a comprehensive evaluation framework and robust tools to support the application of LLMs in bioinformatics.
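To make the evaluation settings concrete, the sketch below shows, in Python, how a 0-shot and a few-shot Chain-of-Thought prompt might be assembled for a multiple-choice bioinformatics item, together with a simple regex-based answer-extraction step in the spirit of BioFinder. The prompt templates, the example item, and the extraction pattern are illustrative assumptions, not the actual Bio-benchmark prompts or the BioFinder implementation.

```python
# Illustrative sketch only: the templates, example item, and extraction regex below
# are assumptions for demonstration, not the paper's released prompts or tooling.
import re

def zero_shot_cot_prompt(question):
    """Build a 0-shot Chain-of-Thought prompt for a bioinformatics question."""
    return (f"{question}\n"
            "Let's think step by step, then give the final answer as 'Answer: <choice>'.")

def few_shot_cot_prompt(question, demos):
    """Build a few-shot CoT prompt from (question, reasoning, answer) demonstrations."""
    blocks = [f"Q: {q}\nReasoning: {r}\nAnswer: {a}" for q, r, a in demos]
    blocks.append(f"Q: {question}\nReasoning:")
    return "\n\n".join(blocks)

def extract_answer(response):
    """Pull the final answer out of a free-form LLM response (BioFinder-style post-processing)."""
    match = re.search(r"Answer:\s*([A-Za-z0-9\-\. ]+)", response)
    return match.group(1).strip() if match else None

# Example usage with a hypothetical multiple-choice item.
question = ("Which term best describes a protein region that binds a small molecule? "
            "(A) epitope (B) active site (C) intron (D) codon")
demos = [("Which macromolecule stores genetic information? (A) DNA (B) lipid",
          "Genetic information is stored in nucleic acids.", "A")]
print(zero_shot_cot_prompt(question))
print(few_shot_cot_prompt(question, demos))
print(extract_answer("The ligand binds at a specific pocket... Answer: B"))
```

The few-shot variant simply prepends worked (question, reasoning, answer) demonstrations so the model imitates the reasoning format before giving its own answer.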
Related papers
- Adaptive Pruning for Large Language Models with Structural Importance Awareness [66.2690963378878]
Large language models (LLMs) have significantly improved language understanding and generation capabilities. LLMs are difficult to deploy on resource-constrained edge devices due to their high computational and storage resource demands. We propose structurally-aware adaptive pruning (SAAP) to significantly reduce the computational and memory costs while maintaining model performance.
arXiv Detail & Related papers (2024-12-19T18:08:04Z)
- MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia.
In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models.
This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z)
- A Survey of Small Language Models [104.80308007044634]
Small Language Models (SLMs) have become increasingly important due to their efficiency and their ability to perform various language tasks with minimal computational resources.
We present a comprehensive survey on SLMs, focusing on their architectures, training techniques, and model compression techniques.
arXiv Detail & Related papers (2024-10-25T23:52:28Z)
- LLMs are not Zero-Shot Reasoners for Biomedical Information Extraction [13.965777046473885]
Large Language Models (LLMs) are increasingly adopted for applications in healthcare.
It is unclear how well LLMs perform on tasks that are traditionally pursued in the biomedical domain.
arXiv Detail & Related papers (2024-08-22T09:37:40Z)
- SeRTS: Self-Rewarding Tree Search for Biomedical Retrieval-Augmented Generation [50.26966969163348]
Large Language Models (LLMs) have shown great potential in the biomedical domain with the advancement of retrieval-augmented generation (RAG).
Existing retrieval-augmented approaches face challenges in addressing diverse queries and documents, particularly for medical knowledge queries.
We propose Self-Rewarding Tree Search (SeRTS) based on Monte Carlo Tree Search (MCTS) and a self-rewarding paradigm.
arXiv Detail & Related papers (2024-06-17T06:48:31Z)
- BiomedRAG: A Retrieval Augmented Large Language Model for Biomedicine [19.861178160437827]
Large Language Models (LLMs) have swiftly emerged as vital resources for different applications in the biomedical and healthcare domains.
BiomedRAG attains superior performance across 5 biomedical NLP tasks.
BiomedRAG outperforms other triple extraction systems with micro-F1 scores of 81.42 and 88.83 on GIT and ChemProt corpora, respectively.
arXiv Detail & Related papers (2024-05-01T12:01:39Z)
- RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
- Unveiling LLM Evaluation Focused on Metrics: Challenges and Solutions [2.5179515260542544]
Large Language Models (LLMs) have gained significant attention across academia and industry for their versatile applications in text generation, question answering, and text summarization.
To quantify the performance, it's crucial to have a comprehensive grasp of existing metrics.
This paper offers a comprehensive exploration of LLM evaluation from a metrics perspective, providing insights into the selection and interpretation of metrics currently in use.
arXiv Detail & Related papers (2024-04-14T03:54:00Z)
- An Evaluation of Large Language Models in Bioinformatics Research [52.100233156012756]
We study the performance of large language models (LLMs) on a wide spectrum of crucial bioinformatics tasks.
These tasks include the identification of potential coding regions, extraction of named entities for genes and proteins, detection of antimicrobial and anti-cancer peptides, molecular optimization, and resolution of educational bioinformatics problems.
Our findings indicate that, given appropriate prompts, LLMs like GPT variants can successfully handle most of these tasks.
arXiv Detail & Related papers (2024-02-21T11:27:31Z)
- Biomedical knowledge graph-optimized prompt generation for large language models [1.6658478064349376]
Large Language Models (LLMs) are being adopted at an unprecedented rate, yet still face challenges in knowledge-intensive domains like biomedicine.
Here, we introduce a token-optimized and robust Knowledge Graph-based Retrieval Augmented Generation framework.
arXiv Detail & Related papers (2023-11-29T03:07:00Z)
- BioInstruct: Instruction Tuning of Large Language Models for Biomedical Natural Language Processing [10.698756010878688]
We created BioInstruct, a dataset of 25,005 instructions for instruction-tuning large language models (LLMs).
The instructions were created by prompting the GPT-4 language model with three seed samples randomly drawn from 80 human-curated instructions.
We evaluated these instruction-tuned LLMs on several BioNLP tasks, which can be grouped into three major categories: question answering (QA), information extraction (IE), and text generation (GEN). A minimal sketch of this seed-sampled instruction-generation loop appears after this list.
arXiv Detail & Related papers (2023-10-30T19:38:50Z)
- OpsEval: A Comprehensive IT Operations Benchmark Suite for Large Language Models [26.590755599827993]
We present OpsEval, a comprehensive task-oriented Ops benchmark designed for large language models (LLMs).
The benchmark includes 7184 multi-choice questions and 1736 question-answering (QA) formats in English and Chinese.
To ensure the credibility of our evaluation, we invite dozens of domain experts to manually review our questions.
arXiv Detail & Related papers (2023-10-11T16:33:29Z)
- LLMRec: Benchmarking Large Language Models on Recommendation Task [54.48899723591296]
The application of Large Language Models (LLMs) in the recommendation domain has not been thoroughly investigated.
We benchmark several popular off-the-shelf LLMs on five recommendation tasks, including rating prediction, sequential recommendation, direct recommendation, explanation generation, and review summarization.
The benchmark results indicate that LLMs displayed only moderate proficiency in accuracy-based tasks such as sequential and direct recommendation.
arXiv Detail & Related papers (2023-08-23T16:32:54Z)
- A systematic evaluation of large language models for biomedical natural language processing: benchmarks, baselines, and recommendations [22.668383945059762]
We present a systematic evaluation of four representative Large Language Models (LLMs) across 12 BioNLP datasets.
The evaluation is conducted under four settings: zero-shot, static few-shot, dynamic K-nearest few-shot, and fine-tuning; a toy sketch of dynamic K-nearest few-shot selection appears after this list.
We compare these models against state-of-the-art (SOTA) approaches that fine-tune (domain-specific) BERT or BART models.
arXiv Detail & Related papers (2023-05-10T13:40:06Z)
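For the BioInstruct entry above, the following sketch illustrates the kind of seed-sampled instruction-generation loop it describes: three instructions are drawn from a small human-curated pool and a model is asked to write a new one in the same style. The pool contents, prompt wording, and the call_llm hook are assumptions for illustration, not the authors' actual pipeline.

```python
# Illustrative sketch of a seed-sampled instruction-generation loop.
# The seed pool and prompt wording are invented stand-ins, not BioInstruct's data.
import random

SEED_POOL = [  # stand-in for the 80 human-curated instructions
    "Summarize the key findings of this clinical note.",
    "Extract all gene names mentioned in the abstract.",
    "Answer this question about drug-drug interactions.",
]

def build_generation_prompt(seeds):
    """Ask the model for a new instruction in the style of three sampled seeds."""
    examples = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(seeds))
    return ("Here are three example instructions for biomedical NLP tasks:\n"
            f"{examples}\n"
            "Write one new, distinct instruction in the same style.")

def generate_instructions(n, call_llm):
    """Draw three seeds per round and collect model-written instructions."""
    out = []
    for _ in range(n):
        seeds = random.sample(SEED_POOL, k=3)
        out.append(call_llm(build_generation_prompt(seeds)))
    return out

# `call_llm` is any function mapping a prompt string to a model response string,
# e.g. a thin wrapper around an OpenAI- or HuggingFace-style chat call.
```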
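For the systematic-evaluation entry above, this sketch shows one way the dynamic K-nearest few-shot setting can be realized: the demonstrations for each test query are chosen by similarity rather than fixed in advance. The TF-IDF representation and cosine similarity here are simple stand-ins; the paper's actual retriever may differ.

```python
# Illustrative sketch of dynamic K-nearest few-shot demonstration selection.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_k_nearest_demos(query, train_inputs, train_outputs, k=3):
    """Pick the k training examples most similar to the query to use as demonstrations."""
    vec = TfidfVectorizer().fit(train_inputs + [query])
    train_mat = vec.transform(train_inputs)
    query_vec = vec.transform([query])
    sims = cosine_similarity(query_vec, train_mat)[0]
    top = sims.argsort()[::-1][:k]  # indices of the most similar training examples
    return [(train_inputs[i], train_outputs[i]) for i in top]

# Usage: build the few-shot prompt from the selected (input, output) pairs, then append the query.
train_inputs = ["aspirin treats headache", "metformin treats diabetes",
                "BRCA1 is a tumor suppressor gene"]
train_outputs = ["drug-disease", "drug-disease", "gene-function"]
print(select_k_nearest_demos("ibuprofen treats fever", train_inputs, train_outputs, k=2))
```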