Large Language Models, scientific knowledge and factuality: A systematic
analysis in antibiotic discovery
- URL: http://arxiv.org/abs/2305.17819v2
- Date: Tue, 5 Dec 2023 09:51:55 GMT
- Title: Large Language Models, scientific knowledge and factuality: A systematic
analysis in antibiotic discovery
- Authors: Magdalena Wysocka, Oskar Wysocki, Maxime Delmas, Vincent Mutel, Andre
Freitas
- Abstract summary: This work examines the potential of Large Language Models for dialoguing with biomedical background knowledge.
Ten state-of-the-art models are tested in two prompting-based tasks: chemical compound definition generation and chemical compound-fungus relation determination.
Results show that while recent models have improved in fluency, factual accuracy is still low and models are biased towards over-represented entities.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Inferring over and extracting information from Large Language Models (LLMs)
trained on a large corpus of scientific literature can potentially drive a new
era in biomedical research, reducing the barriers for accessing existing
medical evidence. This work examines the potential of LLMs for dialoguing with
biomedical background knowledge, using the context of antibiotic discovery. The
systematic analysis is applied to ten state-of-the-art models, from models
specialised on biomedical scientific corpora to general models such as ChatGPT,
GPT-4 and Llama 2 in two prompting-based tasks: chemical compound definition
generation and chemical compound-fungus relation determination. The work
provides a systematic assessment on the ability of LLMs to encode and express
these relations, verifying for fluency, prompt-alignment, semantic coherence,
factual knowledge and specificity of generated responses. Results show that
while recent models have improved in fluency, factual accuracy is still low and
models are biased towards over-represented entities. The ability of LLMs to
serve as biomedical knowledge bases is questioned, and the need for additional
systematic evaluation frameworks is highlighted. The best performing GPT-4
produced a factual definition for 70% of chemical compounds and 43.6% factual
relations to fungi, whereas the best open source model BioGPT-large 30% of the
compounds and 30% of the relations for the best-performing prompt. The results
show that while LLMs are currently not fit for purpose to be used as biomedical
factual knowledge bases, there is a promising emerging property in the
direction of factuality as the models become domain specialised, scale-up in
size and level of human feedback.
Related papers
- Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications [59.721265428780946]
Large Language Models (LLMs) in medicine have enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning.<n>This paper provides the first systematic review of this emerging field.<n>We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies and test-time mechanisms.
arXiv Detail & Related papers (2025-08-01T14:41:31Z) - Towards Artificial Intelligence Research Assistant for Expert-Involved Learning [64.7438151207189]
Large Language Models (LLMs) and Large Multi-Modal Models (LMMs) have emerged as transformative tools in scientific research.<n>We present textbfARtificial textbfIntelligence research assistant for textbfExpert-involved textbfLearning (ARIEL)
arXiv Detail & Related papers (2025-05-03T14:21:48Z) - m-KAILIN: Knowledge-Driven Agentic Scientific Corpus Distillation Framework for Biomedical Large Language Models Training [8.238980609871042]
We propose a knowledge-driven, multi-agent framework for scientific corpus distillation tailored for biomedical training.
Our approach is a collaborative multi-agent architecture, where specialized agents, each guided by the Medical Subject Headings (MeSH) hierarchy, work in concert to autonomously extract, synthesize, and self-evaluate high-quality data.
arXiv Detail & Related papers (2025-04-28T08:18:24Z) - Med-CoDE: Medical Critique based Disagreement Evaluation Framework [72.42301910238861]
The reliability and accuracy of large language models (LLMs) in medical contexts remain critical concerns.
Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance.
We propose Med-CoDE, a specifically designed evaluation framework for medical LLMs to address these challenges.
arXiv Detail & Related papers (2025-04-21T16:51:11Z) - Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment [108.55277188617035]
Large language models (LLMs) have been widely adopted in various downstream task domains, but their ability to directly recall and apply factual medical knowledge remains under-explored.
Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities.
We introduce the Medical Knowledge Judgment, a dataset specifically designed to measure LLMs' one-hop factual medical knowledge.
arXiv Detail & Related papers (2025-02-20T05:27:51Z) - Knowledge Hierarchy Guided Biological-Medical Dataset Distillation for Domain LLM Training [10.701353329227722]
We propose a framework that automates the distillation of high-quality textual training data from the extensive scientific literature.
Our approach self-evaluates and generates questions that are more closely aligned with the biomedical domain.
Our approach substantially improves question-answering tasks compared to pre-trained models from the life sciences domain.
arXiv Detail & Related papers (2025-01-25T07:20:44Z) - Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs)
We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets.
Our experimental results reveals current models' limited ability to handle noise and misinformation in the retrieved documents.
arXiv Detail & Related papers (2024-11-14T06:19:18Z) - Reasoning-Enhanced Healthcare Predictions with Knowledge Graph Community Retrieval [61.70489848327436]
KARE is a novel framework that integrates knowledge graph (KG) community-level retrieval with large language models (LLMs) reasoning.
Extensive experiments demonstrate that KARE outperforms leading models by up to 10.8-15.0% on MIMIC-III and 12.6-12.7% on MIMIC-IV for mortality and readmission predictions.
arXiv Detail & Related papers (2024-10-06T18:46:28Z) - Diagnostic Reasoning in Natural Language: Computational Model and Application [68.47402386668846]
We investigate diagnostic abductive reasoning (DAR) in the context of language-grounded tasks (NL-DAR)
We propose a novel modeling framework for NL-DAR based on Pearl's structural causal models.
We use the resulting dataset to investigate the human decision-making process in NL-DAR.
arXiv Detail & Related papers (2024-09-09T06:55:37Z) - LLMs are not Zero-Shot Reasoners for Biomedical Information Extraction [13.965777046473885]
Large Language Models (LLMs) are increasingly adopted for applications in healthcare.
It is unclear how well LLMs perform on tasks that are traditionally pursued in the biomedical domain.
arXiv Detail & Related papers (2024-08-22T09:37:40Z) - Large Language Models as Biomedical Hypothesis Generators: A Comprehensive Evaluation [15.495976478018264]
Large language models (LLMs) have emerged as a promising tool to revolutionize knowledge interaction.
We construct a dataset of background-hypothesis pairs from biomedical literature, partitioned into training, seen, and unseen test sets.
We assess the hypothesis generation capabilities of top-tier instructed models in zero-shot, few-shot, and fine-tuning settings.
arXiv Detail & Related papers (2024-07-12T02:55:13Z) - M-QALM: A Benchmark to Assess Clinical Reading Comprehension and Knowledge Recall in Large Language Models via Question Answering [14.198330378235632]
We use Multiple Choice and Abstractive Question Answering to conduct a large-scale empirical study on 22 datasets in three generalist and three specialist biomedical sub-domains.
Our multifaceted analysis of the performance of 15 LLMs uncovers success factors such as instruction tuning that lead to improved recall and comprehension.
We show that while recently proposed domain-adapted models may lack adequate knowledge, directly fine-tuning on our collected medical knowledge datasets shows encouraging results.
We complement the quantitative results with a skill-oriented manual error analysis, which reveals a significant gap between the models' capabilities to simply recall necessary knowledge and to integrate it with the presented
arXiv Detail & Related papers (2024-06-06T02:43:21Z) - Diversifying Knowledge Enhancement of Biomedical Language Models using
Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping requirements in computing power low.
arXiv Detail & Related papers (2023-12-21T14:26:57Z) - Customizing Large Language Models for Business Context: Framework and Experiments [4.922554372855655]
Large Language Models (LLMs) have ushered in a new era for design science in Information Systems.
We propose and test a novel framework to customize LLMs for general business contexts.
We instantiate our proposed framework in the context of medical consultation.
arXiv Detail & Related papers (2023-12-15T21:42:19Z) - Exploring the Cognitive Knowledge Structure of Large Language Models: An
Educational Diagnostic Assessment Approach [50.125704610228254]
Large Language Models (LLMs) have not only exhibited exceptional performance across various tasks, but also demonstrated sparks of intelligence.
Recent studies have focused on assessing their capabilities on human exams and revealed their impressive competence in different domains.
We conduct an evaluation using MoocRadar, a meticulously annotated human test dataset based on Bloom taxonomy.
arXiv Detail & Related papers (2023-10-12T09:55:45Z) - Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
arXiv Detail & Related papers (2023-05-30T22:05:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.