Related papers: CaresAI at BioCreative IX Track 1 -- LLM for Biomedical QA

CaresAI at BioCreative IX Track 1 -- LLM for Biomedical QA

URL: http://arxiv.org/abs/2509.00806v1
Date: Sun, 31 Aug 2025 11:40:02 GMT
Title: CaresAI at BioCreative IX Track 1 -- LLM for Biomedical QA
Authors: Reem Abdel-Salam, Mary Adewunmi, Modinat A. Abayomi,
Abstract summary: Large language models (LLMs) are increasingly evident for accurate question answering across various domains.<n>This paper presents our approach to the MedHopQA track of the BioCreative IX shared task.<n>Three experimental setups are explored: fine-tuning on combined short and long answers, short answers only, and long answers only.
Score: 3.222047196930981
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are increasingly evident for accurate question answering across various domains. However, rigorous evaluation of their performance on complex question-answering (QA) capabilities is essential before deployment in real-world biomedical and healthcare applications. This paper presents our approach to the MedHopQA track of the BioCreative IX shared task, which focuses on multi-hop biomedical question answering involving diseases, genes, and chemicals. We adopt a supervised fine-tuning strategy leveraging LLaMA 3 8B, enhanced with a curated biomedical question-answer dataset compiled from external sources including BioASQ, MedQuAD, and TREC. Three experimental setups are explored: fine-tuning on combined short and long answers, short answers only, and long answers only. While our models demonstrate strong domain understanding, achieving concept-level accuracy scores of up to 0.8, their Exact Match (EM) scores remain significantly lower, particularly in the test phase. We introduce a two-stage inference pipeline for precise short-answer extraction to mitigate verbosity and improve alignment with evaluation metrics. Despite partial improvements, challenges persist in generating strictly formatted outputs. Our findings highlight the gap between semantic understanding and exact answer evaluation in biomedical LLM applications, motivating further research in output control and post-processing strategies.

Related papers

BioPulse-QA: A Dynamic Biomedical Question-Answering Benchmark for Evaluating Factuality, Robustness, and Bias in Large Language Models [7.8780007697387235]
We introduce BioPulse-QA, a benchmark that evaluates large language models (LLMs) on answering questions from newly published biomedical documents.<n>We evaluate four LLMs - GPT-o1, GPT-o1, Gemini-2.0-Flash, and LLaMA-3.1 8B Instruct - released prior to the publication dates of the benchmark documents.
arXiv Detail & Related papers (2026-01-19T00:38:33Z)
UETQuintet at BioCreative IX - MedHopQA: Enhancing Biomedical QA with Selective Multi-hop Reasoning and Contextual Retrieval [0.0]
We propose a model designed to effectively address both direct and sequential questions.<n>We leverage multi-source information retrieval and in-context learning to provide rich, relevant context for generating answers.<n>Our approach achieves an Exact Match score of 0.84, ranking second on the current leaderboard.
arXiv Detail & Related papers (2026-01-11T16:12:38Z)
BioMedSearch: A Multi-Source Biomedical Retrieval Framework Based on LLMs [8.505934574757587]
We present BioMedSearch, a biomedical information retrieval framework based on large language models (LLMs)<n>The method integrates literature retrieval, protein database and web search access to support accurate and efficient handling of complex biomedical queries.<n>To evaluate the accuracy of question answering, we constructed a multi-level dataset, BioMedMCQs, consisting of 3,000 questions.
arXiv Detail & Related papers (2025-10-15T13:01:31Z)
LLM Ensemble for RAG: Role of Context Length in Zero-Shot Question Answering for BioASQ Challenge [0.03437656066916039]
Large language models (LLMs) can be used for information retrieval.<n> ensembles of zero-shot models can accomplish state-of-the-art performance on a domain-specific Yes/No QA task.
arXiv Detail & Related papers (2025-09-10T13:50:49Z)
Advancing AI Research Assistants with Expert-Involved Learning [84.30323604785646]
Large language models (LLMs) and large multimodal models (LMMs) promise to accelerate biomedical discovery, yet their reliability remains unclear.<n>We introduce ARIEL (AI Research Assistant for Expert-in-the-Loop Learning), an open-source evaluation and optimization framework.<n>We find that state-of-the-art models generate fluent but incomplete summaries, whereas LMMs struggle with detailed visual reasoning.
arXiv Detail & Related papers (2025-05-03T14:21:48Z)
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research [57.61445960384384]
MicroVQA consists of 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities.<n> Benchmarking on state-of-the-art MLLMs reveal a peak performance of 53%.<n>Expert analysis of chain-of-thought responses shows perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors.
arXiv Detail & Related papers (2025-03-17T17:33:10Z)
Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references.<n>We propose a framework encompassing three critical examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey.<n>Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking, etc.
arXiv Detail & Related papers (2025-03-06T18:35:39Z)
Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions.<n>We propose a novel approach utilizing structured medical reasoning.<n>Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
MedBioLM: Optimizing Medical and Biological QA with Fine-Tuned Large Language Models and Retrieval-Augmented Generation [0.0]
We introduce MedBioLM, a domain-adapted biomedical question-answering model.<n>By integrating fine-tuning and retrieval-augmented generation (RAG), MedBioLM dynamically incorporates domain-specific knowledge.<n>Fine-tuning significantly improves accuracy on benchmark datasets, while RAG enhances factual consistency.
arXiv Detail & Related papers (2025-02-05T08:58:35Z)
LLM-MedQA: Enhancing Medical Question Answering through Case Studies in Large Language Models [18.6994780408699]
Large Language Models (LLMs) face significant challenges in medical question answering.<n>We propose a novel approach incorporating similar case generation within a multi-agent medical question-answering system.<n>Our method capitalizes on the model's inherent medical knowledge and reasoning capabilities, eliminating the need for additional training data.
arXiv Detail & Related papers (2024-12-31T19:55:45Z)
NeuroSym-BioCAT: Leveraging Neuro-Symbolic Methods for Biomedical Scholarly Document Categorization and Question Answering [0.14999444543328289]
We introduce a novel approach that integrates an optimized topic modelling framework, OVB-LDA, with the BI-POP CMA-ES optimization technique for enhanced scholarly document abstract categorization. We employ the distilled MiniLM model, fine-tuned on domain-specific data, for high-precision answer extraction.
arXiv Detail & Related papers (2024-10-29T14:45:12Z)
ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab [67.24684071577211]
The challenge of replicating research results has posed a significant impediment to the field of molecular biology. We first curate a comprehensive multimodal dataset, named ProBio, as an initial step towards this objective. Next, we devise two challenging benchmarks, transparent solution tracking and multimodal action recognition, to emphasize the unique characteristics and difficulties associated with activity understanding in BioLab settings.
arXiv Detail & Related papers (2023-11-01T14:44:01Z)
BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks [68.39821375903591]
Generalist AI holds the potential to address limitations due to its versatility in interpreting different data types. Here, we propose BiomedGPT, the first open-source and lightweight vision-language foundation model.
arXiv Detail & Related papers (2023-05-26T17:14:43Z)
Rewire-then-Probe: A Contrastive Recipe for Probing Biomedical Knowledge of Pre-trained Language Models [16.535312449449165]
We release a well-curated biomedical knowledge probing benchmark, MedLAMA, based on the Unified Medical Language System (UMLS) Metathesaurus. We test a wide spectrum of state-of-the-art PLMs and probing approaches on our benchmark, reaching at most 3% of acc@10. We propose Contrastive-Probe, a novel self-supervised contrastive probing approach, that adjusts the underlying PLMs without using any probing data.
arXiv Detail & Related papers (2021-10-15T16:00:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.