Efficient and Reproducible Biomedical Question Answering using Retrieval Augmented Generation
- URL: http://arxiv.org/abs/2505.07917v1
- Date: Mon, 12 May 2025 14:51:47 GMT
- Title: Efficient and Reproducible Biomedical Question Answering using Retrieval Augmented Generation
- Authors: Linus Stuhlmann, Michael Alexander Saxer, Jonathan Fürst
- Abstract summary: This study systematically examines a Retrieval-Augmented Generation (RAG) system for biomedical QA. We first assess state-of-the-art retrieval methods, including BM25, BioBERT, MedCPT, and a hybrid approach. We deploy the final RAG system on the full 24M PubMed corpus, comparing different retrievers' impact on overall performance.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Biomedical question-answering (QA) systems require effective retrieval and generation components to ensure accuracy, efficiency, and scalability. This study systematically examines a Retrieval-Augmented Generation (RAG) system for biomedical QA, evaluating retrieval strategies and response time trade-offs. We first assess state-of-the-art retrieval methods, including BM25, BioBERT, MedCPT, and a hybrid approach, alongside common data stores such as Elasticsearch, MongoDB, and FAISS, on a ~10% subset of PubMed (2.4M documents) to measure indexing efficiency, retrieval latency, and retriever performance in the end-to-end RAG system. Based on these insights, we deploy the final RAG system on the full 24M PubMed corpus, comparing different retrievers' impact on overall performance. Evaluations of the retrieval depth show that retrieving 50 documents with BM25 before reranking with MedCPT optimally balances accuracy (0.90), recall (0.90), and response time (1.91s). BM25 retrieval time remains stable (82ms), while MedCPT incurs the main computational cost. These results highlight previously not well-known trade-offs in retrieval depth, efficiency, and scalability for biomedical QA. With open-source code, the system is fully reproducible and extensible.
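The abstract's central result is a two-stage pipeline: a fast BM25 pass shortlists 50 candidate documents, and a MedCPT cross-encoder reranks that shortlist. Below is a minimal Python sketch of that retrieve-then-rerank pattern. The BM25 scoring is the textbook Okapi formulation, and `overlap_rerank` is a hypothetical stand-in for the MedCPT cross-encoder (the paper's actual reranker is a neural model run over PubMed abstracts); names, corpus, and parameters here are illustrative assumptions, not the authors' code.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the query with classic Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()  # document frequency per term
    for d in docs:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def retrieve_then_rerank(query, docs, rerank_fn, depth=50, top_k=5):
    """Stage 1: BM25 shortlist of `depth` docs. Stage 2: rerank the shortlist."""
    q = query.lower().split()
    toks = [d.lower().split() for d in docs]
    scores = bm25_scores(q, toks)
    shortlist = sorted(range(len(docs)),
                       key=lambda i: scores[i], reverse=True)[:depth]
    reranked = sorted(shortlist,
                      key=lambda i: rerank_fn(query, docs[i]), reverse=True)
    return [docs[i] for i in reranked[:top_k]]

# Toy corpus; the overlap scorer stands in for the MedCPT cross-encoder.
corpus = [
    "aspirin reduces fever",
    "insulin regulates glucose",
    "bm25 is a ranking function",
]

def overlap_rerank(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))

top = retrieve_then_rerank("insulin glucose", corpus, overlap_rerank,
                           depth=3, top_k=1)
```

The key tunable is `depth`: the paper's finding is that 50 BM25 candidates is the sweet spot, since BM25 retrieval is cheap (~82ms) while the reranker dominates latency, so deeper shortlists buy little accuracy at real response-time cost.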
Related papers
- SAGE: Benchmarking and Improving Retrieval for Deep Research Agents [60.53966065867568]
We introduce SAGE, a benchmark for scientific literature retrieval comprising 1,200 queries across four scientific domains, with a 200,000-paper retrieval corpus. We evaluate six deep research agents and find that all systems struggle with reasoning-intensive retrieval. BM25 significantly outperforms LLM-based retrievers by approximately 30%, as existing agents generate keyword-oriented sub-queries.
arXiv Detail & Related papers (2026-02-05T18:25:24Z) - From Evidence-Based Medicine to Knowledge Graph: Retrieval-Augmented Generation for Sports Rehabilitation and a Domain Benchmark [12.595335483488052]
In medicine, large language models increasingly rely on retrieval-augmented generation (RAG) to ground outputs in up-to-date external evidence. This study addresses two key gaps: (1) the lack of PICO alignment between queries and retrieved evidence, and (2) the absence of evidence hierarchy considerations during reranking. We present a generalizable strategy for adapting EBM to graph-based RAG, integrating the PICO framework into knowledge graph construction and retrieval.
arXiv Detail & Related papers (2026-01-01T05:20:54Z) - MedBioRAG: Semantic Search and Retrieval-Augmented Generation with Large Language Models for Medical and Biological QA [0.0]
MedBioRAG is a retrieval-augmented model designed to improve biomedical question-answering (QA) performance. We evaluate MedBioRAG across text retrieval, close-ended QA, and long-form QA tasks using benchmark datasets.
arXiv Detail & Related papers (2025-12-10T15:43:25Z) - ModernBERT + ColBERT: Enhancing biomedical RAG through an advanced re-ranking retriever [0.5371337604556311]
We develop a lightweight ModernBERT bidirectional encoder for efficient initial candidate retrieval with a ColBERTv2 late-interaction model for fine-grained re-ranking. Our analysis of the retriever module confirmed the positive impact of the ColBERT re-ranker, which improved Recall@3 by up to 4.2 percentage points. Our ablation studies reveal that this performance is critically dependent on a joint fine-tuning process that aligns the retriever and re-ranker.
arXiv Detail & Related papers (2025-10-06T12:34:55Z) - BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent [74.10138164281618]
BrowseComp-Plus is a benchmark derived from BrowseComp, employing a fixed, carefully curated corpus. This benchmark allows comprehensive evaluation and disentangled analysis of deep research agents and retrieval methods.
arXiv Detail & Related papers (2025-08-08T17:55:11Z) - MedGemma Technical Report [75.88152277443179]
We introduce MedGemma, a collection of medical vision-language foundation models based on Gemma 3 4B and 27B. MedGemma demonstrates advanced medical understanding and reasoning on images and text. We additionally introduce MedSigLIP, a medically-tuned vision encoder derived from SigLIP.
arXiv Detail & Related papers (2025-07-07T17:01:44Z) - Scalable Unit Harmonization in Medical Informatics via Bayesian-Optimized Retrieval and Transformer-Based Re-ranking [0.14504054468850663]
We develop a scalable methodology for harmonizing inconsistent units in large-scale clinical datasets. We implement a multi-stage pipeline: filtering, identification, harmonization proposal generation, automated re-ranking, and manual validation. The system achieved 83.39% precision at rank 1 and 94.66% recall at rank 5.
arXiv Detail & Related papers (2025-05-01T19:09:15Z) - DAT: Dynamic Alpha Tuning for Hybrid Retrieval in Retrieval-Augmented Generation [0.0]
DAT (Dynamic Alpha Tuning) is a novel hybrid retrieval framework that balances dense retrieval and BM25 for each query. It consistently outperforms fixed-weighting hybrid retrieval methods across various evaluation metrics. Even on smaller models, DAT delivers strong performance, highlighting its efficiency and adaptability.
arXiv Detail & Related papers (2025-03-29T08:35:01Z) - Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs).
We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets.
Our experimental results reveal current models' limited ability to handle noise and misinformation in the retrieved documents.
arXiv Detail & Related papers (2024-11-14T06:19:18Z) - Improving Bias Correction Standards by Quantifying its Effects on Treatment Outcomes [54.18828236350544]
Propensity score matching (PSM) addresses selection biases by selecting comparable populations for analysis.
Different matching methods can produce significantly different Average Treatment Effects (ATE) for the same task, even when meeting all validation criteria.
To address this issue, we introduce a novel metric, A2A, to reduce the number of valid matches.
arXiv Detail & Related papers (2024-07-20T12:42:24Z) - Large language models are good medical coders, if provided with tools [0.0]
This study presents a novel two-stage Retrieve-Rank system for automated ICD-10-CM medical coding.
We evaluate both systems on a dataset of 100 single-term medical conditions.
The Retrieve-Rank system achieved 100% accuracy in predicting correct ICD-10-CM codes.
arXiv Detail & Related papers (2024-07-06T06:58:51Z) - SeRTS: Self-Rewarding Tree Search for Biomedical Retrieval-Augmented Generation [50.26966969163348]
Large Language Models (LLMs) have shown great potential in the biomedical domain with the advancement of retrieval-augmented generation (RAG).
Existing retrieval-augmented approaches face challenges in addressing diverse queries and documents, particularly for medical knowledge queries.
We propose Self-Rewarding Tree Search (SeRTS) based on Monte Carlo Tree Search (MCTS) and a self-rewarding paradigm.
arXiv Detail & Related papers (2024-06-17T06:48:31Z) - BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers [48.21255861863282]
BMRetriever is a series of dense retrievers for enhancing biomedical retrieval.
BMRetriever exhibits strong parameter efficiency, with the 410M variant outperforming baselines up to 11.7 times larger.
arXiv Detail & Related papers (2024-04-29T05:40:08Z) - Capabilities of Gemini Models in Medicine [100.60391771032887]
We introduce Med-Gemini, a family of highly capable multimodal models specialized in medicine.
We evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art (SoTA) performance on 10 of them.
Our results offer compelling evidence for Med-Gemini's potential, although further rigorous evaluation will be crucial before real-world deployment.
arXiv Detail & Related papers (2024-04-29T04:11:28Z)