Fugu-MT Paper Translation (Abstract): Benchmarking Retrieval Strategies for Biomedical Retrieval-Augmented Generation: A Controlled Empirical Study

Paper overview: Benchmarking Retrieval Strategies for Biomedical Retrieval-Augmented Generation: A Controlled Empirical Study

arxiv url: http://arxiv.org/abs/2605.02520v1
Date: Mon, 04 May 2026 12:21:46 GMT
Status: Translation complete
Last updated in system: 2026-05-05 20:33:50.278369
Title: Benchmarking Retrieval Strategies for Biomedical Retrieval-Augmented Generation: A Controlled Empirical Study
Authors: Devi Prasad Bal, Subhashree Puhan
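Abstract summary: This paper presents a systematic comparison of five retrieval strategies in a biomedical question-answering RAG pipeline. All strategies share a fixed generation model (GPT-4o-mini), a common vector store (ChromaDB), and OpenAI's text-embedding-3-small embeddings. Evaluation is conducted on 250 question-answer pairs drawn from a preprocessed subset of the BioASQ benchmark.
Reference score (attention score computed independently by the site): 0.0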
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Retrieval-Augmented Generation (RAG) offers a well-established path to grounding large language model (LLM) outputs in external knowledge, yet the question of which retrieval strategy works best in a high-stakes domain such as biomedicine has not received the controlled, multi-metric treatment it deserves. This paper presents a systematic empirical comparison of five retrieval strategies -- Dense Vector Search, Hybrid BM25 + Dense retrieval, Cross-Encoder Reranking, Multi-Query Expansion, and Maximal Marginal Relevance (MMR) -- within a biomedical question-answering RAG pipeline. All strategies share a fixed generation model (GPT-4o-mini), a common vector store (ChromaDB), and OpenAI's text-embedding-3-small embeddings, ensuring that observed differences are attributable to retrieval alone. Evaluation is conducted on 250 question-answer pairs drawn from a preprocessed subset of the BioASQ benchmark (rag-mini-bioasq) using four DeepEval metrics: contextual precision, contextual recall, faithfulness, and answer relevancy, each reported with 95% confidence intervals. A no-context ablation is included as a lower bound. Cross-Encoder Reranking achieves the best composite score (0.827) and highest contextual precision (0.852), confirming that query-document interaction yields measurable retrieval gains. Multi-Query Expansion, despite its recall-oriented design, produces the weakest contextual precision (0.671), suggesting naive query diversification introduces retrieval noise. MMR sacrifices answer relevancy for diversity, while the Dense baseline (composite 0.822) falls within 0.005 points of the top strategy. All RAG conditions dramatically outperform the no-context ablation on answer relevancy (0.658-0.701 vs. 0.287), confirming the practical value of retrieval. The full pipeline, hyperparameters, and evaluation code are publicly available.
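The abstract names Maximal Marginal Relevance (MMR) as one of the five strategies but does not describe how it is implemented. As a rough illustration of the relevance-versus-diversity trade-off it mentions, the following is a minimal sketch of greedy MMR selection. It assumes cosine similarity over embedding vectors; the function names `cosine` and `mmr_select`, the weight `lam`, and the toy vectors are all hypothetical and are not taken from the paper's code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k, lam=0.5):
    """Return indices of up to k documents chosen greedily by MMR.

    lam=1.0 ranks purely by similarity to the query; lower values
    penalise candidates that resemble documents already selected.
    (lam is an illustrative parameter, not a value from the paper.)
    """
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:

        def score(i):
            # Redundancy is the highest similarity to any document picked so far.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy 2-D embeddings (hypothetical): documents 0 and 1 are near-duplicates.
query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.99, 0.01], [0.6, 0.8]]

top_relevance_only = mmr_select(query, docs, k=2, lam=1.0)
top_diverse = mmr_select(query, docs, k=2, lam=0.3)
```

With `lam=1.0` the two near-duplicate documents are both returned, while a lower `lam` swaps the second one for a less similar but less redundant document. This is one way to picture how a diversity-oriented retriever could pass less query-relevant context to the generator, which is consistent with, though not proof of, the abstract's observation that MMR trades answer relevancy for diversity.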


Related papers