Fugu-MT Paper Translation (Summary): Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA

Paper summary: Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA

arxiv url: http://arxiv.org/abs/2510.13856v1
Date: Sun, 12 Oct 2025 07:03:58 GMT
Status: Translation complete
Last updated in system: 2025-10-17 21:15:14.481482
Title: Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA
Title (reference translation): Multimodal Retrieval-Augmented Generation Using Large Language Models for Medical VQA
Authors: A H M Rezaul Karim, Ozlem Uzuner
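Abstract summary: Medical Visual Question Answering (MedVQA) enables natural language queries over medical images to support clinical decision-making and patient care. This paper presents the MasonNLP system, which uses a general-domain, instruction-tuned large language model with a retrieval-augmented generation (RAG) framework. The best-performing system ranked 3rd among 19 teams and 51 submissions, with an average score of 41.37%.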
Reference score (independently computed attention score): 0.6015898117103068
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Medical Visual Question Answering (MedVQA) enables natural language queries over medical images to support clinical decision-making and patient care. The MEDIQA-WV 2025 shared task addressed wound-care VQA, requiring systems to generate free-text responses and structured wound attributes from images and patient queries. We present the MasonNLP system, which employs a general-domain, instruction-tuned large language model with a retrieval-augmented generation (RAG) framework that incorporates textual and visual examples from in-domain data. This approach grounds outputs in clinically relevant exemplars, improving reasoning, schema adherence, and response quality across dBLEU, ROUGE, BERTScore, and LLM-based metrics. Our best-performing system ranked 3rd among 19 teams and 51 submissions with an average score of 41.37%, demonstrating that lightweight RAG with general-purpose LLMs -- a minimal inference-time layer that adds a few relevant exemplars via simple indexing and fusion, with no extra training or complex re-ranking -- provides a simple and effective baseline for multimodal clinical NLP tasks.
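Abstract (reference translation): Medical Visual Question Answering (MedVQA) supports clinical decision-making and patient care by enabling natural language queries over medical images. The MEDIQA-WV 2025 shared task addressed wound-care VQA, requiring systems to generate free-text responses and structured wound attributes from images and patient queries. This paper presents the MasonNLP system, which uses a general-domain, instruction-tuned large language model with a retrieval-augmented generation (RAG) framework that incorporates textual and visual examples from in-domain data. The approach grounds outputs in clinically relevant exemplars, improving reasoning, schema adherence, and response quality across dBLEU, ROUGE, BERTScore, and LLM-based metrics. The best-performing system ranked 3rd among 19 teams and 51 submissions with an average score of 41.37%. This result shows that lightweight RAG with general-purpose LLMs (a minimal inference-time layer that adds a few relevant exemplars via simple indexing and fusion, with no extra training or complex re-ranking) provides a simple and effective baseline for multimodal clinical NLP tasks.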
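
The abstract describes the RAG layer only at a high level: index in-domain examples, retrieve a few relevant ones, and fuse them into the model input at inference time. The following is a minimal sketch of that general idea, not the MasonNLP implementation. Everything in it is an assumption made for illustration: the bag-of-words text similarity, the precomputed image vectors, the weighted-sum fusion of the two scores, the names `retrieve` and `build_prompt`, and the toy index with placeholder responses.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a stand-in for a real text encoder)."""
    return text.lower().split()


def cosine_sparse(a, b):
    """Cosine similarity between two token-count Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cosine_dense(u, v):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query_text, query_image_vec, index, k=2, text_weight=0.5):
    """Return the k exemplars with the highest fused text + image similarity."""
    query_counts = Counter(tokenize(query_text))
    scored = []
    for exemplar in index:
        text_sim = cosine_sparse(query_counts, Counter(tokenize(exemplar["query"])))
        image_sim = cosine_dense(query_image_vec, exemplar["image_vec"])
        # Assumed fusion rule: a simple weighted sum of the two similarities.
        fused = text_weight * text_sim + (1 - text_weight) * image_sim
        scored.append((fused, exemplar))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [exemplar for _, exemplar in scored[:k]]


def build_prompt(query_text, exemplars):
    """Prepend retrieved exemplars to the new query as few-shot examples."""
    parts = ["Answer the patient's wound-care question. Examples:"]
    for exemplar in exemplars:
        parts.append(f"Q: {exemplar['query']}\nA: {exemplar['response']}")
    parts.append(f"Q: {query_text}\nA:")
    return "\n\n".join(parts)


# Toy index: invented placeholder entries, not data from the shared task.
INDEX = [
    {"query": "is my surgical wound infected",
     "response": "placeholder response 1", "image_vec": [1.0, 0.0]},
    {"query": "how should I clean a burn on my hand",
     "response": "placeholder response 2", "image_vec": [0.0, 1.0]},
    {"query": "my stitches look red and swollen",
     "response": "placeholder response 3", "image_vec": [0.9, 0.1]},
]

question = "is the wound around my stitches infected"
top = retrieve(question, [1.0, 0.0], INDEX, k=2)
prompt = build_prompt(question, top)
print(prompt)
```

In a real system the token counts and two-dimensional vectors would be replaced by learned text and image embeddings, and `prompt` would be passed to an instruction-tuned model together with the query image. The sketch only shows why no extra training is needed: retrieval and fusion happen entirely at inference time.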

Related papers