Fugu-MT paper translation (abstract): Rethinking Patient Education as Multi-turn Multi-modal Interaction

Paper overview: Rethinking Patient Education as Multi-turn Multi-modal Interaction

arxiv url: http://arxiv.org/abs/2604.14656v1
Date: Thu, 16 Apr 2026 06:06:50 GMT
Status: Translation complete
Last updated in system: 2026-04-17 21:29:31.751581
Title: Rethinking Patient Education as Multi-turn Multi-modal Interaction
Title (reference translation): Rethinking Patient Education as Multi-turn, Multi-modal Interaction
Authors: Zonghai Yao, Zhipeng Tang, Chengtao Lin, Xiong Luo, Benlu Wang, Juncheng Huang, Chin Siang Ong, Hong Yu
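Abstract summary: MedImageEdu is a benchmark for multi-turn, evidence-grounded radiology patient education. A DoctorAgent interacts with a PatientAgent that is conditioned on a hidden profile capturing factors such as education level, health literacy, and personality. When a patient question would benefit from visual support, the DoctorAgent can issue drawing instructions, grounded in the report, the case images, and the current question, to a benchmark-provided drawing tool. The tool returns image(s), after which the DoctorAgent produces a final multimodal response consisting of the image(s) and a grounded plain-language explanation.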
Reference score (independently computed attention score): 8.98413612284677
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Most medical multimodal benchmarks focus on static tasks such as image question answering, report generation, and plain-language rewriting. Patient education is more demanding: systems must identify relevant evidence across images, show patients where to look, explain findings in accessible language, and handle confusion or distress. Yet most patient education work remains text-only, even though combined image-and-text explanations may better support understanding. We introduce MedImageEdu, a benchmark for multi-turn, evidence-grounded radiology patient education. Each case provides a radiology report with report text and case images. A DoctorAgent interacts with a PatientAgent, conditioned on a hidden profile that captures factors such as education level, health literacy, and personality. When a patient question would benefit from visual support, the DoctorAgent can issue drawing instructions grounded in the report, case images, and the current question to a benchmark-provided drawing tool. The tool returns image(s), after which the DoctorAgent produces a final multimodal response consisting of the image(s) and a grounded plain-language explanation. MedImageEdu contains 150 cases from three sources and evaluates both the consultation process and the final multimodal response along five dimensions: Consultation, Safety and Scope, Language Quality, Drawing Quality, and Image-Text Response Quality. Across representative open- and closed-source vision-language model agents, we find three consistent gaps: fluent language often outpaces faithful visual grounding, safety is the weakest dimension across disease categories, and emotionally tense interactions are harder than low education or low health literacy. MedImageEdu provides a controlled testbed for assessing whether multimodal agents can teach from evidence rather than merely answer from text.
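Abstract (reference translation): Most medical multimodal benchmarks focus on static tasks such as image question answering, report generation, and plain-language rewriting. Patient education is more demanding: a system must identify the relevant evidence across images, show the patient where to look, explain the findings in accessible language, and handle confusion or distress. Even so, most patient education work remains text-only, although explanations that combine images and text may better support understanding. MedImageEdu is a benchmark for multi-turn, evidence-grounded radiology patient education. Each case provides a radiology report with report text and case images. A DoctorAgent interacts with a PatientAgent that is conditioned on a hidden profile capturing factors such as education level, health literacy, and personality. When a patient question would benefit from visual support, the DoctorAgent can issue drawing instructions, grounded in the report, the case images, and the current question, to a benchmark-provided drawing tool. The tool returns image(s), after which the DoctorAgent produces a final multimodal response consisting of the image(s) and a grounded plain-language explanation. MedImageEdu contains 150 cases from three sources and evaluates both the consultation process and the final multimodal response along five dimensions: Consultation, Safety and Scope, Language Quality, Drawing Quality, and Image-Text Response Quality. Across representative open- and closed-source vision-language model agents, three consistent gaps appear: fluent language often outpaces faithful visual grounding, safety is the weakest dimension across disease categories, and emotionally tense interactions are harder than those involving low education or low health literacy. MedImageEdu provides a controlled testbed for assessing whether multimodal agents can teach from evidence rather than merely answer from text.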
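
The interaction protocol described in the abstract can be illustrated with a minimal Python sketch. This is not the benchmark's actual code: every name here (Case, PatientProfile, drawing_tool, needs_visual, patient_questions, run_consultation) is hypothetical, the "drawing tool" only returns labelled strings instead of images, and the keyword rule that decides when to draw is a toy assumption standing in for a vision-language model's judgment.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PatientProfile:
    # Hidden profile: conditions the patient side only (hypothetical fields
    # named after the factors listed in the abstract).
    education_level: str
    health_literacy: str
    personality: str


@dataclass
class Case:
    report_text: str
    image_ids: List[str]


@dataclass
class Response:
    explanation: str
    images: List[str] = field(default_factory=list)


def drawing_tool(instruction: str, image_ids: List[str]) -> List[str]:
    # Stand-in for the benchmark-provided drawing tool (assumption):
    # returns one labelled string per input image instead of real pixels.
    return [f"{img}:annotated[{instruction}]" for img in image_ids]


def needs_visual(question: str) -> bool:
    # Toy heuristic (assumption): only "where" questions get a drawing.
    return "where" in question.lower()


def patient_questions(profile: PatientProfile) -> List[str]:
    # Hypothetical PatientAgent: the hidden profile shapes what is asked.
    questions = [
        "What does my report mean?",
        "Where is the finding on my scan?",
    ]
    if profile.personality == "anxious":
        questions.append("Should I be worried?")
    return questions


def doctor_turn(case: Case, question: str) -> Response:
    # Hypothetical DoctorAgent: sees the case and the question, never the profile.
    images: List[str] = []
    if needs_visual(question):
        instruction = f"highlight the region relevant to: {question}"
        images = drawing_tool(instruction, case.image_ids)
    explanation = f"In plain language, the report says: {case.report_text}"
    return Response(explanation=explanation, images=images)


def run_consultation(case: Case, profile: PatientProfile) -> List[Tuple[str, Response]]:
    # Multi-turn loop: one doctor response per patient question.
    return [(q, doctor_turn(case, q)) for q in patient_questions(profile)]


if __name__ == "__main__":
    case = Case("Small nodule in the right upper lobe.", ["img0"])
    profile = PatientProfile("high school", "low", "anxious")
    for question, response in run_consultation(case, profile):
        print(question, "->", len(response.images), "image(s)")
```

The sketch keeps the profile out of doctor_turn on purpose, mirroring the abstract's statement that the profile is hidden; a real evaluation would then score the transcript along the five dimensions the abstract lists.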
