Fugu-MT Paper Translation (Summary): Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications

arxiv url: http://arxiv.org/abs/2604.19281v1
Date: Tue, 21 Apr 2026 09:50:08 GMT
Status: translation complete
Last updated in system: 2026-04-22 22:41:49.709079
Title: Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications
Authors: Abu Noman Md Sakib, Md. Main Oddin Chisty, Zijie Zhang
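Abstract summary: This paper proposes a new evaluation framework for medical question answering called VB-Score. We rigorously review the performance of three well-known large language models. The results point to concerning performance disparities across public health topics.
Reference score (attention level computed independently by this site): 0.957154155094766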
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The use of Large Language Models (LLMs) to support patients in addressing medical questions is becoming increasingly prevalent. However, most of the measures currently used to evaluate the performance of these models in this context only measure how closely a model's answers match semantically, and therefore do not provide a true indication of the model's medical accuracy or of the health equity risks associated with it. To address these shortcomings, we present a new evaluation framework for medical question answering called VB-Score (Verification-Based Score) that provides a separate evaluation of the four components of entity recognition, semantic similarity, factual consistency, and structured information completeness for medical question-answering models. We perform rigorous reviews of the performance of three well-known and widely used LLMs on 48 public health-related topics taken from high-quality, authoritative information sources. Based on our analyses, we discover a major discrepancy between the models' semantic and entity accuracy. Our assessments of the performance of all three models show that each of them has almost uniformly severe performance failures when evaluated against our criteria. Our findings indicate alarming performance disparities across various public health topics, with most of the models exhibiting 13.8% lower performance (compared to an overall average) for all the public health topics that relate to chronic conditions that occur in older and minority populations, which indicates the existence of what's known as condition-based algorithmic discrimination. Our findings also demonstrate that prompt engineering alone does not compensate for basic architectural limitations on how these models perform in extracting medical entities and raise the question of whether semantic evaluation alone is a sufficient measure of medical AI safety.
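The abstract describes scoring four components separately and reporting a subgroup's shortfall relative to the overall average. The following is only a minimal sketch of that idea, not the authors' implementation: the class `ComponentScores`, the functions `entity_f1` and `relative_gap`, the choice of set-overlap F1 as the entity metric, and all numbers are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ComponentScores:
    """Four per-answer scores in [0, 1], reported separately (hypothetical container)."""
    entity: float
    semantic: float
    factual: float
    completeness: float


def entity_f1(reference: set[str], predicted: set[str]) -> float:
    """Set-overlap F1 between reference and predicted entities (an assumed metric)."""
    overlap = len(reference & predicted)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def relative_gap(subgroup: list[float], overall: list[float]) -> float:
    """Relative difference of the subgroup mean from the overall mean."""
    overall_mean = mean(overall)
    return (mean(subgroup) - overall_mean) / overall_mean


if __name__ == "__main__":
    # Made-up example: the answer mentions only one of two reference entities.
    scores = ComponentScores(
        entity=entity_f1({"metformin", "type 2 diabetes"}, {"metformin"}),
        semantic=0.9,       # placeholder value, not measured
        factual=0.8,        # placeholder value, not measured
        completeness=0.5,   # placeholder value, not measured
    )
    print(scores)
    # Made-up per-topic scores: a subgroup compared with the overall set.
    print(relative_gap([0.4, 0.5], [0.4, 0.5, 0.6, 0.5]))
```

Keeping the four values in separate fields, rather than averaging them, is what lets a high semantic score sit next to a low entity score instead of hiding it.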

Related papers

Towards Reliable Medical LLMs: Benchmarking and Enhancing Confidence Estimation of Large Language Models in Medical Consultation [97.36081721024728]
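This paper proposes the first benchmark for assessing the confidence of multi-turn interactions in realistic medical consultations. The benchmark integrates three types of medical data for diagnosis. The paper also introduces MedConf, an evidence-grounded linguistic self-assessment framework.
Paper reference translation (metadata) (2026-01-22T04:51:39Z)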
TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models [54.48710348910535]
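Existing medical reasoning benchmarks focus mainly on analyzing a patient's condition from images taken at a single visit. We introduce TemMed-Bench, the first benchmark for analyzing changes in a patient's condition across clinical visits.
Paper reference translation (metadata) (2025-09-29T17:51:26Z)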
Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models [57.73472878679636]
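Med-RewardBench is the first benchmark designed specifically to evaluate medical reward models and judges. It features a multimodal dataset spanning 13 organ systems and 8 clinical departments. A rigorous three-step process ensures high-quality evaluation data across six clinically critical dimensions.
Paper reference translation (metadata) (2025-08-29T08:58:39Z)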
Towards Domain Specification of Embedding Models in Medicine [1.0713888959520208]
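We propose a comprehensive benchmark suite of 51 tasks covering classification, clustering, pair classification, and retrieval, built on the Massive Text Embedding Benchmark (MTEB). Our results show that this approach establishes a robust evaluation framework and yields improved embedding performance, consistently outperforming state-of-the-art alternatives across tasks.
Paper reference translation (metadata) (2025-07-25T16:15:00Z)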
How Far Have Medical Vision-Language Models Come? A Comprehensive Benchmarking Study [16.84832179579428]
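Vision-Language Models (VLMs), trained on web-scale corpora, excel at natural-image tasks and are increasingly used in healthcare. This paper presents a comprehensive evaluation of open-source general-purpose and medically specialized VLMs on eight benchmarks. First, large general-purpose models already match or surpass medical-specific models on several benchmarks. Second, reasoning performance is consistently lower than understanding performance, highlighting a critical barrier to safe decision support.
Paper reference translation (metadata) (2025-07-15T11:12:39Z)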
Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
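Existing methods aim to improve the performance of medical vision-language models (MedVLMs) by adjusting model structure, fine-tuning on high-quality data, or preference fine-tuning. We propose Expert-Controlled Classifier-Free Guidance (Expert-CFG), an expert-in-the-loop framework for aligning MedVLMs with clinical expertise.
Paper reference translation (metadata) (2025-07-12T09:03:30Z)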
AutoMedEval: Harnessing Language Models for Automatic Medical Capability Evaluation [55.2739790399209]
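This paper proposes AutoMedEval, an open-source automatic evaluation model with 13B parameters, for measuring the question-answering ability of medical LLMs. The overarching objective of AutoMedEval is to assess the quality of responses produced by diverse models while significantly reducing reliance on human evaluation.
Paper reference translation (metadata) (2025-05-17T07:44:54Z)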
MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning [34.93995619867384]
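Large Language Models (LLMs) have shown impressive performance on existing medical question-answering benchmarks. MedAgentsBench is a benchmark focused on medical questions that require multi-step clinical reasoning, diagnosis formulation, and treatment planning.
Paper reference translation (metadata) (2025-03-10T15:38:44Z)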
Uncertainty Estimation of Large Language Models in Medical Question Answering [60.72223137560633]
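Large language models (LLMs) show promise for natural language generation in healthcare but risk hallucinating factually incorrect information. We benchmark popular uncertainty estimation (UE) methods across different model sizes on medical question-answering datasets. Our results on current approaches in this domain highlight the challenges of UE in medical applications.
Paper reference translation (metadata) (2024-07-11T16:51:33Z)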

The related papers list is generated automatically from the titles and abstracts of papers on this site.