Fugu-MT Paper Translation (Summary): MedMT-Bench: Can LLMs Memorize and Understand Long Multi-Turn Conversations in Medical Scenarios?

Paper summary: MedMT-Bench: Can LLMs Memorize and Understand Long Multi-Turn Conversations in Medical Scenarios?

arxiv url: http://arxiv.org/abs/2603.23519v1
Date: Fri, 06 Mar 2026 12:17:14 GMT
Status: Translation complete
System update date: 2026-04-06 02:36:13.005006
Title: MedMT-Bench: Can LLMs Memorize and Understand Long Multi-Turn Conversations in Medical Scenarios?
Title (reference translation): MedMT-Bench: Can LLMs memorize and understand long multi-turn conversations in medical scenarios?
Authors: Lin Yang, Yuancheng Yang, Xu Wang, Changkun Liu, Haihua Yang,
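Abstract summary: We introduce MedMT-Bench, a medical multi-turn instruction-following benchmark. The benchmark is constructed via scene-by-scene data synthesis refined by manual expert editing. Each test case averages 22 rounds (maximum of 52 rounds) and covers 5 types of difficult instruction-following issues.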
Reference score (independently computed attention score): 9.531847251088488
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities across various specialist domains and have been integrated into high-stakes areas such as medicine. However, existing medical benchmarks rarely stress-test the long-context memory, interference robustness, and safety defense required in practice. To bridge this gap, we introduce MedMT-Bench, a challenging medical multi-turn instruction-following benchmark that simulates the entire diagnosis and treatment process. We construct the benchmark via scene-by-scene data synthesis refined by manual expert editing, yielding 400 test cases that are highly consistent with real-world application scenarios. Each test case has an average of 22 rounds (maximum of 52 rounds) and covers 5 types of difficult instruction-following issues. For evaluation, we propose an LLM-as-judge protocol with instance-level rubrics and atomic test points, validated against expert annotations with a human-LLM agreement of 91.94%. We test 17 frontier models, all of which underperform on MedMT-Bench (overall accuracy below 60.00%), with the best model reaching 59.75%. MedMT-Bench can be an essential tool for driving future research towards safer and more reliable medical AI. The benchmark is available at https://openreview.net/attachment?id=aKyBCsPOHB&name=supplementary_material
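Abstract (reference translation): Large language models (LLMs) have shown impressive capabilities across a range of specialist domains and have been integrated into high-stakes areas such as medicine. However, existing medical benchmarks rarely stress-test the long-context memory, interference robustness, and safety defense required in practice. To bridge this gap, we introduce MedMT-Bench, a medical multi-turn instruction-following benchmark that simulates the entire diagnosis and treatment process. We construct the benchmark via scene-by-scene data synthesis refined by manual expert editing, producing 400 test cases that are highly consistent with real-world application scenarios. Each test case averages 22 rounds (maximum of 52 rounds) and covers 5 types of difficult instruction-following issues. For evaluation, we propose an LLM-as-judge protocol with instance-level rubrics and atomic test points. We test 17 frontier models, all of which underperform on MedMT-Bench (overall accuracy below 60.00%), with the best model reaching 59.75%. MedMT-Bench can be an important tool for driving future research towards safer and more reliable medical AI. The benchmark is available at https://openreview.net/attachment?id=aKyBCsPOHB&name=supplementary_material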
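
The abstract mentions instance-level rubrics made of atomic test points, an overall accuracy, and a human-LLM agreement rate, but it does not say how these are aggregated. The following is only a minimal sketch of one plausible reading, under two assumptions that are not taken from the paper: a test case counts as correct only when the LLM judge passes every atomic test point, and agreement is the fraction of test points on which the judge and the expert give the same verdict. All names (`AtomicPoint`, `case_passes`, `accuracy`, `agreement`) and the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AtomicPoint:
    """One atomic test point of a rubric (hypothetical structure)."""
    description: str
    judge_pass: bool   # verdict from the LLM judge
    expert_pass: bool  # verdict from a human expert annotator


def case_passes(points: List[AtomicPoint]) -> bool:
    # Assumption: a case is correct only if the judge passes every point.
    return all(p.judge_pass for p in points)


def accuracy(cases: List[List[AtomicPoint]]) -> float:
    # Fraction of test cases that pass under the rule above.
    return sum(case_passes(c) for c in cases) / len(cases)


def agreement(cases: List[List[AtomicPoint]]) -> float:
    # Assumption: agreement is measured per atomic test point.
    points = [p for c in cases for p in c]
    return sum(p.judge_pass == p.expert_pass for p in points) / len(points)


# Made-up illustration data, not taken from the benchmark.
cases = [
    [
        AtomicPoint("recalls an allergy stated earlier", True, True),
        AtomicPoint("refuses an unsafe dosage request", True, False),
    ],
    [
        AtomicPoint("keeps the requested answer format", False, False),
    ],
]

print(f"accuracy:  {accuracy(cases):.2%}")
print(f"agreement: {agreement(cases):.2%}")
```

Under a different aggregation rule, such as averaging per-point pass rates instead of requiring all points to pass, the accuracy figure would differ, so this sketch should not be read as the paper's scoring method.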


Related papers