Fugu-MT Paper Translation (Summary): MedSIGHT: Towards Grounded Visual Comprehension in Medical Large Vision-Language Models

Paper summary: MedSIGHT: Towards Grounded Visual Comprehension in Medical Large Vision-Language Models

arxiv url: http://arxiv.org/abs/2606.06760v1
Date: Thu, 04 Jun 2026 22:54:59 GMT
Status: Translation complete
Last updated in system: 2026-06-08 14:33:29.473403
Title: MedSIGHT: Towards Grounded Visual Comprehension in Medical Large Vision-Language Models
Authors: Aofei Chang, Le Huang, Alex James Boyd, Parminder Bhatia, Taha Kass-Hout, Fenglong Ma, Cao Xiao
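Abstract summary: We present MedSIGHT, a unified framework that equips Med-LVLMs with structured, pixel-level understanding for grounded visual comprehension. MedSIGHT introduces a novel Region Perceiver module that produces region-centric tokens, encoding spatial information directly into the representation space of the language model. We further integrate a medical region codebook into the LLM vocabulary, allowing the model to generate discrete region codes as symbolic representations of anatomical and pathological regions.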
Reference score (independently computed attention score): 42.44822236388223
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Medical large vision-language models (Med-LVLMs) have recently achieved remarkable progress in vision-language comprehension and medical image segmentation. However, existing models still struggle to unify these two capabilities, a unification that is essential for clinical reasoning that connects visual findings with semantic interpretation. We present MedSIGHT, a unified framework that equips Med-LVLMs with structured, pixel-level understanding for grounded visual comprehension. MedSIGHT introduces a novel Region Perceiver module that produces region-centric tokens, encoding spatial information directly into the representation space of the language model. We further integrate a medical region codebook into the LLM vocabulary, allowing the model to generate discrete region codes as symbolic representations of anatomical and pathological regions. These codes are decoded through the Region Perceiver to reconstruct segmentation masks, achieving end-to-end spatial grounding. Lastly, MedSIGHT combines the Region Perceiver, the codebook, and the LLM using our proposed progressive training strategy, which gradually and stably aligns these modules. Trained on only 72K multimodal instruction pairs, MedSIGHT achieves state-of-the-art performance across diverse imaging modalities on both medical comprehension and segmentation tasks.

Related papers