Fugu-MT Paper Translation (Abstract): Visual Instruction-Finetuned Language Model for Versatile Brain MR Image Tasks

Paper summary: Visual Instruction-Finetuned Language Model for Versatile Brain MR Image Tasks

arxiv url: http://arxiv.org/abs/2604.02748v1
Date: Fri, 03 Apr 2026 05:39:47 GMT
Status: Translation complete
System update date: 2026-04-06 17:20:24.335737
Title: Visual Instruction-Finetuned Language Model for Versatile Brain MR Image Tasks
Title (reference translation): Visual Instruction-Finetuned Language Model for Versatile Brain MR Image Tasks
Authors: Jonghun Kim, Sinyoung Ra, Hyunjin Park
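Abstract summary: LLaBIT (Large Language Model for Brain Image Translation) extends the visual reasoning of LLMs to clinically meaningful tasks in the brain MRI domain. The method is evaluated on five brain MRI datasets across four distinct tasks. The model not only demonstrated superior performance across all tasks but also outperformed specialized, task-specific models in direct comparisons.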
Reference score (independently calculated attention score): 1.4770902450080214
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: LLMs have demonstrated remarkable capabilities in linguistic reasoning and are increasingly adept at vision-language tasks. The integration of image tokens into transformers has enabled direct visual input and output, advancing research from image-to-text descriptions to text-to-image generation. However, simple text-to-image generation holds limited clinical utility. In medical imaging, tasks such as image segmentation for localizing pathologies or image translation for reconstructing missing sequences have much greater clinical importance. Despite this, integrating these diverse, clinically relevant tasks within a single, versatile language model remains unexplored. Our method, LLaBIT (Large Language Model for Brain Image Translation), extends the visual reasoning of LLMs to these clinically meaningful tasks in the brain MRI domain. To mitigate the spatial information loss inherent in image tokenization, we incorporate a mechanism to reuse feature maps from the image encoder, minimizing data degradation. We also generate text data using LLMs with strict predefined instructions to augment limited image-text paired data in brain MRI. We comprehensively evaluated our method on five brain MRI datasets across four distinct tasks: report generation, visual question answering, image segmentation, and image translation. Our model not only demonstrated superior performance across all tasks but also outperformed specialized, task-specific models in direct comparisons, highlighting its efficacy and versatility.
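Abstract (reference translation): LLMs have shown remarkable capabilities in linguistic reasoning and are increasingly adept at vision-language tasks. Integrating image tokens into transformers has enabled direct visual input and output, advancing research from image-to-text description to text-to-image generation. However, simple text-to-image generation has limited clinical utility. In medical imaging, tasks such as image segmentation for localizing pathologies and image translation for reconstructing missing sequences are of much greater clinical importance. Nevertheless, integrating these diverse, clinically relevant tasks into a single, versatile language model remains unexplored. LLaBIT (Large Language Model for Brain Image Translation) extends the visual reasoning of LLMs to these clinically meaningful tasks in the brain MRI domain. To mitigate the spatial information loss inherent in image tokenization, a mechanism that reuses feature maps from the image encoder is introduced, minimizing data degradation. To augment the limited image-text paired data available for brain MRI, text data is generated using LLMs with strict predefined instructions. The method was comprehensively evaluated on five brain MRI datasets across four tasks (report generation, visual question answering, image segmentation, and image translation). The model not only showed superior performance on all tasks but also outperformed task-specific models in direct comparisons, highlighting its effectiveness and versatility.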

