Fugu-MT Paper Translation (Abstract): Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following

Paper overview: Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following

arxiv url: http://arxiv.org/abs/2603.19482v1
Date: Thu, 19 Mar 2026 21:23:56 GMT
Status: Translation complete
System update date: 2026-03-23 19:48:38.897375
Title: Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following
Authors: Myeongkyun Kang, Soopil Kim, Xiaoxiao Li, Sang Hyun Park
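Abstract summary: This work proposes an instruction-free tuning approach that reduces reliance on handcrafted instructions and uses only image-description pairs for fine-tuning. The approach achieves state-of-the-art accuracy on multiple-choice visual question answering tasks across the SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets.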
Reference score (independently computed attention score): 34.366091321340576
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large vision language models (LVLMs) have demonstrated impressive performance across a wide range of tasks. These capabilities largely stem from visual instruction tuning, which fine-tunes models on datasets consisting of curated image-instruction-output triplets. However, in the medical domain, constructing large-scale, high-quality instruction datasets is particularly challenging due to the need for specialized expert knowledge. To address this issue, we propose an instruction-free tuning approach that reduces reliance on handcrafted instructions, leveraging only image-description pairs for fine-tuning. Specifically, we introduce a momentum proxy instruction as a replacement for curated text instructions, which preserves the instruction-following capability of the pre-trained LVLM while promoting updates to parameters that remain valid during inference. Consequently, the fine-tuned LVLM can flexibly respond to domain-specific instructions, even though explicit instructions are absent during fine-tuning. Additionally, we incorporate a response shuffling strategy to mitigate the model's over-reliance on previous words, facilitating more effective fine-tuning. Our approach achieves state-of-the-art accuracy on multiple-choice visual question answering tasks across SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets, significantly enhancing the fine-tuning efficiency of LVLMs in medical domains.
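The abstract names two mechanisms, a momentum proxy instruction and response shuffling, without giving their details. The sketch below is only a loose illustration under stated assumptions: it treats the proxy as a plain vector updated by an exponential moving average, and it treats shuffling as reordering the sentences of a description. The function names, the `momentum` parameter, and the sentence-level granularity are hypothetical and are not taken from the paper.

```python
import random


def momentum_update(proxy, target, momentum=0.99):
    """Exponential moving average: proxy <- m * proxy + (1 - m) * target.

    Assumption: the proxy is a plain list of floats. The paper's actual
    proxy instruction and update rule are not described in the abstract.
    """
    if len(proxy) != len(target):
        raise ValueError("proxy and target must have the same length")
    return [momentum * p + (1.0 - momentum) * t for p, t in zip(proxy, target)]


def shuffle_response(description, rng):
    """Return the description with its sentences in a random order.

    Assumption: sentences are separated by periods, and shuffling is done
    at sentence level; the paper may shuffle at a different granularity.
    """
    sentences = [s.strip() for s in description.split(".") if s.strip()]
    if not sentences:
        return ""
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."


if __name__ == "__main__":
    rng = random.Random(0)
    print(momentum_update([0.0, 1.0], [1.0, 1.0], momentum=0.9))
    print(shuffle_response("Lesion is raised. Border is irregular. Color is brown.", rng))
```

A seeded `random.Random` instance is passed in explicitly so that the shuffled order is reproducible between runs.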

Related papers